Re: Tidying and simplifying a regular expression
by LanX (Saint) on Dec 08, 2017 at 17:08 UTC
I'm not aware of a Regex::Tidy, but here is an idea for how to go about it:
- use re 'debug' to decompile the regex
- parse op codes into tree structure
- apply rewrite rules for simplification
- rebuild standardized regex from tree
In your case you'd need rules to eliminate the redundant (idempotent) nesting.
Edit: for point 1 compare Parsing and translating Perl Regexes
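A rough sketch of what steps 3 and 4 could look like, assuming steps 1 and 2 have already produced a tree. The node layout (type, flags, quant, kids, text) and the sub names has_alternation, flatten and render are made up for this illustration and do not come from any module. The only rule implemented is: a non-capturing group with the same flags as its parent, no quantifier, and no top-level alternation is redundant and can be spliced into the parent.

```perl
use strict;
use warnings;

# Hypothetical node shapes (illustration only, not produced by any module):
#   { type => 'group', flags => 'i', quant => '', kids => [ ... ] }   # (?^i:...)
#   { type => 'atom',  text  => '.' }                                 # kept verbatim

# True if the group has a '|' atom at its top level.
sub has_alternation {
    my ($group) = @_;
    return scalar grep { $_->{type} eq 'atom' && $_->{text} eq '|' } @{ $group->{kids} };
}

# Rewrite rule: splice a child group into its parent when the wrapper changes nothing.
sub flatten {
    my ($node) = @_;
    return $node unless $node->{type} eq 'group';
    my @kids;
    for my $kid (map { flatten($_) } @{ $node->{kids} }) {
        my $redundant = $kid->{type} eq 'group'
            && $kid->{flags} eq $node->{flags}
            && $kid->{quant} eq ''
            && !has_alternation($kid);
        push @kids, $redundant ? @{ $kid->{kids} } : $kid;
    }
    return +{ %$node, kids => \@kids };
}

# Rebuild the pattern text from the tree.
sub render {
    my ($node) = @_;
    return $node->{text} if $node->{type} eq 'atom';
    my $inner = join '', map { render($_) } @{ $node->{kids} };
    return "(?^$node->{flags}:$inner)$node->{quant}";
}

# Hand-built tree for (?^:(?^:(?^:.).).), standing in for parser output.
my $dot  = { type => 'atom', text => '.' };
my $tree = { type => 'group', flags => '', quant => '', kids => [ $dot ] };
$tree = { type => 'group', flags => '', quant => '', kids => [ $tree, $dot ] } for 1 .. 2;

print render($tree), "\n";            # nested form
print render(flatten($tree)), "\n";   # flattened form
```

A real implementation would need many more rules (capture groups, lookarounds, embedded code), so treat this as a shape for the idea rather than a working tidier.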
Re: Tidying and simplifying a regular expression
by AnomalousMonk (Archbishop) on Dec 08, 2017 at 18:36 UTC
One reason (maybe the only reason) that the stringization of a qr// object comes wrapped in its own little non-capturing group is so that the further interpolation of something like
my $rx = qr{ ... }xms;
my $ry = qr{ ... }xms;
my $rz = qr{ ... }xms;
if ($string =~ m{ \A $rx* $ry+ $rz{2,5} \z }xms) {
...
}
can work intuitively — even the $rz{2,5} bit, surprisingly, although you can only push that one so far.
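To make the point concrete, here is a small sketch (the variable names and sample strings are invented for illustration) contrasting a qr// object with a plain string under a following quantifier. Because the qr// object interpolates as a group, the + applies to all of ab; with the plain string it applies only to the final b.

```perl
use strict;
use warnings;

my $as_regex  = qr{ab};   # interpolates as a group, e.g. (?^:ab)
my $as_string = 'ab';     # interpolates as the bare text ab

my $grouped = qr{ \A $as_regex+ \z }x;       # effectively \A(?^:ab)+\z
my $textual = qr{ \A ${as_string}+ \z }x;    # effectively \Aab+\z

for my $input (qw(abab abbb)) {
    printf "%-5s grouped=%s textual=%s\n", $input,
        ($input =~ $grouped ? 'yes' : 'no'),
        ($input =~ $textual ? 'yes' : 'no');
}
```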
This node notwithstanding, what would one gain in the long run from a "simplified" form for the example given? How often does one build a regex in this way and then try to read it?
Give a man a fish: <%-{-{-{-<
|
> One reason (maybe the only reason) that the stringization of a qr// object comes wrapped in its own little non-capturing group is so that the further interpolation of something like
> ( examples with appended quantifiers )
Not really.
perlre is actually quite explicit about why:
> The caret tells Perl that this cluster doesn't inherit the flags of any surrounding pattern, but uses the system defaults (d-imnsx), modified by any flags specified.
In other words: it's about preserving the flags of the embedded regex and assuming the defaults if none are specified.
Update: demonstration
DB<7> $U=qr/U/ # always upper case
DB<8> $i=qr/i${U}i/i # surrounding case insensitive
DB<9> p $i
(?^ui:i(?^u:U)i)
DB<10> p 'iui' =~ $i
DB<11> p 'iUi' =~ $i
1
DB<12> p 'IUI' =~ $i
1
DB<13> p 'IuI' =~ $i
DB<14> p join "\n", grep { $_ =~ $i } <{i,I}{u,U}{i,I}>
iUi
iUI
IUi
IUI
|
> It's about preserving the flags of the embedded regex ...
Yes, and the reason that is done is, at least in part, to make composition of relatively more complex regexes from simpler qr// components (via interpolation) work "right."
Give a man a fish: <%-{-{-{-<
|
> This node notwithstanding, what would one gain in the long run from a "simplified" form for the example given? How often does one build a regex in this way and then try to read it?
It's not exactly a common use case, otherwise it would have been built in... I can come up with two examples:
- Defining a regular expression constant as a combination of smaller constants;
- Compiling a regex from user data and caching it somewhere.
In both cases trying to print the resulting expression for debugging leads to something incomprehensible that itself needs to be debugged.
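For the first case, a minimal sketch of what that looks like in practice (the constant names are invented for illustration). Each level of composition adds another wrapper, so the printed form grows faster than the pattern it represents.

```perl
use strict;
use warnings;

# Hypothetical constants, each built from the previous one.
my $DIGIT  = qr/\d/;
my $NUMBER = qr/$DIGIT+/;
my $PAIR   = qr/$NUMBER,$NUMBER/;

# Stringifying the final constant shows one non-capturing group per qr// involved.
print "$PAIR\n";

# Count the group openers in the printed form.
my $groups = () = "$PAIR" =~ /\(\?/g;
print "groups: $groups\n";
```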
|
> In both cases trying to print the resulting expression for debugging leads to something incomprehensible that itself needs to be debugged.
It's the other way round: you can trace the past qr compilation steps, which actually helps debugging.
You are complaining about the verbosity of the debugging information, but you have to admit that your example is a very contrived edge case.
AnomalousMonk is right to ask for common cases where this becomes a problem.
I can see the point of a Regex::Tidy, but this alone is not a very convincing incentive.
Re: Tidying and simplifying a regular expression
by tinita (Parson) on Dec 08, 2017 at 19:05 UTC
Because I recently learned about regexp_pattern (see thread), this might work, but it might not handle every case (there are other flags besides "u"):
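use re ();
# $rex is assumed to already hold a compiled regex (a qr// object).
for (1..10) {
    # Split the compiled regex into its bare pattern and its flags.
    my ($pat, $flags) = re::regexp_pattern($rex);
    # With default flags the bare pattern can be reused without another (?^:...) wrapper.
    $rex = ($flags eq "u" or $flags eq "") ? qr{$pat.} : qr{$rex.};
}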
Re: Tidying and simplifying a regular expression (opcode)
by LanX (Saint) on Dec 08, 2017 at 22:42 UTC
Is there really an issue?
When I run your regex and a simplified form through use re 'debug', I get the same regex opcodes:
C:/Perl_64/bin\perl.exe d:/Users/RL/pm/re_tidy.pl
Compiling REx "(?^:(?^:(?^:(?^:(?^:(?^:(?^:(?^:(?^:(?^:(?^:.).).).).).).).)"...
Final program:
1: REG_ANY (2)
2: REG_ANY (3)
3: REG_ANY (4)
4: REG_ANY (5)
5: REG_ANY (6)
6: REG_ANY (7)
7: REG_ANY (8)
8: REG_ANY (9)
9: REG_ANY (10)
10: REG_ANY (11)
11: REG_ANY (12)
12: END (0)
minlen 11
(?^:(?^:(?^:(?^:(?^:(?^:(?^:(?^:(?^:(?^:(?^:(?^:.).).).).).).).).).).)) at d:/Users/RL/pm/re_tidy.pl line 27.
Freeing REx: "(?^:(?^:(?^:(?^:(?^:(?^:(?^:(?^:(?^:(?^:(?^:.).).).).).).).)"...
Compiling REx "(?^:...........)" # simplified: 11 dots in a row
Final program:
1: REG_ANY (2)
2: REG_ANY (3)
3: REG_ANY (4)
4: REG_ANY (5)
5: REG_ANY (6)
6: REG_ANY (7)
7: REG_ANY (8)
8: REG_ANY (9)
9: REG_ANY (10)
10: REG_ANY (11)
11: REG_ANY (12)
12: END (0)
minlen 11
(?^:(?^:...........)) at d:/Users/RL/pm/re_tidy.pl line 27.
Freeing REx: "(?^:...........)"
Compilation finished at Fri Dec 8 23:40:00
Apparently, while the stringification may differ, the resulting compiled program is identical.
Re: Tidying and simplifying a regular expression
by ikegami (Patriarch) on Dec 12, 2017 at 17:21 UTC
> It's relatively easy to spot that it's just a (?:..........)
It could just as easily be two `(?:...)` back to back. Therefore, tidying this up requires a complete regexp parser. And since Perl regex patterns can contain arbitrary Perl code, you also need a complete Perl parser if the pattern includes `(?{...})` or `(??{...})`.
So no, it's immensely hard to spot that it's just `(?:...)`.
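To illustrate why a purely textual approach falls over, here is a deliberately naive sketch (the sub name naive_unwrap is made up) that strips a leading (?^: and a trailing ). It works on a single group but produces an invalid pattern when the string is really two groups back to back.

```perl
use strict;
use warnings;

# Deliberately naive: assumes the whole string is one (?^:...) group.
sub naive_unwrap {
    my ($text) = @_;
    $text =~ s/\A\(\?\^:(.*)\)\z/$1/s;
    return $text;
}

my $single = naive_unwrap('(?^:ab)');        # one group: unwraps cleanly
my $double = naive_unwrap('(?^:a)(?^:b)');   # two groups: the parentheses no longer pair up

for my $candidate ($single, $double) {
    # Try to compile the result; a broken pattern makes qr// die inside the eval.
    my $ok = eval { my $compiled = qr/$candidate/; 1 };
    printf "%-10s %s\n", $candidate, $ok ? 'compiles' : 'does not compile';
}
```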
It might be easier to address the problem at the source, which means replacing
my $augmented_pattern = "$re";
...
my $re = qr/$augmented_pattern/;
with
my ($pattern, $mods) = re::regexp_pattern($re);
...
my $re = eval("qr/\$pattern/$mods") or die $@;
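Put together as something runnable, the second variant might look like the sketch below. The sub name append_to_regex and the driving loop are scaffolding added for illustration; only re::regexp_pattern and the eval line come from the snippet above. The string eval is tolerable here only because $mods comes straight from re::regexp_pattern; it should never be fed untrusted text.

```perl
use strict;
use warnings;
use re ();

# Hypothetical helper: append $suffix to a compiled regex without adding
# another (?^:...) wrapper around the existing pattern.
sub append_to_regex {
    my ($re, $suffix) = @_;
    my ($pattern, $mods) = re::regexp_pattern($re);   # bare pattern and flag letters
    $pattern .= $suffix;
    # $pattern is interpolated when the eval'd code runs; $mods is spliced into the source.
    my $new = eval "qr/\$pattern/$mods" or die $@;
    return $new;
}

my $re = qr/./;
$re = append_to_regex($re, '.') for 1 .. 10;
print "$re\n";   # a single group around eleven dots
```

Note that appending text to a bare pattern with a top-level alternation changes its meaning (a|b followed by . is not the same as (?:a|b) followed by .), so this only suits patterns where that cannot happen.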
|
You are right, "easy to spot" was a bit of an exaggeration. However, most of the time (like 90%) the augmented pattern is really not going to contain much more than alternations, concatenations, quantifiers, and capture groups. So one could get away with much less than a full-blown Perl parser.