Tidying and simplifying a regular expression

Dallaylaen has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Tidying and simplifying a regular expression by LanX (Saint) on Dec 08, 2017 at 17:08 UTC
I'm not aware of a `Regex::Tidy`, but I have an idea of how to go about it:

1. `use re 'debug'` to decompile the regex
2. parse the op codes into a tree structure
3. apply rewrite rules for simplification
4. rebuild a standardized regex from the tree

In your case you'd need rules to eliminate idempotence.

Edit: for point 1 compare Parsing and translating Perl Regexes

Cheers Rolf
(addicted to the Perl Programming Language and ☆☆☆☆ :) Wikisyntax for the Monastery
Re: Tidying and simplifying a regular expression by AnomalousMonk (Archbishop) on Dec 08, 2017 at 18:36 UTC
One reason (maybe the only reason) that the stringization of a `qr//` object comes wrapped in its own little non-capturing group is so that the further interpolation of something like

    my $rx = qr{ ... }xms;
    my $ry = qr{ ... }xms;
    my $rz = qr{ ... }xms;

    if ($string =~ m{ \A $rx* $ry+ $rz{2,5} \z }xms) { ... }

can work intuitively (even the `$rz{2,5}` bit, surprisingly, although you can only push that one so far).

This node notwithstanding, what would one gain in the long run from a "simplified" form for the example given? How often does one build a regex in this way and then try to read it?

Give a man a fish: `<%-{-{-{-<`
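A minimal sketch of the quantifier point above, using made-up patterns (`$rx` and `$plain` are illustrative names, not from the thread). It compares interpolating a `qr//` object with interpolating the same text as a plain string:

```perl
use strict;
use warnings;

my $rx    = qr{ ab }xms;   # compiled regex, stringifies inside its own group
my $plain = 'ab';          # same text as a plain string, no group

# With the qr// object the * applies to the whole group "ab".
my $with_group = 'ababab' =~ m{ \A $rx* \z }xms ? 1 : 0;

# With the plain string the pattern becomes "ab*", so * applies only to "b".
my $without_group = 'ababab' =~ m{ \A $plain* \z }xms ? 1 : 0;

print "qr object: $with_group, plain string: $without_group\n";
```

The first match should succeed and the second should fail, which is the "intuitive" behaviour the wrapping group buys.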
Re^2: Tidying and simplifying a regular expression (flags) by LanX (Saint) on Dec 10, 2017 at 02:13 UTC
> One reason (maybe the only reason) that the stringization of a qr// object comes wrapped in its own little non-capturing group is so that the further interpolation of something like
> ( examples with appended quantifiers )

Not really. `perlre` is actually quite explicit about the why:

> > The caret tells Perl that this cluster doesn't inherit the flags of any surrounding pattern, but uses the system defaults (`d-imnsx`), modified by any flags specified.

In other words: it's about preserving the flags of the embedded regex and assuming the defaults if none are specified.

Update: demonstration

    DB<7> $U=qr/U/ # always upper case
    DB<8> $i=qr/i${U}i/i # surrounding case insensitive
    DB<9> p $i
    (?^ui:i(?^u:U)i)
    DB<10> p 'iui' =~ $i
    DB<11> p 'iUi' =~ $i
    1
    DB<12> p 'IUI' =~ $i
    1
    DB<13> p 'IuI' =~ $i
    DB<14> p join "\n", grep { $_ =~ $i } <{i,I}{u,U}{i,I}>
    iUi
    iUI
    IUi
    IUI

Cheers Rolf
Re^3: Tidying and simplifying a regular expression by AnomalousMonk (Archbishop) on Dec 10, 2017 at 05:02 UTC
> It's about preserving the flags of the embedded regex ...

Yes, and the reason that is done is, at least in part, to make composition of relatively more complex regexes from simpler `qr//` components (via interpolation) work "right."

Give a man a fish: `<%-{-{-{-<`
Re^4: Tidying and simplifying a regular expression (interpolation) by LanX (Saint) on Dec 10, 2017 at 22:22 UTC
Re^2: Tidying and simplifying a regular expression by Dallaylaen (Chaplain) on Dec 09, 2017 at 07:48 UTC
> This node notwithstanding, what would one gain in the long run from a "simplified" form for the example given? How often does one build a regex in this way and then try to read it?

Not exactly a common use case, otherwise it would have been built in... I can come up with two examples:

- Defining a regular expression constant as a combination of smaller constants;
- Compiling a regex from user data and caching it somewhere.

In both cases, trying to print the resulting expression for debugging leads to something incomprehensible that itself needs to be debugged.
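To illustrate the first case, here is a small sketch with hypothetical constants (`$word`, `$pair` and `$list` are invented for this example). Each level of composition adds another wrapper group to the stringified result:

```perl
use strict;
use warnings;

# Small building blocks combined into a larger pattern.
my $word = qr/\w+/;
my $pair = qr/$word=$word/;
my $list = qr/$pair(?:,$pair)*/;

# Printing the result for debugging shows every nested wrapper group.
print "$list\n";

# The composed regex still matches as intended.
print "matches\n" if 'a=1,b=2' =~ /\A$list\z/;
```

The printed pattern is correct but already much noisier than the three short definitions it was built from.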
Re^3: Tidying and simplifying a regular expression (debugging) by LanX (Saint) on Dec 09, 2017 at 14:41 UTC
> In both cases trying to print the resulting expression for debugging leads to something incomprehensible that itself needs to be debugged.

It's the other way round: you can trace the past `qr` compilation steps, which actually helps debugging.

You are complaining about the verbosity of debugging information, but you have to admit that your example is a very contrived edge case.

AnomalousMonk is right to ask for common cases where this becomes a problem.

I can see the point of a `Regex::Tidy`, but this alone is not a very convincing incentive.

Cheers Rolf
Re: Tidying and simplifying a regular expression by tinita (Parson) on Dec 08, 2017 at 19:05 UTC
Because I recently learned about `regexp_pattern` (see thread), this might work, but it might not handle every case (there are other flags besides "u"):

    use re ();
    for (1..10) {
        my ($pat, $flags) = re::regexp_pattern($rex);
        $rex = ($flags eq "u" or $flags eq "") ? qr{$pat.} : qr{$rex.};
    }
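For readers who have not used it, a short sketch of what `re::regexp_pattern` returns in list context: the pattern without its outer wrapper group, plus the flags as a separate string. The variables `$inner` and `$outer` are made up for the illustration:

```perl
use strict;
use warnings;
use re ();

my $inner = qr/./;
my $outer = qr/$inner./;   # interpolation nests $inner inside $outer

# List context: (pattern without the outer wrapper, flag string).
my ($pat, $flags) = re::regexp_pattern($outer);

print "stringified: $outer\n";
print "pattern:     $pat\n";
print "flags:       '$flags'\n";
```

Note that only the outermost wrapper is separated out. Any groups already nested inside the pattern stay there, which is why the loop above has to unwrap at every step.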
Re: Tidying and simplifying a regular expression (opcode) by LanX (Saint) on Dec 08, 2017 at 22:42 UTC
Is there really an issue?

When I run your regex and a simplified form through `use re 'debug'` I'm getting the same regex opcodes:

    C:/Perl_64/bin\perl.exe d:/Users/RL/pm/re_tidy.pl
    Compiling REx "(?^:(?^:(?^:(?^:(?^:(?^:(?^:(?^:(?^:(?^:(?^:.).).).).).).).)"...
    Final program:
       1: REG_ANY (2)
       2: REG_ANY (3)
       3: REG_ANY (4)
       4: REG_ANY (5)
       5: REG_ANY (6)
       6: REG_ANY (7)
       7: REG_ANY (8)
       8: REG_ANY (9)
       9: REG_ANY (10)
      10: REG_ANY (11)
      11: REG_ANY (12)
      12: END (0)
    minlen 11
    (?^:(?^:(?^:(?^:(?^:(?^:(?^:(?^:(?^:(?^:(?^:(?^:.).).).).).).).).).).)) at d:/Users/RL/pm/re_tidy.pl line 27.
    Freeing REx: "(?^:(?^:(?^:(?^:(?^:(?^:(?^:(?^:(?^:(?^:(?^:.).).).).).).).)"...
    Compiling REx "(?^:...........)" # simplified: 11 dots in a row
    Final program:
       1: REG_ANY (2)
       2: REG_ANY (3)
       3: REG_ANY (4)
       4: REG_ANY (5)
       5: REG_ANY (6)
       6: REG_ANY (7)
       7: REG_ANY (8)
       8: REG_ANY (9)
       9: REG_ANY (10)
      10: REG_ANY (11)
      11: REG_ANY (12)
      12: END (0)
    minlen 11
    (?^:(?^:...........)) at d:/Users/RL/pm/re_tidy.pl line 27.
    Freeing REx: "(?^:...........)"

    Compilation finished at Fri Dec 8 23:40:00

Apparently, while the stringification may differ, the resulting code is identical.

Cheers Rolf
Re: Tidying and simplifying a regular expression by ikegami (Patriarch) on Dec 12, 2017 at 17:21 UTC
> It's relatively easy to spot that it's just a (?:..........)

It could just as easily be two `(?:...)` back to back. Therefore, tidying this up requires a complete regexp parser. And since Perl regex patterns can contain arbitrary Perl code, you also need a complete Perl parser if the pattern includes `(?{...})` or `(??{...})`. So no, it's immensely hard to spot that it's just `(?:...)`.

It might be easier to address the problem at the source, which means replacing

    my $augmented_pattern = "$re";
    ...
    my $re = qr/$augmented_pattern/;

with

    my ($pattern, $mods) = re::regexp_pattern($re);
    ...
    my $re = eval("qr/\$pattern/$mods") or die $@;
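A runnable sketch of that suggestion, under the assumption that the "augmentation" is simply appending text to the pattern. The helper name `append_to_regex` is hypothetical and only serves the example:

```perl
use strict;
use warnings;
use re ();

# Hypothetical helper: append $suffix to a compiled regex without
# wrapping the old regex in yet another group.
sub append_to_regex {
    my ($re, $suffix) = @_;
    my ($pattern, $mods) = re::regexp_pattern($re);
    $pattern .= $suffix;
    # String eval so the original flags can be reapplied literally;
    # $pattern itself is interpolated by the regex, not by the eval.
    my $new = eval "qr/\$pattern/$mods" or die $@;
    return $new;
}

my $re = qr/./;
$re = append_to_regex($re, '.') for 1 .. 10;

print "$re\n";   # a single group around the dots, no nesting
```

Because the pattern is unwrapped before each rebuild, the final regex carries one wrapper group instead of eleven. The string `eval` only ever sees the flag string, which comes from Perl itself, so the pattern text is never evaluated as code by it.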
Re^2: Tidying and simplifying a regular expression by Dallaylaen (Chaplain) on Dec 13, 2017 at 07:13 UTC
You are right, "easy to spot" was a bit of an exaggeration. However, most of the time (like 90%) the augmented pattern is really not going to contain much more than or's, concatenations, multipliers, and capture groups. So one could get away with much less than a full-blown Perl parser.

