Tidying and simplifying a regular expression

by Dallaylaen (Chaplain)
on Dec 08, 2017 at 17:00 UTC

Dallaylaen has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks and nuns,

I'm just wondering if there is a module or recipe to strip a regular expression of meaningless grouping. Consider the following code:

bash$ perl -wle 'my $rex = qr/./; $rex = qr/$rex./ for 1..10; print $rex;'
(?^:(?^:(?^:(?^:(?^:(?^:(?^:(?^:(?^:(?^:(?^:.).).).).).).).).).).)

It's relatively easy to spot that it's just a (?:...........), i.e. eleven dots in a row; however, the expression is not stringified like that. Is there a way to tidy it up automatically?

Inspired by this node, but I think it would be nice to have a simplifier anyway...

Replies are listed 'Best First'.
Re: Tidying and simplifying a regular expression
by LanX (Saint) on Dec 08, 2017 at 17:08 UTC
    I'm not aware of a Regex::Tidy, but I have an idea of how to go about it:

    1. use re 'debug' to decompile the regex
    2. parse op codes into tree structure
    3. apply rewrite rules for simplification
    4. rebuild standardized regex from tree

    In your case you'd need rules that eliminate idempotent (redundant) groupings.
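A single rewrite rule of that kind can be sketched directly on the stringified pattern. Note that this is not the opcode-based approach outlined above but a much cruder string-level stand-in, and `tidy_nested` is a hypothetical helper, not an existing module. It assumes the input is a stringified qr// object, and it refuses to touch anything containing escapes, character classes, or groups other than a plain `(?^:`.

```perl
use strict;
use warnings;

# Hypothetical helper (not an existing module): conservatively unwrap
# nested "(?^:...)" groups in a stringified qr// object.
sub tidy_nested {
    my ($s) = @_;

    # Bail out on anything this sketch cannot reason about: escapes,
    # character classes, or any group that is not a plain "(?^:".
    return $s if $s =~ /[\\\[]/;
    return $s if $s =~ /\((?!\?\^:)/;

    # Rewrite rule: an inner "(?^:X)" may be replaced by "X" when X has no
    # nested group or alternation and no quantifier follows the group.
    # (?!\A) keeps the outermost wrapper in place.
    1 while $s =~ s/(?!\A)\(\?\^:([^()|]*)\)(?![*+?{])/$1/;

    return $s;
}

my $rex = qr/./;
$rex = qr/$rex./ for 1 .. 10;
print tidy_nested("$rex"), "\n";
```

Groups that carry an alternation or a trailing quantifier are left alone, because removing them would change what the pattern matches.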

    Edit: for point 1 compare Parsing and translating Perl Regexes

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Wikisyntax for the Monastery

Re: Tidying and simplifying a regular expression
by AnomalousMonk (Archbishop) on Dec 08, 2017 at 18:36 UTC

    One reason (maybe the only reason) that the stringization of a  qr// object comes wrapped in its own little non-capturing group is so that the further interpolation of something like

    my $rx = qr{ ... }xms;
    my $ry = qr{ ... }xms;
    my $rz = qr{ ... }xms;
    if ($string =~ m{ \A $rx* $ry+ $rz{2,5} \z }xms) { ... }
    can work intuitively — even the  $rz{2,5} bit, surprisingly, although you can only push that one so far.
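A small sketch of the difference the wrapper makes when a quantifier follows the interpolated variable. The variable names and sample strings here are made up for illustration.

```perl
use strict;
use warnings;

my $rx  = qr/ab/;   # compiled: stringifies as (?^:ab)
my $str = 'ab';     # plain string with the same characters

# The quantifier applies to the whole (?^:ab) group.
print "qr//:   ", ( 'ababab' =~ /\A$rx*\z/  ? "match" : "no match" ), "\n";

# Here the pattern becomes /\Aab*\z/, so * binds to "b" only.
print "string: ", ( 'ababab' =~ /\A$str*\z/ ? "match" : "no match" ), "\n";
```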

    This node notwithstanding, what would one gain in the long run from a "simplified" form for the example given? How often does one build a regex in this way and then try to read it?


    Give a man a fish:  <%-{-{-{-<

      > One reason (maybe the only reason) that the stringization of a  qr//  object comes wrapped in its own little non-capturing group is so that the further interpolation of something like

      > ( examples with appended quantifiers )

      Not really.

      perlre is actually quite explicit about the why

      > > > The caret tells Perl that this cluster doesn't inherit the flags of any surrounding pattern, but uses the system defaults (d-imnsx), modified by any flags specified.

      In other words: it's about preserving the flags of the embedded regex, and assuming the defaults where none are specified.

      Update: demonstration

      DB<7> $U=qr/U/ # always upper case
      DB<8> $i=qr/i${U}i/i # surrounding case insensitive
      DB<9> p $i
      (?^ui:i(?^u:U)i)
      DB<10> p 'iui' =~ $i
      DB<11> p 'iUi' =~ $i
      1
      DB<12> p 'IUI' =~ $i
      1
      DB<13> p 'IuI' =~ $i
      DB<14> p join "\n", grep { $_ =~ $i } <{i,I}{u,U}{i,I}>
      iUi
      iUI
      IUi
      IUI

      Cheers Rolf
      (addicted to the Perl Programming Language and ☆☆☆☆ :)
      Wikisyntax for the Monastery

        It's about preserving the flags of the embedded regex ...

        Yes, and the reason that is done is, at least in part, to make composition of relatively more complex regexes from simpler  qr// components (via interpolation) work "right."


        Give a man a fish:  <%-{-{-{-<

      This node notwithstanding, what would one gain in the long run from a "simplified" form for the example given? How often does one build a regex in this way and then try to read it?

      It's not exactly a common use case, or it would have been built in already... Still, I can come up with two examples:

      • Defining a regular expression constant as a combination of smaller constants;
      • Compiling a regex from user data and caching it somewhere.

      In both cases trying to print the resulting expression for debugging leads to something incomprehensible that itself needs to be debugged.

        > In both cases trying to print the resulting expression for debugging leads to something incomprehensible that itself needs to be debugged.

        It's the other way round, you can trace the past qr compilation steps, which actually helps debugging.

        You are complaining about the verbosity of the debugging information, but you have to admit that your example is a rather contrived edge case.

        AnomalousMonk is right to ask for common cases where this becomes a problem.

        I can see the point of a Regex::Tidy, but this alone is not a very convincing incentive.

        Cheers Rolf
        (addicted to the Perl Programming Language and ☆☆☆☆ :)
        Wikisyntax for the Monastery

Re: Tidying and simplifying a regular expression
by tinita (Parson) on Dec 08, 2017 at 19:05 UTC
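    Because I recently learned about regexp_pattern (see thread), this might work, but it might not handle every case (there are other flags besides "u"):

    use re ();
    my $rex = qr/./;    # starting pattern from the question
    for (1..10) {
        # split the compiled regex into its bare pattern and its flags
        my ($pat, $flags) = re::regexp_pattern($rex);
        # only reuse the bare pattern when no flags would be lost
        $rex = ($flags eq "u" or $flags eq "") ? qr{$pat.} : qr{$rex.};
    }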
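For reference, a minimal sketch of what `re::regexp_pattern` hands back in list and in scalar context. The sample pattern `qr/a.b/i` is made up for illustration.

```perl
use strict;
use warnings;
use re ();

my $re = qr/a.b/i;

# List context: the bare pattern and its flags, separately.
my ($pat, $flags) = re::regexp_pattern($re);
print "pattern=$pat flags=$flags\n";

# Scalar context: the same text that stringifying the object gives.
my $str = re::regexp_pattern($re);
print "stringified=$str\n";
```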
Re: Tidying and simplifying a regular expression (opcode)
by LanX (Saint) on Dec 08, 2017 at 22:42 UTC
    Is there really an issue?

    When I run your regex and a simplified form through use re 'debug' I'm getting the same Regex-opcodes:

    C:/Perl_64/bin\perl.exe d:/Users/RL/pm/re_tidy.pl
    Compiling REx "(?^:(?^:(?^:(?^:(?^:(?^:(?^:(?^:(?^:(?^:(?^:.).).).).).).).)"...
    Final program:
       1: REG_ANY (2)
       2: REG_ANY (3)
       3: REG_ANY (4)
       4: REG_ANY (5)
       5: REG_ANY (6)
       6: REG_ANY (7)
       7: REG_ANY (8)
       8: REG_ANY (9)
       9: REG_ANY (10)
      10: REG_ANY (11)
      11: REG_ANY (12)
      12: END (0)
    minlen 11
    (?^:(?^:(?^:(?^:(?^:(?^:(?^:(?^:(?^:(?^:(?^:(?^:.).).).).).).).).).).)) at d:/Users/RL/pm/re_tidy.pl line 27.
    Freeing REx: "(?^:(?^:(?^:(?^:(?^:(?^:(?^:(?^:(?^:(?^:(?^:.).).).).).).).)"...
    Compiling REx "(?^:...........)" # simplified: 11 dots in a row
    Final program:
       1: REG_ANY (2)
       2: REG_ANY (3)
       3: REG_ANY (4)
       4: REG_ANY (5)
       5: REG_ANY (6)
       6: REG_ANY (7)
       7: REG_ANY (8)
       8: REG_ANY (9)
       9: REG_ANY (10)
      10: REG_ANY (11)
      11: REG_ANY (12)
      12: END (0)
    minlen 11
    (?^:(?^:...........)) at d:/Users/RL/pm/re_tidy.pl line 27.
    Freeing REx: "(?^:...........)"

    Compilation finished at Fri Dec 8 23:40:00

    Apparently while the stringification may differ, the resulting code is identical.
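Short of comparing opcodes, a cheap behavioural sanity check is to run both forms against the same inputs. This is only a sketch: the test strings (runs of `x` of length 0 to 15) are an arbitrary choice, and agreement on them is evidence, not proof, of equivalence.

```perl
use strict;
use warnings;

my $nested = qr/./;
$nested = qr/$nested./ for 1 .. 10;   # the question's 11-level nesting

my $dots = '.' x 11;
my $flat = qr/$dots/;                 # hand-simplified form

# Compare both patterns on strings of length 0..15.
my $differences = 0;
for my $len ( 0 .. 15 ) {
    my $s        = 'x' x $len;
    my $m_nested = $s =~ /\A$nested\z/ ? 1 : 0;
    my $m_flat   = $s =~ /\A$flat\z/   ? 1 : 0;
    $differences++ if $m_nested != $m_flat;
}
print "differences: $differences\n";
```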

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Wikisyntax for the Monastery

Re: Tidying and simplifying a regular expression
by ikegami (Patriarch) on Dec 12, 2017 at 17:21 UTC

    It's relatively easy to spot that it's just a (?:...........)

    It could just as easily be two `(?:...)` back to back. Therefore, tidying this up requires a complete regexp parser. And since Perl regex patterns can contain arbitrary Perl code, you also need a complete Perl parser if the pattern includes `(?{...})` or `(??{...})`.

    So no, it's immensely hard to spot that it's just `(?:...)`.
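A small sketch of why the wrapper cannot be dropped blindly: once the embedded pattern contains an alternation, removing the group changes what matches. The patterns and sample strings are made up for illustration.

```perl
use strict;
use warnings;

my $alt = qr/a|b/;                 # stringifies as (?^:a|b)

my $kept    = qr/\A(?:${alt}c)\z/; # wrapper kept: (a or b), then c
my $dropped = qr/\A(?:a|bc)\z/;    # wrapper naively removed: a, or bc

for my $s (qw(ac bc a)) {
    printf "%-2s kept=%d dropped=%d\n", $s,
        ( $s =~ $kept    ? 1 : 0 ),
        ( $s =~ $dropped ? 1 : 0 );
}
```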

    It might be easier to address the problem at the source, which means replacing

    my $augmented_pattern = "$re";
    ...
    my $re = qr/$augmented_pattern/;

    with

    my ($pattern, $mods) = re::regexp_pattern($re);
    ...
    my $re = eval("qr/\$pattern/$mods") or die $@;
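One possible way to fill in the elided middle step, as a sketch only: `extend_re` and its `$suffix` argument are hypothetical names, and plain string concatenation is just one example of "augmenting" the pattern.

```perl
use strict;
use warnings;
use re ();

# Hypothetical helper: append $suffix to the pattern of $re and recompile
# with the original flags, without adding another (?^...) wrapper.
sub extend_re {
    my ($re, $suffix) = @_;
    my ($pattern, $mods) = re::regexp_pattern($re);
    $pattern .= $suffix;
    # The eval'd source is "qr/$pattern/" followed by the flags in $mods;
    # $pattern itself is interpolated when that code runs.
    my $new = eval("qr/\$pattern/$mods") or die $@;
    return $new;
}

my $re = extend_re( qr/a/i, 'b' );
print "$re\n";
```

Appending to the bare pattern is only safe when it has no top-level alternation; for a pattern such as `a|b` the suffix would attach to the last branch only.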

      You are right, "easy to spot" was a bit of an exaggeration. However, most of the time (like 90%) the augmented pattern is really not going to contain much more than alternations, concatenations, quantifiers, and capture groups. So one could get away with much less than a full-blown Perl parser.
