Beefy Boxes and Bandwidth Generously Provided by pair Networks
Clear questions and runnable code
get the best and fastest answer
 
PerlMonks  

Re^2: [RFC] Building Regex Alternations Dynamically (updated x2)

by haukex (Archbishop)
on Jan 19, 2017 at 15:18 UTC ( [id://1179921]=note: print w/replies, xml ) Need Help??


in reply to Re: [RFC] Building Regex Alternations Dynamically
in thread Building Regex Alternations Dynamically

Hi Ken,

Thank you very much for your thoughtful reply!

With any sort of tutorial, those reading it — to learn about the subject, rather than for reviewing, proof-reading, etc. — probably start with limited knowledge. Accordingly, any terms used should be unambiguous; unfortunately, you've used $regex to mean two different things

Excellent point. I've renamed the variables to disambiguate (I kept the names shorter though), and I've added a link to perlretut.

In points (4) & (5), in the first list, you show grouping. To resolve the same issue in both, you use explicit capture grouping in (4), and implicit non-capture grouping in (5).

Another excellent point, I've switched to using the non-capturing groups in all the examples.

I'd suggest adding explicit non-capturing grouping to $regex_base_str (or whatever you call it) as part of the normal technique.

That is a very good point, but I haven't made the change yet because I need to think on it a bit more. On the one hand, I think that adding an extra (?:...) makes the generated regex look a little more complex than it needs to be (qr/(?:a|b)/ eq "(?^:(?:a|b))" and on older Perls qr/(?:a|b)/ eq "(?-xism:(?:a|b))"), and also it makes the code to generate the string less elegant (my $regex_str = '(?:'.join('|', map ... ).')'). But those are just stylistic concerns and you're right that it would eliminate the pitfall that I have to discuss at length in points 4 and 5.

The other potential solution, which I'm currently leaning towards, is to skip the intermediate string variable, like in Haarg's post here: my ($regex) = map {qr/$_/} join '|', map .... I like this latter approach better because it's more robust (no string for the user to potentially misuse), but it does add one more "trick" that has to be explained to the beginner. Currently I feel that the advantages of that outweigh the disadvantages...

Update 2, 2017-05-14: This node used to contain a draft, which I've now incorporated into the root node. I wanted to preserve the original text here:

my @values = qw/ a ab. d ef def g|h /; my $regex_str = join '|', # 4. map {quotemeta} # 3. sort { length $b <=> length $a } # 2. @values; # 1. my $regex = qr/$regex_str/; # 5. print "$regex\n"; # 6.
  1. Then, we join the strings into one long string using the regex alternation operator |. If you want to use this string without the qr// of step 5, note this potential pitfall: For example, if your input is qw/a b c/, then at this point your string will look like $regex_str="a|b|c". Then, saying /^$regex_str$/ will be interpolated to /^a|b|c$/, which means "match a only at the beginning of the string, or b anywhere in the string, or c only at the end of the string", which is probably not what you meant, you probably meant /^(?:a|b|c)$/, that is /^(?:$regex_str)$/!
  2. Finally, we compile the regular expression using qr//. This is not strictly necessary, you could just interpolate the string you've just created into a regex, but I prefer to turn them into regex objects explicitly. It also has the advantages that you can apply modifiers such as /i to the regex in a (IMO) more natural way, and that qr// implicitly adds a non-capturing group (?:...) around the regex, which takes care of the problem described in step 4 above.
  3. When we print the regular expression, we see that it has become this:
    (?^:ab\.|def|g\|h|ef|a|d)
    You can now use this precompiled regular expression anywhere, as explained in Compiling and saving regular expressions and perlop, such as if ($input=~$regex) { ... } or while ($input=~/$regex/g) { ... }.

Thanks,
-- Hauke D

Replies are listed 'Best First'.
Re^3: [RFC] [OT] Building Regex Alternations Dynamically (updated)
by AnomalousMonk (Archbishop) on Jan 19, 2017 at 20:34 UTC

    (The following is perhaps a bit OT to the main thread, or else already touched upon in an update. Oh, well...)

    ... I think that adding an extra (?:...) makes the generated regex look a little more complex than it needs to be ... those are just stylistic concerns ...

    Please see Re: Recognizing 3 and 4 digit number and thereunder for a long discussion between myself and kcott on these "stylistic concerns." Personally, I still don't see the need for the extra explicit  (?:...) wrap step. The implicit wrap becomes explicit quick enough if you print the stringized Regexp object, and this feature of a Regexp object should be deeply understood from the moment one begins to use them.

    ... skip the intermediate string variable, like in Haarg's post here: my ($regex) = map {qr/$_/} join '|', map .... ... it's more robust ...

    To continue the previous point, I feel it's important to get a regex into a Regexp form as quickly as possible: no dilly-dallying. Once objectified, it can be used atomically when composing more complex regex expressions:

    my $rx = qr{ ... }xms; ... $string =~ m{ $rx* $rx+? $rx{3,4} }xms and do_something();
    (Of course, this compositional capability is also addressed by the  (DEFINE) predicate of the  (?(condition)yes-pattern) conditional expression of Perl 5.10+.)

    The only situation which I'm aware of in which this compositional atomicity breaks down is for something like

    my ($n, $m) = (3, 4); ... $string =~ m{ $rx{$n,$m} }xms and do_something();
    where  $rx{$n} $rx{$n,} $rx{$n,$m} are all taken as hash element accesses. This can be fixed simply by an explicit layer of non-capturing group wrapping (entirely necessary here!):
        (?:$rx){$n} (?:$rx){$n,} (?:$rx){$n,$m}


    Give a man a fish:  <%-{-{-{-<

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://1179921]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others perusing the Monastery: (5)
As of 2024-04-19 07:34 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found