comment on

Thank you very much for your thoughtful reply!

With any sort of tutorial, those reading it — to learn about the subject, rather than for reviewing, proof-reading, etc. — probably start with limited knowledge. Accordingly, any terms used should be unambiguous; unfortunately, you've used $regex to mean two different things

Excellent point. I've renamed the variables to disambiguate (I kept the names shorter though), and I've added a link to perlretut.

In points (4) & (5), in the first list, you show grouping. To resolve the same issue in both, you use explicit capture grouping in (4), and implicit non-capture grouping in (5).

Another excellent point, I've switched to using the non-capturing groups in all the examples.

I'd suggest adding explicit non-capturing grouping to $regex_base_str (or whatever you call it) as part of the normal technique.

That is a very good point, but I haven't made the change yet because I need to think on it a bit more. On the one hand, I think that adding an extra (?:...) makes the generated regex look a little more complex than it needs to be (qr/(?:a|b)/ eq "(?^:(?:a|b))" and on older Perls qr/(?:a|b)/ eq "(?-xism:(?:a|b))"), and also it makes the code to generate the string less elegant (my $regex_str = '(?:'.join('|', map ... ).')'). But those are just stylistic concerns and you're right that it would eliminate the pitfall that I have to discuss at length in points 4 and 5.

The other potential solution, which I'm currently leaning towards, is to skip the intermediate string variable, like in Haarg's post here: my ($regex) = map {qr/$_/} join '|', map .... I like this latter approach better because it's more robust (no string for the user to potentially misuse), but it does add one more "trick" that has to be explained to the beginner. Currently I feel that the advantages of that outweigh the disadvantages...

Update 2, 2017-05-14: This node used to contain a draft, which I've now incorporated into the root node. I wanted to preserve the original text here:

my @values = qw/ a ab. d ef def g|h /;
my $regex_str = join '|',             # 4.
    map {quotemeta}                   # 3.
    sort { length $b <=> length $a }  # 2.
    @values;                          # 1.
my $regex = qr/$regex_str/;           # 5.
print "$regex\n";                     # 6.
[download]

Then, we join the strings into one long string using the regex alternation operator |. If you want to use this string without the qr// of step 5, note this potential pitfall: For example, if your input is qw/a b c/, then at this point your string will look like $regex_str="a|b|c". Then, saying /^$regex_str$/ will be interpolated to /^a|b|c$/, which means "match a only at the beginning of the string, or b anywhere in the string, or c only at the end of the string", which is probably not what you meant, you probably meant /^(?:a|b|c)$/, that is /^(?:$regex_str)$/!
Finally, we compile the regular expression using qr//. This is not strictly necessary, you could just interpolate the string you've just created into a regex, but I prefer to turn them into regex objects explicitly. It also has the advantages that you can apply modifiers such as /i to the regex in a (IMO) more natural way, and that qr// implicitly adds a non-capturing group (?:...) around the regex, which takes care of the problem described in step 4 above.
When we print the regular expression, we see that it has become this:
```
(?^:ab\.|def|g\|h|ef|a|d)
[download]
```
You can now use this precompiled regular expression anywhere, as explained in Compiling and saving regular expressions and perlop, such as if ($input=~$regex) { ... } or while ($input=~/$regex/g) { ... }.

Thanks,
-- Hauke D

In reply to Re^2: [RFC] Building Regex Alternations Dynamically (updated x2) by haukex
in thread Building Regex Alternations Dynamically by haukex

Are you posting in the right place? Check out Where do I post X? to know for sure.
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
Want more info? How to link or How to display code and escape characters are good places to start.


No such thing as a small change
	PerlMonks