Building Regex Alternations Dynamically

TL;DR: The two code samples below are working pieces of code that can be copied into your Perl script and adapted for your purposes. A compact version is: my ($regex) = map {qr/$_/} join '|', map {quotemeta} sort { length $b <=> length $a or $a cmp $b } keys %map;

I thought it might be useful to explain the technique of building regular expressions dynamically from a set of strings. Let's say you have a list of strings, like ('abc.', 'd$ef', 'gh|i') (that's a $ character, not a variable), and you want to build a regex that matches any of them, like /(?:abc\.|d\$ef|gh\|i)/ - note how the special characters are escaped with backslashes so that they lose their special meanings and are matched literally (more on this below). This also works well with s/search/replacement/ if you have a hash where the keys are the search strings and the values are the replacements, as I'll show below. If you're uncertain about some of the regex concepts used here, like alternations a|b and non-capturing groups (?:...), I recommend perlretut.

First, the basic code, which I explain below - note the numbering on the lines of code.

```
my @values = qw/ a ab. d ef def g|h /;
my ($regex) = map { qr/$_/ }          # 5.
    join '|',                         # 4.
    map {quotemeta}                   # 3.
    sort { length $b <=> length $a }  # 2.
    @values;                          # 1.
print "$regex\n";                     # 6.
```

1. We begin with the list of strings stored in the array @values. This could be any list, such as a literal qw/.../, or return values from functions, including keys or values.
2. We sort the list so that the longer strings appear first. This is necessary because the regex engine tries the alternatives from left to right and uses the first one that matches at the current position. If we didn't sort and our regular expression was /foo|foobar/, then applied to the string "foobarfoofoobar", it would only match "foo" three times, and never "foobar". But if the regex is /foobar|foo/, then it correctly matches "foobar", "foo", and again "foobar".
3. Next, we apply the quotemeta function to each string, which escapes any metacharacters that might have special meaning in a regex, such as . (dot, matches anything), | (alternation operator), or $ (anchor to end of line/string). In our example, we want the string "g|h" to be matched literally, and not to mean "match g or h". Unescaped metacharacters can also break the syntax of the regex, for example a stray opening parenthesis. Note that quotemeta is the same as using \Q...\E in a regex. As discussed here, you should only drop \Q...\E or quotemeta if you explicitly want metacharacters in your input strings to be special, the strings come from a trusted source, and you are certain that they don't contain any characters that would break your regular expression or expose security holes!
4. Then, we join the strings into one long string with the regex alternation operator | in between each string. The string returned by join in the example looks like this: ab\.|def|g\|h|ef|a|d
5. This step compiles the regular expression using qr//. If you want to add modifiers such as /i (case-insensitive matching), this would be the place to do it, as in qr/$_/i. This line of code needs a bit of explanation: join from the previous step returns a single string, and so the map evaluates its code block { qr/$_/ } once, with $_ being the string returned by join. The parentheses in my ($regex) = are required so that map returns the value from its code block (map in "list context"), instead of a count of the values (map in "scalar context") (for a trick on how to avoid the parentheses, see here). Context in Perl is a topic for another tutorial. Please note that if you want to add extra things to match in this qr//, then you most likely will want to write (?:$_) - the reason for this will be explained below. For example, if you want to apply the "word boundary" \b, you need to write qr/\b(?:$_)\b/.
6. When we print the regular expression, we see that it has become this:
```
(?^:ab\.|def|g\|h|ef|a|d)
```
You can now use this precompiled regular expression anywhere, as explained in Compiling and saving regular expressions and perlop, such as:
```
if ($input=~$regex) {
    print "It matches!\n";
}
# or
while ($input=~/($regex)/g) {
    print "Found string: $1\n";
}
```
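Using the same kind of global match, the effect of the sort in step 2 can be checked directly. The following is only a small sketch and not part of the code above; the helper name matches_for is made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: build an alternation from the given strings,
# IN THE ORDER GIVEN, and return every substring it matches in $input.
sub matches_for {
    my ($input, @alternatives) = @_;
    my $alternation = join '|', map {quotemeta} @alternatives;
    my $regex = qr/$alternation/;
    my @found;
    while ($input =~ /($regex)/g) {
        push @found, $1;
    }
    return @found;
}

my $input = 'foobarfoofoobar';

# Shorter string first: the regex is effectively /foo|foobar/
my @short_first = matches_for($input, qw/ foo foobar /);

# Longer string first, as in step 2: effectively /foobar|foo/
my @long_first = matches_for($input,
    sort { length $b <=> length $a } qw/ foo foobar /);

print "short first: @short_first\n";
print "long first:  @long_first\n";
```

With the shorter alternative first, this should report "foo" three times; with the longer one first, it should report "foobar", "foo", "foobar", as described in step 2.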
Note that the qr// operator has implicitly added a non-capturing group (?:...) around the regular expression. This is important when you want to use the regular expression we've just built as part of a larger expression. For example, if your input strings are qw/a b c/ and you write /^$regex$/, then what you probably meant is /^(?:a|b|c)$/. If the non-capturing group wasn't there, then the regex would look like this: /^a|b|c$/, which means "match a only at the beginning of the string, or b anywhere in the string, or c only at the end of the string", which is probably not what you meant! (In the previous step, the same problem can happen, but you're responsible for adding the (?:...) around the $_ yourself, because at that point, $_ is just a plain string, and not yet a precompiled regular expression.)
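To make the difference concrete, here is a minimal sketch that compares the three ways of anchoring the alternation. The variable names are invented for this example, and the input strings are just the qw/a b c/ from the paragraph above.

```perl
use strict;
use warnings;

my @values = qw/ a b c /;

# Plain string, not yet a regex: a|b|c
my $alternation = join '|', map {quotemeta} @values;

# qr// wraps the pattern in a group, so anchors apply to the whole thing
my $compiled = qr/$alternation/;
my $anchored = qr/^$compiled$/;          # behaves like /^(?:a|b|c)$/

# Interpolating the plain string: the anchors bind to "a" and "c" only
my $ungrouped = qr/^$alternation$/;      # behaves like /^a|b|c$/

# Interpolating the plain string with a hand-written group
my $grouped = qr/^(?:$alternation)$/;    # behaves like /^(?:a|b|c)$/

for my $input (qw/ b xbx ax /) {
    printf "%-4s anchored=%-3s ungrouped=%-3s grouped=%-3s\n",
        $input,
        ($input =~ $anchored  ? 'yes' : 'no'),
        ($input =~ $ungrouped ? 'yes' : 'no'),
        ($input =~ $grouped   ? 'yes' : 'no');
}
```

Only the ungrouped version should accept "xbx" and "ax", because there the ^ and $ each belong to a single alternative instead of the whole alternation.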

Search and Replace Using a Hash

```
my %map = ( a=>1, ab=>23, cd=>45 );   # 1.
my ($regex) = map { qr/$_/ }          # 2.
    join '|', map {quotemeta}
    sort { length $b <=> length $a
           or $a cmp $b }             # 3.
    keys %map;
print "$regex\n";                     # 4.
# Now, use the regex
my @strings = qw/ abcd aacd abaab /;  # 5.
for (@strings) {
    my $before = $_;
    s/($regex)/$map{$1}/g;            # 6.
    print "$before -> $_\n";          # 7.
}
```

1. This is the hash in which the keys are the search strings, and the values are the replacements. As above, this can come from any source.
2. This code to build the regex is mostly the same as the above, with the following difference:
3. Instead of only sorting by length, this sort first sorts by length, and then sorts values of the same length with a stringwise comparison. While not strictly necessary, I would recommend this because hashes are unordered, meaning that without it the alternatives in your regex could appear in a different order across different runs of the program. Sorting the hash keys like this causes the regex to be in the same order in every run of the program.
4. We print the regex for debugging, and see that it looks like this: (?^:ab|cd|a)
5. These @strings are the test strings we will apply the regular expression against.
6. This is the search and replace operation that matches the keys of the hash and uses the corresponding hash value as the replacement. Note that the /g modifier is not strictly required (s///g will replace all matches in the string, not just the first), and you can adapt this regex any way you like. So for example, to only make one replacement anchored at the beginning of the string, you can say s/^($regex)/$map{$1}/;.

The output of the code is:

```
abcd -> 2345
aacd -> 1145
abaab -> 23123
```
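The variations on the s/// mentioned in step 6 can be compared side by side. This is a sketch only: the %result hash and the second test string "xabcd" are additions for this illustration and not part of the code above.

```perl
use strict;
use warnings;

my %map = ( a => 1, ab => 23, cd => 45 );
my ($regex) = map { qr/$_/ }
    join '|', map {quotemeta}
    sort { length $b <=> length $a or $a cmp $b }
    keys %map;

my %result;   # illustration only: collects the results per input string
for my $string (qw/ abcd xabcd /) {
    # Work on copies so that the three variants don't affect each other
    (my $all      = $string) =~ s/($regex)/$map{$1}/g;   # every match
    (my $first    = $string) =~ s/($regex)/$map{$1}/;    # first match only
    (my $anchored = $string) =~ s/^($regex)/$map{$1}/;   # only at the start
    $result{$string} = { all => $all, first => $first, anchored => $anchored };
    print "$string: all=$all first=$first anchored=$anchored\n";
}
```

For "xabcd", the anchored variant should leave the string unchanged, because none of the keys matches at the very beginning of the string.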

Thank you to all those who replied to this post as well as this one, in particular thanks to kcott, LanX, AnomalousMonk, and Haarg, whose suggestions ended up in the above!

Hope this helps,
-- Hauke D

Updates: 2017-05-14: Merged in the draft text I previously had in this node, made several updates to the text, and removed the "RFC" tag from the title. 2019-05-01: Updated first section regarding $_ in qr// (points 5 and 6), and updated TL;DR with a bit of code.
