perlmeditation
haukex
<p><small><b>TL;DR:</b> The two code samples below are working pieces of code that can be copied into your Perl script and adapted for your purposes. A compact version is: <c>my ($regex) = map {qr/$_/} join '|', map {quotemeta} sort { length $b <=> length $a or $a cmp $b } keys %map;</c></small></p>
<p>I thought it might be useful to explain the technique of building regular expressions dynamically from a set of strings. Let's say you have a list of strings, like <c>('abc.', 'd$ef', 'gh|i')</c> (that's a <c>$</c> character, not a variable), and you want to build a regex that matches any of them, like <c>/(?:abc\.|d\$ef|gh\|i)/</c> - note how the special characters are escaped with backslashes so they lose their special meanings and will be matched <i>literally</i> (more on this below). This also works well with <c>s/search/replacement/</c> if you have a hash where the keys are the search strings and the values are the replacements, as I'll show [href://#search_replace_with_hash|below]. If you're unsure about some of the regex concepts used here, like [doc://perlretut#Matching-this-or-that|alternations] <c>a|b</c> and [doc://perlretut#Non-capturing-groupings|non-capturing groups] <c>(?:...)</c>, I recommend [doc://perlretut].</p>
<p>First, the basic code, which I explain below - note the numbering on the lines of code.</p>
<c>
my @values = qw/ a ab. d ef def g|h /;
my ($regex) = map { qr/$_/ }            # 5.
    join '|',                           # 4.
    map {quotemeta}                     # 3.
    sort { length $b <=> length $a }    # 2.
    @values;                            # 1.
print "$regex\n";                       # 6.
</c>
<ol>
<li>We begin with the list of strings stored in the array <c>@values</c>. This could be any list, such as a literal [doc://perlop#qw/_STRING_/|<c>qw/.../</c>], or return values from functions, including [doc://keys] or [doc://values].</li>
<li>We [doc://sort] the list so that the longer strings appear first. This is necessary because if we didn't do this and our regular expression was <c>/foo|foobar/</c>, then applied to the string <c>"foobarfoofoobar"</c>, it would only match <c>"foo"</c> three times, and <i>never</i> <c>"foobar"</c>. But if the regex is <c>/foobar|foo/</c>, then it would correctly match <c>"foobar"</c>, <c>"foo"</c>, and again <c>"foobar"</c>.</li>
<li>Next, we apply the [doc://quotemeta] function to each string, which escapes any metacharacters that might have special meaning in a regex, such as <c>.</c> (dot, matches any character except a newline by default), <c>|</c> (alternation operator), or <c>$</c> (anchor to end of line/string). In our example, we want the string <c>"g|h"</c> to be matched <i>literally</i>, and not to mean "match <c>g</c> or <c>h</c>". Unescaped metacharacters can also break the syntax of the regex, for example a stray opening parenthesis. Note that <c>quotemeta</c> has the same effect as <c>\Q...\E</c> in a regex. As discussed [id://1156540|here], you should <i>only</i> drop <c>\Q...\E</c> or <c>quotemeta</c> if you explicitly want metacharacters in your input strings to be special, the strings come from a trusted source, and you are <i>certain</i> that they don't contain any characters that would break your regular expression or expose security holes!</li>
<li>Then, we [doc://join] the strings into one long string with the regex alternation operator <c>|</c> in between each string. The string returned by <c>join</c> in the example looks like this: <c>ab\.|def|g\|h|ef|a|d</c></li>
<li>This step compiles the regular expression using [doc://perlretut#Compiling-and-saving-regular-expressions|<c>qr//</c>]. If you want to add [doc://perlre#Modifiers|modifiers] such as <c>/i</c> (case-insensitive matching), this would be the place to do it, as in <c>qr/$_/i</c>. This line of code needs a bit of explanation: [doc://join] from the previous step will return a single string, and so the [doc://map] will evaluate its code block <c>{ qr/$_/ }</c> once, with <c>$_</c> being the string returned by <c>join</c>. The parentheses in <c>my ($regex) =</c> are required so that [doc://map] will return the value from its code block (<c>map</c> in "list context"), instead of a count of the values (<c>map</c> in "scalar context") <small>(for a trick on how to avoid the parentheses, see [id://11116849|here])</small>. Context in Perl is a topic for [id://738558|another] tutorial. Please note that if you want to add extra things to match in this <c>qr//</c>, then you most likely will want to write <c>(?:$_)</c> - the reason for this will be explained below. For example, if you want to apply the "word boundary" <c>\b</c>, you need to write <c>qr/\b(?:$_)\b/</c>.</li>
<li><p>When we print the regular expression, we see that it has become this:</p>
<c>
(?^:ab\.|def|g\|h|ef|a|d)
</c>
<p>You can now use this precompiled regular expression anywhere, as explained in [doc://perlretut#Compiling-and-saving-regular-expressions] and [doc://perlop#qr/_STRING_/msixpodualn|perlop], such as:</p>
<c>
if ($input =~ $regex) {
    print "It matches!\n";
}
# or
while ($input =~ /($regex)/g) {
    print "Found string: $1\n";
}
</c>
<p>Note that the <c>qr//</c> operator has implicitly added a non-capturing group <c>(?:...)</c> around the regular expression. This is important when you want to use the regular expression we've just built as part of a larger expression. For example, if your input strings are <c>qw/a b c/</c> and you write <c>/^$regex$/</c>, then what you probably meant is <c>/^(?:a|b|c)$/</c>. If the non-capturing group wasn't there, then the regex would look like this: <c>/^a|b|c$/</c>, which means "match <c>a</c> only at the beginning of the string, or <c>b</c> anywhere in the string, or <c>c</c> only at the end of the string", which is probably not what you meant! (In the previous step, the same problem can happen, but you're responsible for adding the <c>(?:...)</c> around the <c>$_</c> yourself, because at that point, <c>$_</c> is just a plain string, and not yet a precompiled regular expression.)</p>
</li>
</ol>
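<p>To make steps 2 and 6 above more concrete, here is a small sketch of both pitfalls. It reuses the <c>"foobarfoofoobar"</c> and <c>qw/a b c/</c> examples from the text; the variable names (<c>@unsorted</c>, <c>@sorted</c>, <c>$plain</c>, <c>$compiled</c>) are only illustrative and not part of the technique itself.</p>
<c>
use strict;
use warnings;

my $input = "foobarfoofoobar";

# Shorter alternative first: "foo" always wins, "foobar" is never matched.
my @unsorted = $input =~ /(foo|foobar)/g;
print "@unsorted\n";    # foo foo foo

# Longest string first, built the same way as in the code above.
my ($regex) = map { qr/$_/ }
    join '|', map {quotemeta}
    sort { length $b <=> length $a } qw/ foo foobar /;
my @sorted = $input =~ /($regex)/g;
print "@sorted\n";      # foobar foo foobar

# Anchoring: qr// supplies the (?:...) group, a plain string does not.
my $plain    = join '|', qw/ a b c /;    # just the string "a|b|c"
my $compiled = qr/$plain/;               # a compiled regex, grouped
for my $str (qw/ b xbx /) {
    my $with_group    = $str =~ /^$compiled$/ ? "yes" : "no";
    my $without_group = $str =~ /^$plain$/    ? "yes" : "no";
    print "$str: with group=$with_group, without group=$without_group\n";
}
</c>
<p>The string <c>"xbx"</c> should be rejected by the grouped version but accepted by the ungrouped one, because <c>/^a|b|c$/</c> lets <c>b</c> match anywhere in the string.</p>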
<h3><a name="search_replace_with_hash"></a>Search and Replace Using a Hash</h3>
<c>
my %map = ( a=>1, ab=>23, cd=>45 );     # 1.
my ($regex) = map { qr/$_/ }            # 2.
    join '|', map {quotemeta}
    sort { length $b <=> length $a
        or $a cmp $b }                  # 3.
    keys %map;
print "$regex\n";                       # 4.
# Now, use the regex
my @strings = qw/ abcd aacd abaab /;    # 5.
for (@strings) {
    my $before = $_;
    s/($regex)/$map{$1}/g;              # 6.
    print "$before -> $_\n";            # 7.
}
</c>
<ol>
<li>This is the hash in which the keys are the search strings, and the values are the replacements. As above, this can come from any source.</li>
<li>This code to build the regex is mostly the same as the above, with the following difference:</li>
<li>Instead of only sorting by length, this [doc://sort] first sorts by length, and then sorts strings of the same length with a stringwise comparison. While not strictly necessary, I would recommend this because hashes are unordered, meaning that the alternatives in your regex may come out in a different order from one run of the program to the next. Sorting the hash keys like this causes the regex to be in the same order in every run of the program.</li>
<li>We print the regex for debugging, and see that it looks like this: <c>(?^:ab|cd|a)</c></li>
<li>These <c>@strings</c> are the test strings we will apply the regular expression against.</li>
<li>This is the search and replace operation: it matches any key of the hash and replaces it with the corresponding value from the hash. The <c>/g</c> modifier makes <c>s///</c> replace all matches in the string instead of just the first, so drop it if you want only one replacement. You can adapt this regex any way you like; for example, to make a single replacement anchored at the beginning of the string, you can say <c>s/^($regex)/$map{$1}/;</c>.</li>
<li>The output of the code is:
<c>
abcd -> 2345
aacd -> 1145
abaab -> 23123
</c>
</li>
</ol>
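<p>As a sketch of the anchored variant mentioned in step 6, the following applies <c>s/^($regex)/$map{$1}/</c> to the same test strings. It assumes Perl 5.14 or newer for the <c>/r</c> modifier, which returns the modified copy instead of changing the original; the <c>@results</c> array is just a hypothetical place to collect the output.</p>
<c>
use strict;
use warnings;

my %map = ( a => 1, ab => 23, cd => 45 );
my ($regex) = map { qr/$_/ }
    join '|', map {quotemeta}
    sort { length $b <=> length $a or $a cmp $b }
    keys %map;

my @results;    # hypothetical collector, for illustration only
for my $string (qw/ abcd aacd abaab /) {
    # No /g and anchored with ^: at most one replacement, at the start.
    # /r (Perl 5.14+) returns the new string and leaves $string untouched.
    my $after = $string =~ s/^($regex)/$map{$1}/r;
    push @results, $after;
    print "$string -> $after\n";
}
# Prints:
# abcd -> 23cd
# aacd -> 1acd
# abaab -> 23aab
</c>
<p>Only the leading key is replaced in each string, and the rest is left as it was.</p>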
<p><b>Thank you</b> to all those who replied to this post as well as [id://1179847|this one], in particular thanks to [kcott], [LanX], [AnomalousMonk], and [Haarg], whose suggestions ended up in the above!</p>
<p>Hope this helps,<br>-- Hauke D</p>
<p><small><i>Updates:</i> 2017-05-14: Merged in the draft text I previously had in [id://1179921|this] node, made several updates to the text, and removed the "RFC" tag from the title. 2019-05-01: Updated first section regarding <c>$_</c> in <c>qr//</c> (points 5 and 6), and updated TL;DR with a bit of code.</small></p>