Is it safe to use external strings for regexes?

stevieb has asked for the wisdom of the Perl Monks concerning the following question:

Hey there fellow Monks, at my work, we've got a whole long list of regexes for parsing and organizing various information. These regexes are in the dozens, and are scattered across several scripts and libraries. What I'd like to do is store all of these regexes along with their mapping data in a database, so that review and maintenance of these mappings is easier.

My question is whether this is safe to do or not. If so, could you please share any potential unsafe examples?

I've drummed up a quick test scenario to ensure the building of regexes from strings gathered externally does seem to work ok:

String regex file:

a.*z
^\d+$
^\d{4}[AZ]\d$

Test script:

use warnings;
use strict;

use Test::More;

# Retrieve regexes from a text file (or database) as strings, regexify them,
# then use them in code

my $re_file = 'regexes.txt';

open my $fh, '<', $re_file or die "Can't open $re_file: $!";

my $strings = strings();

my $i = 1;

while (my $str_re = <$fh>) {
    chomp $str_re;

    my $re = qr/$str_re/;

    for (@{ $strings->{$i}{match} }) {
        is $_ =~ $re, 1, "$_ matches $str_re ok";
    }
    for (@{ $strings->{$i}{nomatch} }) {
        is $_ =~ $re, '', "$_ doesn't match $str_re ok";
    }

    $i++;
}

done_testing;

sub strings {
    return {
        1 => {
            match   => [
                qw(
                    a123z
                    az
                    a!$@Zz
                ),
            ],
            nomatch => [
                qw(
                    Az
                    aZ
                    a213Z
                    99
                )
            ],
        },
        2 => {
            match   => [
                qw(
                    1
                    9999
                    6472323432
                ),
            ],
            nomatch => [
                qw(
                    a1
                    1a
                    1!
                    aaaa
                )
            ],
        },
        3 => {
            match   => [
                qw(
                    2021Z1
                    2021A1
                ),
            ],
            nomatch => [
                qw(
                    A9
                    123A9
                    1234a9
                    12349
                    1234A99999999
                    1234AZ9
                )
            ],
        },
    };
}


Replies are listed 'Best First'.
Re: Is it safe to use external strings for regexes? by LanX (Saint) on Oct 06, 2021 at 14:02 UTC
> My question is whether this is safe to do or not

I'm not sure whether you are asking if your code is safe or if foreign regexes are safe. In the latter case, there are three issues I'm aware of:

1. code injection by string interpolation, like `/@{[ do_evil() ]}/`
2. code injection by regex, like `/(?{ do_evil() })/`
3. exponential-time regexes with excessive backtracking, something like `/((x))*/` IIRC ²

The first two cases might be solved by introspection/blacklisting of regex ops first, the latter probably only by experimenting with a hard limit on runtime.

NB: it's even possible to "hide" a BEGIN block inside a regex. We had this discussion about 10 years ago, I'll update a link. °

Edit: We have regularly had similar discussions over the years, you might want to Super Search the archives.

Cheers Rolf
(addicted to the Perl Programming Language :)

updates
°) here --> Re: Vulnerabilities when editing untrusted code... (Komodo)
²) more at regex-explosive-quantifiers
Re^2: Is it safe to use external strings for regexes? by dave_the_m (Monsignor) on Oct 07, 2021 at 08:23 UTC
> In the latter case, there are three issues I'm aware of
> 1. code injection by string interpolation, like /@{[ do_evil() ]}/
> 2. code injection by regex, like /(?{ do_evil() })/
> 3. exponential time regexes with excessive backtracking, something like /((x))*/ IIRC

String interpolation of variables only happens for literal regexes in the source code. So if the pattern is read from a file or database, this isn't an issue.

Embedded code within a pattern is only allowed within the scope of `use re 'eval'`; otherwise trying to compile such a regex from a string will die at run time.

The third one is a genuine issue, in terms of both CPU and memory usage.

Dave.
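The second point above can be checked with a few lines. This is only a minimal sketch: the pattern string and the `$ran` flag are made up for illustration, and it assumes a Perl where no `use re 'eval'` is in scope.

```perl
use strict;
use warnings;

our $ran = 0;    # would be set to 1 only if the embedded code block ever ran

# Hypothetical pattern string, standing in for a row read from a file or DB.
my $from_db = '(?{ $main::ran = 1 })abc';

# No `use re 'eval'` is in scope here, so compiling the string should die.
my $re  = eval { qr/$from_db/ };
my $err = $@;

if (defined $re) {
    print "compiled (unexpected)\n";
}
else {
    print "rejected: $err";
}
print "embedded code ran: ", ($ran ? "yes" : "no"), "\n";
```

Wrapping the `qr//` in `eval { }` turns the run-time death into an error message that can be logged per pattern.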
Re^3: Is it safe to use external strings for regexes? by LanX (Saint) on Oct 07, 2021 at 13:38 UTC
> So if the pattern is read from a file or database this isn't an issue.

As I said, "in the latter case" of general vulnerabilities, these are some issues to be aware of. The OP said

> > These regexes are in the dozens, and are scattered across several scripts and libraries.
> > maintenance of these mappings is easier.

I doubt the general case can be solved with a DB of simple strings. Maintainable regexes are composed of smaller ones by interpolation and dynamic compilation, which brings us back to the start.

> is only allowed within the scope of use re 'eval';

With "newer" Perls, yes. I noticed that you changed it around 2013, and I am thankful for that. *

> The third one is a genuine issue, in terms of both CPU and memory usage.

Well, some regex engines sometimes optimize better than Perl's. I remember a demo of a case with nested quantifiers where Unix grep did very well and Perl waited for the end of times.

This could be eased by analyzing the regex for potential traps like those listed here and warning accordingly. This analysis could be done by parsing the compilation output of `re 'debug';` °

But again, this could open the door for those general vulnerabilities, that's why I prefer to point to them.

Cheers Rolf

°) For completeness: TheDamian published a static parser for Perl regexes; I can't tell how closely it incorporates new features.
*) Some IDEs run `perl -c` by default when they open a Perl file. Sending a trojan script with an evil BEGIN block will execute instantly after opening. And obfuscation with Acme::EyeDrops will still allow hiding the evil logic in a regex; one just needs to add `use re 'eval';` for newer Perls.
Re^4: Is it safe to use external strings for regexes? by dave_the_m (Monsignor) on Oct 07, 2021 at 15:26 UTC
Re^5: Is it safe to use external strings for regexes? by LanX (Saint) on Oct 07, 2021 at 20:50 UTC
Re^2: Is it safe to use external strings for regexes? (use Safe) by LanX (Saint) on Oct 06, 2021 at 16:21 UTC
FWIW: there is the Safe module to disallow certain opcodes inside a (r)eval.

    use Safe;
    $compartment = new Safe;
    $compartment->permit(qw(time sort :browse));
    $result = $compartment->reval($unsafe_code);

Unfortunately I couldn't find a way to disable compile-time blocks like `BEGIN`, and there doesn't seem to be another way to disable or `override` BEGIN... I'd love to be corrected.

UPDATE: oh, Keyword::Simple could do the trick :)

Cheers Rolf
Re^3: Is it safe to use external strings for regexes? (Keyword::Simple) by LanX (Saint) on Oct 06, 2021 at 17:06 UTC
It's indeed possible to bend the parser so that it thinks BEGIN and family are subs:

    use strict;
    use warnings;

    use Keyword::Simple;

    sub no_begin ($&){
        warn "no_begin(@_) called";
    }

    my @code;

    BEGIN{
        my @compile_blocks = qw(BEGIN UNITCHECK CHECK INIT END);

        for my $block (@compile_blocks) {
            # bend parser
            Keyword::Simple::define $block, sub {
                my ($ref) = @_;
                substr($$ref, 0, 0) = "no_begin '$block', sub";
            };

            # test code
            push @code , <<__CODE__;
    $block { die "owened by $block" }
    __CODE__
        }
    }

    BEGIN { die "owened by BEGIN" };
    UNITCHECK { die "owened by UNITCHECK" };
    CHECK { die "owened by CHECK" };
    INIT { die "owened by INIT" };
    END { die "owened by END" };

    eval join "\n", @code;

Output:

    -*- mode: compilation; default-directory: "d:/tmp/pm/" -*-
    Compilation started at Wed Oct 6 19:04:01

    C:/Strawberry/perl/bin\perl.exe -w d:/tmp/pm/KW_simple_regex_BEGIN.pl
    no_begin(BEGIN CODE(0x694268)) called at d:/tmp/pm/KW_simple_regex_BEGIN.pl line 6.
    no_begin(UNITCHECK CODE(0x6556d0)) called at d:/tmp/pm/KW_simple_regex_BEGIN.pl line 6.
    no_begin(CHECK CODE(0x6942b0)) called at d:/tmp/pm/KW_simple_regex_BEGIN.pl line 6.
    no_begin(INIT CODE(0x6c8aa0)) called at d:/tmp/pm/KW_simple_regex_BEGIN.pl line 6.
    no_begin(END CODE(0x6c8c38)) called at d:/tmp/pm/KW_simple_regex_BEGIN.pl line 6.
    Bareword found where operator expected at (eval 5) line 5, near "} CHECK"
        (Missing operator before CHECK?)

But unfortunately, eval-ing the code does not catch parsing errors anymore... (the reason here: BEGIN{} blocks don't need a trailing semicolon)

So the answer is: yes, BEGIN* blocks can be disabled. But this is best done in an extra process.

Cheers Rolf
Re^2: Is it safe to use external strings for regexes? by Anonymous Monk on Oct 06, 2021 at 22:50 UTC
Use taint or no https://perldoc.perl.org/re#'eval'-mode
Re^3: Is it safe to use external strings for regexes? by perlfan (Vicar) on Oct 13, 2021 at 20:08 UTC
I also took the meat of this question to be about accepting user input, then throwing that into a regex - only to say, don't trust user input directly - as always. There's only one mention of taint in this whole thread, and I am replying to it. :-)
Re: Is it safe to use external strings for regexes? by Corion (Patriarch) on Oct 06, 2021 at 13:37 UTC
Depending on how nasty your users are, allowing arbitrary regular expressions is an unwise choice. The following regex is valid but will use up lots of CPU:

    "aaaaaaaaaaaaaaaaaaaaaaaaaaaaa" =~ /aaa*b/

If you can come up with a whitelist of allowed regexes, that would improve things, or maybe consider running the regex search as a time-limited subprocess.
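A time-limited subprocess could look roughly like the sketch below. It assumes a Unix-like system with a real `fork`, and `run_with_timeout` is a hypothetical helper name, not something from a module. The match runs in a child because Perl only looks at pending signals between opcodes, so an `alarm` in the same process may not interrupt a single long-running match.

```perl
use strict;
use warnings;
use POSIX ();

# Run $code in a forked child and give up after $timeout seconds.
# Returns 1 or 0 (the truthiness of $code's result), or undef on
# timeout or if the child ended abnormally.
sub run_with_timeout {
    my ($timeout, $code) = @_;

    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {
        # Child: report the result via the exit status, skipping END blocks.
        POSIX::_exit($code->() ? 0 : 1);
    }

    # Parent: wait for the child, but let SIGALRM interrupt the wait.
    my $status = eval {
        local $SIG{ALRM} = sub { die "timeout\n" };
        alarm $timeout;
        waitpid $pid, 0;
        alarm 0;
        $?;
    };

    if (!defined $status) {
        alarm 0;
        kill 'KILL', $pid;
        waitpid $pid, 0;    # reap the killed child
        return undef;
    }
    return 1 if $status == 0;         # child exited with 0: matched
    return 0 if $status == 1 << 8;    # child exited with 1: no match
    return undef;                     # anything else: abnormal end
}

my $re      = qr/^\d+$/;
my $matched = run_with_timeout(2, sub { '9999' =~ $re });
print defined $matched ? "matched: $matched\n" : "timed out\n";
```

Only a yes/no answer comes back through the exit status; returning captures would need a pipe or similar, which is left out here.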
Re^2: Is it safe to use external strings for regexes? by stevieb (Canon) on Oct 06, 2021 at 13:44 UTC
Thanks Corion, that's a good point. The regexes will only be added/edited by seasoned programmers, but I do know that many people who think they know regexes really don't. I can definitely add in some checks in conjunction with our existing review processes, but I mostly like the idea of time-limited subprocesses to handle the actual work (which can alert if something takes too long).
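One such check could be to compile every stored string at load time and collect the failures instead of dying on the first one. A minimal sketch, assuming the strings have already been fetched from the file or database; `compile_patterns` is a hypothetical helper name and the last pattern is deliberately broken for the demonstration:

```perl
use strict;
use warnings;

# Compile each stored pattern string; collect failures instead of dying.
# Returns two hash refs: pattern => compiled regex, pattern => error text.
sub compile_patterns {
    my (@strings) = @_;
    my (%compiled, %errors);

    for my $str (@strings) {
        my $re = eval { qr/$str/ };
        if (defined $re) {
            $compiled{$str} = $re;
        }
        else {
            $errors{$str} = $@;
        }
    }
    return (\%compiled, \%errors);
}

# The first three strings are the ones from the original post.
my ($ok, $bad) = compile_patterns('a.*z', '^\d+$', '^\d{4}[AZ]\d$', '^\d{4}[AZ');

print "usable:   $_\n" for sort keys %$ok;
print "rejected: $_\n" for sort keys %$bad;
```

This only catches patterns that fail to compile. A pattern that compiles but backtracks badly passes this check, so it complements a runtime limit rather than replacing it.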
Re^3: Is it safe to use external strings for regexes? by Fletch (Bishop) on Oct 06, 2021 at 17:26 UTC
Cloudflare found out a couple of years ago that even "seasoned programmers" can shoot themselves in the foot. It was discussed here in "Cloudflare blames PCRE for outage" and in a blog post at Cloudflare.

The cake is a lie.
Re^4: Is it safe to use external strings for regexes? by stevieb (Canon) on Oct 06, 2021 at 19:40 UTC
Re: Is it safe to use external strings for regexes? by LanX (Saint) on Oct 06, 2021 at 18:08 UTC
Here is a way you could go to counter the problems listed and explained here:

1. code injection by string interpolation, like `/@{[ do_evil() ]}/`
2. code injection by regex, like `/(?{ do_evil() })/`
3. exponential-time regexes with excessive backtracking, something like `/((x))*/` IIRC

This will compile a regex into an anonymous sub without executing it:

    use re qw(debug);
    my $sub = eval "sub { m/$evil_re/ }";

The `re` `debug` pragma will emit the regex opcodes for the regexes involved to STDERR:

    Final program:
       1: EVAL (4)
       4: EXACT <\n> (6)
       6: END (0)

The `1: EVAL` here tells you that an EVAL was involved, which you need to reject; you don't want embedded Perl code like

    $evil_re = "(?{ BEGIN { do_evil() } })";

- With Keyword::Simple disabling BEGIN, END, etc., you won't risk that compiling the sub inside the eval will run any code (see here).
- With `Safe` you'll be able to additionally disable a bunch of external commands (see here).

For this to work you need to spawn an external command for each regex and capture STDERR; you can use this to also limit the maximal runtime.

Since your code looks a lot like a test suite, you might want to use the TAP protocol anyway.

NB: No guarantees whatsoever!

HTH! :)

Cheers Rolf
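Capturing the `re 'debug'` output from a separate process might be sketched as follows. This is a variation on the idea above, not the same method: the pattern is handed to the child as an argument and compiled with `qr//`, so the untrusted text is never interpolated into Perl source. `regex_opcodes` is a hypothetical helper name, and the parsing assumes the "Final program:" layout shown above.

```perl
use strict;
use warnings;
use IPC::Open3 qw(open3);
use Symbol qw(gensym);

# Compile $pattern in a separate perl with `re 'debug'` enabled and return
# the opcode names listed under "Final program:" on the child's STDERR.
# An empty list means no program was dumped (e.g. the pattern did not compile).
sub regex_opcodes {
    my ($pattern) = @_;

    my $err = gensym;    # open3 needs a pre-made handle for STDERR
    my $pid = open3(my $in, my $out, $err,
        $^X, '-Mre=debug', '-e', 'my $p = shift; qr/$p/', $pattern);
    close $in;
    my @lines = <$err>;
    waitpid $pid, 0;

    my (@ops, $in_program);
    for my $line (@lines) {
        if ($line =~ /^Final program:/) { $in_program = 1; next }
        next unless $in_program;
        # Program lines look like "   1: EXACT <abc> (3)"; stop at the first
        # line that does not, which ends the dump.
        last unless $line =~ /^\s*\d+:\s*([A-Z]\w*)/;
        push @ops, $1;
    }
    return @ops;
}

print join(' ', regex_opcodes('abc')), "\n";
```

Which opcode names to treat as suspicious is left to the caller. The child could also be given a time limit, as suggested above.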
Re: Is it safe to use external strings for regexes? by AnomalousMonk (Archbishop) on Oct 07, 2021 at 20:38 UTC
As a matter of curiosity, I tried some of the classic regexes mentioned in this thread that threaten exponential explosion. They seem to have been tamed long ago. (Same results for version 5.30.3.1.) What are some examples that can still go exponential?

    Win8 Strawberry 5.8.9.5 (32) Thu 10/07/2021 16:15:36
    C:\@Work\Perl\monks
    >perl -Mstrict -Mwarnings -l
    my $futile = 'a' x 10_000;

    print 'start ', scalar time;

    die 'huh?' if $futile =~
       /a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*b/
       ;
    print 'post rx 1 ', scalar time;

    die 'huh?' if $futile =~
       /(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:a*)*)*)*)*)*)*)*)*)*)*)*)*)*b/
       ;
    print 'post rx 2 ', scalar time;

    print 'done ', scalar time;
    ^Z
    start 1633637901
    post rx 1 1633637901
    post rx 2 1633637901
    done 1633637901

Give a man a fish: `<%-{-{-{-<`
Re^2: Is it safe to use external strings for regexes? by LanX (Saint) on Oct 07, 2021 at 20:58 UTC
My guess: Perl is looking for exact strings from both ends; your regexes include a trailing "b" but your string $futile doesn't.

This simplified demo seems to support my theory:

    D:\tmp\pm>perl -Mre=debug -E"'aaaa' =~/a*a*b/"
    Compiling REx "a*a*b"
    synthetic stclass "ANYOF[ab]".
    Final program:
       1: STAR (4)
       2:   EXACT <a> (0)
       4: STAR (7)
       5:   EXACT <a> (0)
       7: EXACT <b> (9)
       9: END (0)
    floating "b" at 0..9223372036854775807 (checking floating) stclass ANYOF[ab] min len 1
    Matching REx "a*a*b" against "aaaa"
    Intuit: trying to determine minimum start position...
      doing 'check' fbm scan, [0..4] gave -1
      Did not find floating substr "b"...
    Match rejected by optimizer
    Freeing REx: "a*a*b"

Please note:

    Did not find floating substr "b"...
    Match rejected by optimizer

Cheers Rolf
Re^2: Is it safe to use external strings for regexes? by LanX (Saint) on Oct 07, 2021 at 21:53 UTC
Replace the "exact" `b` with a character class `[bc]`. Then Perl can't rule out the string because of the missing end, and you'll see exponential growth.

    use strict;
    use warnings;

    $\="\n";
    $|=1;

    redos($_) for 5..8;

    sub redos {
        my ($length)=@_;
        my $futile = 'a' x $length;
        print "=== length=$length string=$futile";
        print 'start ', my $start = scalar time;

        die 'huh?' if $futile =~
          /a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*[bc]/
          ;
        print 'post rx 1 ', time -$start," sec";

        die 'huh?' if $futile =~
          /(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:a*)*)*)*)*)*)*)*)*)*)*)*)*)*[bc]/
          ;
        print 'post rx 2 ', time -$start," sec";

        print 'done ', time -$start," sec";
        print "\n" x2;
    }

Output:

    C:/Strawberry/perl/bin\perl.exe -w d:/tmp/pm/redos.pl
    === length=5 string=aaaaa
    start 1633643412
    post rx 1 0 sec
    post rx 2 0 sec
    done 0 sec

    === length=6 string=aaaaaa
    start 1633643412
    post rx 1 1 sec
    post rx 2 1 sec
    done 1 sec

    === length=7 string=aaaaaaa
    start 1633643413
    post rx 1 4 sec
    post rx 2 4 sec
    done 4 sec

    === length=8 string=aaaaaaaa
    start 1633643417
    post rx 1 20 sec
    post rx 2 20 sec
    done 20 sec

It's the first regex which is obviously growing in an exponential manner...

Cheers Rolf
Re^2: Is it safe to use external strings for regexes? (infinite loops) by LanX (Saint) on Oct 11, 2021 at 12:31 UTC
Hi, I just stumbled over an example of something even worse: infinite loops.

perlre#Repeated Patterns Matching a Zero-length Substring

> A common abuse of this power stems from the ability to make infinite loops using regular expressions, with something as innocuous as:
>
>     "foo" =~ m{ ( o? )* }x;
>
> The o? matches at the beginning of "foo", and since the position in the string is not moved by the match, o? would match again and again because of the "*" quantifier. Another common way to create a similar cycle is with the looping modifier /g:

Cheers Rolf
Re^3: Is it safe to use external strings for regexes? (infinite loops) by choroba (Cardinal) on Oct 11, 2021 at 13:29 UTC
Huh?

    $ perl -wE '$f = "foo"; say pos $f while $f =~ m{ ( o? )* }gx;'
    0
    3
    3

`map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]`
Re^4: Is it safe to use external strings for regexes? (infinite loops) by LanX (Saint) on Oct 11, 2021 at 15:36 UTC
Re^5: Is it safe to use external strings for regexes? (infinite loops) by choroba (Cardinal) on Oct 11, 2021 at 15:40 UTC
Re^5: Is it safe to use external strings for regexes? (infinite loops) by AnomalousMonk (Archbishop) on Oct 11, 2021 at 17:56 UTC
Re^3: Is it safe to use external strings for regexes? (infinite loops) by AnomalousMonk (Archbishop) on Oct 11, 2021 at 18:01 UTC
The section to which you linked goes on to say:

> Thus Perl allows such constructs, by forcefully breaking the infinite loop.

Give a man a fish: `<%-{-{-{-<`
Re^4: Is it safe to use external strings for regexes? (infinite loops) by AnomalousMonk (Archbishop) on Oct 11, 2021 at 21:58 UTC
Re^5: Is it safe to use external strings for regexes? (infinite loops) by LanX (Saint) on Oct 11, 2021 at 22:45 UTC
Re^4: Is it safe to use external strings for regexes? (infinite loops) by LanX (Saint) on Oct 11, 2021 at 21:27 UTC
Re: Is it safe to use external strings for regexes? by Anonymous Monk on Oct 06, 2021 at 15:16 UTC
Why not stick your regexen in a custom module? You're going to have to change code to centralize them anyway, and `use Our::Custom::Regexen;` looks to me like a lot less work than pulling them out of a database. In fact, the database solution might be enough work that you end up wrapping it in a custom module anyway.

Source control on this module may not prevent the injection of broken code, but at least it will let you figure out who did it.

Your example script could easily be converted into a test using the Perl testing infrastructure, whichever way you choose to implement this.
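Such a module might look roughly like this. It is a single-file sketch: the package name comes from the post above, while the pattern names and the `regex_for` / `names` subs are made up for illustration. It needs Perl 5.14 or later for the `package NAME { ... }` block form.

```perl
use strict;
use warnings;

package Our::Custom::Regexen {
    # One place for every pattern, keyed by a descriptive (made-up) name.
    # The three patterns are the ones from the original post.
    my %REGEX = (
        a_to_z      => qr/a.*z/,
        digits_only => qr/^\d+$/,
        year_code   => qr/^\d{4}[AZ]\d$/,
    );

    # Look up a compiled regex by name; unknown names are a hard error.
    sub regex_for {
        my ($name) = @_;
        return $REGEX{$name} // die "Unknown regex name '$name'\n";
    }

    # All known names, sorted, e.g. for driving a test loop.
    sub names { return sort keys %REGEX }
}

for my $name (Our::Custom::Regexen::names()) {
    my $re = Our::Custom::Regexen::regex_for($name);
    printf "%-12s %s\n", $name, ('2021Z1' =~ $re ? 'matches 2021Z1' : 'no match');
}
```

Because the patterns are `qr//` literals in source, a broken one fails when the module is compiled, and every change shows up in the version-control history.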
Re^2: Is it safe to use external strings for regexes? by NERDVANA (Deacon) on Oct 07, 2021 at 12:38 UTC
Expanding on this idea, it looks like those regexes might be identifying types of data? If so, consider a type library built on Type::Tiny
Re^3: Is it safe to use external strings for regexes? by Anonymous Monk on Oct 07, 2021 at 14:35 UTC
Or maybe Regexp::Common if you have to infer the data type from its form.