Stop runaway regex

yiannis2014 has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Stop runaway regex by davido (Cardinal) on May 30, 2014 at 04:46 UTC
This question was also posted on StackOverflow, and I took some time answering it there based on experience I acquired when an application I wrote needed to accept user regexes, and needed to not succumb to DoS attacks. Here's a recap:

alarm is inadequate; it cannot interrupt the running regexp engine.

Sys::SigAction provides a function called `timeout_call`, which is capable of interrupting the regexp engine while it's running. However, the RE engine was not designed for this possibility. It can be left in an unstable state, which can (and often enough will) lead to segfaults (tested on various versions of Perl). This is usually undesirable. Presumably POSIX::SigAction will share the same weakness, as the weakness is really the fact that the RE engine isn't designed to be interrupted.

If your regular expression will work with the RE2 engine, you are in luck, because it guarantees linear-time searches. There is a CPAN module that interfaces with the RE2 engine, as a drop-in replacement for Perl's engine: re::engine::RE2. Here's the big catch though: the "linear-time" guarantee comes at the cost of many of the powerful regex semantics we've come to expect with Perl's elaborate RE engine. For example, RE2 has no backreferences, nor zero-width assertions. If you need those, this won't work for you. But if you can live with its limitations, it is a fantastic option (assuming you've got a recent enough Perl to use it).

The best solution that provides the full semantic power of Perl's regular expressions, while also providing the ability to safely time out, is the fork/alarm/wait idiom. Fork a worker, set an alarm, wait, and if the alarm expires, shut down the worker. No need to worry about the process becoming unstable; you're done with it anyway.

Dave
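The fork/alarm/wait idiom described above might look roughly like the following. This is only a minimal sketch, not the code from the StackOverflow answer: the helper name `run_with_timeout` and its return convention (1 for true, 0 for false, undef on timeout) are assumptions made for illustration, and it relies on fork and alarm behaving as they do on Unix-like systems.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical helper: run $code in a forked child and give up after
# $timeout seconds. Returns 1 if $code returned true, 0 if it returned
# false (or died), and undef if the child had to be killed.
sub run_with_timeout {
    my ($code, $timeout) = @_;

    my $pid = fork();
    die "fork failed: $!\n" unless defined $pid;

    if ($pid == 0) {
        # Child: do the risky work and report through the exit status.
        my $ok = eval { $code->() };
        POSIX::_exit($ok ? 0 : 1);
    }

    # Parent: wait for the child, but no longer than $timeout seconds.
    my $timed_out = 0;
    eval {
        local $SIG{ALRM} = sub { die "timeout\n" };
        alarm $timeout;
        waitpid($pid, 0);
        alarm 0;
        1;
    } or do {
        alarm 0;
        die $@ unless $@ eq "timeout\n";
        $timed_out = 1;
        kill 'KILL', $pid;    # the child's state no longer matters
        waitpid($pid, 0);     # reap it so no zombie is left behind
    };

    return if $timed_out;
    return ($? >> 8) == 0 ? 1 : 0;
}
```

A call such as `run_with_timeout(sub { $input =~ $user_regex }, 2)` would then run the match in the child. The exit status carries only a yes/no answer; anything richer (captures, the modified string) would have to travel back through a pipe or a temporary file.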
Re^2: Stop runaway regex # CROSSPOST by LanX (Saint) on May 30, 2014 at 15:15 UTC
> This question was also posted on StackOverflow,

Shouldn't it be marked there, too? http://stackoverflow.com/q/23937014/716443

Cheers Rolf ( addicted to the Perl Programming Language)

Update: rephrased
Re^3: Stop runaway regex # CROSSPOST by davido (Cardinal) on May 30, 2014 at 15:49 UTC
Done. (I added a mention to the bottom of my response to the SO question.) Dave
Re: Stop runaway regex # alarm , /g , (?{code}) by LanX (Saint) on May 29, 2014 at 15:37 UTC
> is there some form of timer that I can use to terminate a regex that takes longer than X seconds to complete

No! But you might want to experiment with alarm and stop the whole program with die².

> I know it can be modified so it doesn't break, etc,

While I don't like the way you are asking, I'll give you two general hints:

You might want to restructure your regex with the /g modifier and pos so that it works within a loop. Then you are able to execute Perl code between matches to measure the runtime.

As a (more experimental) variation of the former: regexes can embed Perl code with the `(?{code})` syntax; you may want to experiment with this and die after a timespan.¹

Cheers Rolf ( addicted to the Perl Programming Language)

¹) I said `die` because I've never seen any return-like exit from a regex ... which doesn't mean it's impossible² (?). (But I am not interested in testing it for you :)

Update: ²) Well, you can always catch `die` from within `eval { BLOCK }`.
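The first hint (a /g loop with a clock check between matches) could be sketched as follows. The helper name `scan_with_deadline`, its arguments, and the use of Time::HiRes are assumptions for illustration only. Note that the clock is consulted only between matches, so a single match attempt that backtracks catastrophically is still not interrupted.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Hypothetical helper: collect every match of $token_re in $text, one match
# per loop iteration, and die once more than $budget seconds have elapsed.
sub scan_with_deadline {
    my ($text, $token_re, $budget) = @_;
    my $deadline = time + $budget;
    my @found;

    # In scalar context, /g resumes each match at pos($text).
    while ($text =~ /($token_re)/g) {
        push @found, $1;
        die sprintf("gave up at offset %d\n", pos($text))
            if time > $deadline;
    }
    return @found;
}
```

Wrapping the call in `eval { ... }` lets the caller catch the `die` and carry on with the next input.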
Re^2: Stop runaway regex # (?{code}) by LanX (Saint) on May 29, 2014 at 16:58 UTC
> But I am not interested to test it for you :)

Well, curiosity won! return causes an error, and "won't stay shared" lexical variables are problematic too. Though this works and can be extended:

```perl
use strict;
use warnings;

my $start;
my $diff;
my $timeout;

sub tst {
    $timeout = shift;
    my $str = "a" x 10000;
    $str .= "b";
    $start = time;
    $str =~ /^ (( a*
        (?{
            $diff = time - $start;
            die "stopped after $diff sec" if $diff >= $timeout;
        })
    )) $/x;
}

tst(10);
```

Output:

```
Complex regular subexpression recursion limit (32766) exceeded at timeout_regex.pl line 15.
stopped after 10 sec at (re_eval 1) line 3.
```

Cheers Rolf ( addicted to the Perl Programming Language)

Update: NB: this approach considerably slows down the regex when done in the innermost group-loop; outer loops OTOH might be checked too seldom. YMMV
Re^3: Stop runaway regex # (?{code}) by Anonymous Monk on May 29, 2014 at 20:08 UTC
> return causes an error and "won't stay shared" lexical variables are problematic too.

Get a newer perl (the latest); this part has been upgraded. In the meantime, use `local our $diff; ...`
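A minimal sketch of that workaround, assuming the point is to keep the variables used inside `(?{ ... })` as package variables rather than closed-over lexicals. The sub name `match_or_die` and the toy pattern are made up for illustration.

```perl
use strict;
use warnings;

# Package variables, so the embedded code block does not close over lexicals.
our ($start, $timeout);

# Hypothetical helper: match a run of "a"s followed by "b", dying from inside
# the regex once $limit seconds have passed.
sub match_or_die {
    my ($str, $limit) = @_;
    local $start   = time;
    local $timeout = $limit;
    return $str =~ /^ (?: a
        (?{ die "stopped\n" if time - $start >= $timeout })
    )* b $/x;
}
```

As in the code above, the check runs once per iteration of the group, so it costs time on every pass through the loop.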
Re^2: Stop runaway regex # alarm problematic by LanX (Saint) on May 29, 2014 at 18:16 UTC
Well, while the docs look promising, alarm didn't interrupt regexes for me! Signals are executed only after the regex has completed (5.10 / Linux). I found the thread "Timeout alarm for regex", so this has always been problematic (see also Deferred Signals (Safe Signals)). No idea if it's still the case for newer Perl versions. Cheers Rolf ( addicted to the Perl Programming Language)
Re: Stop runaway regex by mr_mischief (Monsignor) on May 29, 2014 at 19:48 UTC
As LanX had guessed, non-deferred signals can do what you want. It's not that pretty, but it works.

```perl
#!/usr/bin/perl
use strict;
use warnings;
$|++;

my $s = 0;

use POSIX qw(SIGALRM);
POSIX::sigaction(
    SIGALRM,
    POSIX::SigAction->new(
        sub { warn "skipping $s (took too long)\n"; die }
    )
) || die "Error setting SIGALRM handler: $!\n";

my $str = 'ffvadsvefwdvewrfvt4vketwrhjkbveqwkjhfkghjlfghjkufghjkfhjkfjkgfghfkhjfkhjgfhjgfhgfhkgfhkgfhkgfhkgfkhjgfkjgfkghjfkhjgfhjgfkhjgfhjkfk' x 40960;
$str .= 'hjkbklklhbjklercvqewrqereqrfqeerv;;;jnrveervnlknrvlerlvnerlnvelrvnervlkenvlervojubnertvffff;kn;kff;kn;fk;k;;kmnff;knmf;nff;mnkf;;k;;' x 40960;
my $str2 = $str x 8;
my $str3 = 'furrfu';
my $re = qr/(f((\w?)(\w?))?)+/;

print time . "\n\n";

for ( $str, $str2, $str3 ) {
    $s++;
    my $res;
    alarm 2;
    eval {
        $res = $_ =~ s/$re/ ^_^ /g;
    };
    print "$s made $res\n" unless $@;
}
print "\n" . time . "\n";
exit;
```

This puts out something similar to the following, given the system is slow enough to take more than two seconds on that second string but not on the first (or third -- gosh, let's hope not!).

```
1401392795

1 made 1310721
skipping 2 (took too long)
3 made 2

1401392798
```

```
This is perl 5, version 16, subversion 2 (v5.16.2) built for darwin-thread-multi-2level
(with 3 registered patches, see perl -V for more detail)
```

Update: Per davido's advice in the thread, I'll point out that the above is tested but not thoroughly so. One might hope that the die() and ending the eval would be enough unrolling of state that no segfaults or other wonkiness would happen when the regular expression engine is interrupted and reinvoked. If not, the forking model does make a lot of sense. One might also put the code for handling one regex at a time into a separate script and hand off to that with something like IPC::Open3, which will handle parts of the child management and inter-process communication for you.
Re^2: Stop runaway regex by LanX (Saint) on May 30, 2014 at 15:10 UTC
FYI: Even w/o deferred signals I was able to create segfaults with alarm (5.10). Just by putting simple `(?{code})` into the regex to catch the alarm-signal. Cheers Rolf ( addicted to the Perl Programming Language)
Re^3: Stop runaway regex by mr_mischief (Monsignor) on May 30, 2014 at 15:24 UTC
I was unable to get a segfault testing the exact code I posted on either 5.10.1 on CentOS or 5.16.2 on OS X. I only tested it with the strings and regex listed in the code, though.
Re^4: Stop runaway regex by LanX (Saint) on May 30, 2014 at 15:44 UTC
Re^5: Stop runaway regex by mr_mischief (Monsignor) on May 30, 2014 at 15:53 UTC
Re: Stop runaway regex by DrHyde (Prior) on May 30, 2014 at 11:07 UTC
There's an example in CPAN::ParseDistribution::Unix. It uses Parallel::ForkManager to run the potentially time-consuming code in a separate process, communicate its results back to the parent process, and time it out if it takes too long. This code only works reliably on Unix. If you can find a way of making it work on legacy platforms then I'd love to know.
Re: Stop runaway regex by Anonymous Monk on May 29, 2014 at 15:35 UTC
> modifying it means I need to retest it on all the inputs.

This should be trivial to do. One file full of input strings and one full of their expected results.

```perl
while (my $input = <$testInputsFH>) {
    my $result = doBigRegexOn($input);
    my $expected = <$expectedResultFH>;
    warn "Test failed; got $result instead of $expected on line $.\n"
        unless $result eq $expected;
}
```
Re: Stop runaway regex by taint (Chaplain) on Jun 02, 2014 at 16:11 UTC
Warning: what follows is purely conceptual.

Perhaps I'm just too simple-minded, but wouldn't it be possible to create a file as a pseudo pid file, then check against its mtime, or atime, for use as a timer? Then wrap the time-sensitive process (whatever it is) in a for/while loop, all the while checking the time since the file was created. If that time has exceeded the limit, issue an `exit`? Using system (NIX), part of it might look like:

```perl
my $arg1 = ('-XP . -type l -cmin');
my $arg2 = ('+15');
my $arg3 = ('xargs rm');
system("/usr/bin/find $arg1 $arg2 | $arg3");
```

Utilizing the above, one could even make the loop depend upon the existence of the "pseudo pid" file. No?

Like I said: conceptual. But I used a similar approach for something else "time sensitive/related" (which is where the code above was taken from), and it works great.

--Chris

¡λɐp ʇɑəɹ⅁ ɐ əʌɐɥ puɐ ʻꜱdləɥ ꜱᴉɥʇ ədoH
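One way to read the idea above in Perl itself is sketched below, assuming the work can be split into small steps. The helper name `loop_until_stale` and the step callback are hypothetical, and the marker file comes from File::Temp rather than a real pid file. As with any check placed between steps, it cannot interrupt a single long-running regex match.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: call $step->() repeatedly until it returns false or
# the marker file is at least $max_age seconds old. Returns the number of
# steps that returned true.
sub loop_until_stale {
    my ($step, $max_age) = @_;

    # The marker file's mtime records when the loop started.
    my ($fh, $marker) = tempfile(UNLINK => 1);
    close $fh;

    my $rounds = 0;
    while ($step->()) {
        $rounds++;
        my $mtime = (stat $marker)[9];
        last if time - $mtime >= $max_age;
    }
    return $rounds;
}
```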
