Re: Stop runaway regex
by davido (Cardinal) on May 30, 2014 at 04:46 UTC
This question was also posted on StackOverflow, and I took some time answering it there, based on experience I acquired when an application I wrote needed to accept user regexes without succumbing to DoS attacks. Here's a recap:
alarm alone is inadequate; with Perl's default deferred ("safe") signals the handler does not run until the regexp engine returns, so it cannot interrupt a running match.
Sys::SigAction provides a function called timeout_call, which is capable of interrupting the regexp engine while it's running. However, the RE engine was not designed for this possibility. It can be left in an unstable state, which can (and often enough will) lead to segfaults (tested on various versions of Perl). This is usually undesirable. Presumably POSIX::SigAction will share the same weakness, as the weakness is really the fact that the RE engine isn't designed to be interrupted.
If your regular expression will work with the RE2 engine, you are in luck, because it guarantees linear-time searches. There is a CPAN module that interfaces with the RE2 engine as a drop-in replacement for Perl's engine: re::engine::RE2. Here's the big catch, though: the "linear-time" guarantee comes at the cost of many of the powerful regex semantics we've come to expect from Perl's elaborate RE engine. For example, RE2 has no backreferences and no lookahead or lookbehind assertions. If you need those, this won't work for you. But if you can live with its limitations, it is a fantastic option (assuming you've got a recent enough Perl to use it).
The best solution that provides the full semantic power of Perl's regular expressions, while also providing the ability to safely time out, is the fork/alarm/wait idiom. Fork a worker, set an alarm, wait, and if the alarm expires, shut down the worker. No need to worry about the process becoming unstable; you're done with it anyway.
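The fork/alarm/wait idiom described above might look roughly like the following. This is a minimal sketch, not the code from the StackOverflow answer: the helper name run_with_timeout, the pipe used to pass the result back, and the (ok, output) return convention are all assumptions made for illustration, and it is Unix-only.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical helper: run $code->() in a forked worker and give up
# after $timeout seconds. Returns (1, $output) on success and
# (0, undef) if the worker had to be killed.
sub run_with_timeout {
    my ($timeout, $code) = @_;

    pipe(my $reader, my $writer) or die "pipe failed: $!";
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {                       # child: do the risky work
        close $reader;
        my $result = $code->();
        print {$writer} defined $result ? $result : '';
        close $writer;                     # flushes the buffered output
        POSIX::_exit(0);                   # skip END blocks and destructors
    }

    close $writer;                         # parent: wait, but not forever
    my $output;
    my $ok = eval {
        local $SIG{ALRM} = sub { die "timeout\n" };
        alarm $timeout;
        local $/;                          # slurp everything the child wrote
        $output = <$reader>;
        waitpid($pid, 0);
        alarm 0;
        1;
    };
    unless ($ok) {
        alarm 0;
        die $@ unless $@ eq "timeout\n";   # rethrow anything unexpected
        kill 'KILL', $pid;                 # the worker is expendable
        waitpid($pid, 0);                  # reap it so no zombie is left
        return (0, undef);
    }
    return (1, $output);
}

my ($ok, $out) = run_with_timeout(2, sub {
    'furrfu' =~ /(f\w*)+/ ? 'match' : 'no match';
});
print $ok ? "$out\n" : "timed out\n";
```

The regex only ever runs in the child, so the parent's regexp engine is never interrupted. Anything the parent needs (captures, the modified string) has to be sent back through the pipe.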
Re: Stop runaway regex # alarm , /g , (?{code})
by LanX (Saint) on May 29, 2014 at 15:37 UTC
> is there some form of timer that I can use to terminate a regex that takes longer than X seconds to complete
No! But you might want to experiment with alarm and stop the whole program with die².
> I know it can be modified so it doesn't break, etc,
While I don't like the way you are asking, I'll give you two general hints:
- you might wanna restructure your regex with the /g modifier and pos so that it works within a loop. Then you are able to execute Perl code between matches to measure the runtime.
- as a (more experimental) variation of the former: regexes can embed Perl code with the (?{code}) syntax; you may want to experiment with this and die after a timespan.¹
Cheers Rolf
( addicted to the Perl Programming Language)
¹) I said die cause I've never seen any return-like exit from a regex ...which doesn't mean it's impossible² (?). ( But I am not interested to test it for you :)
update
²) well you can always catch die from within eval { BLOCK }
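A rough sketch of the first hint (a /g loop driven by pos and \G), assuming the big regex can be broken into small per-token matches. The sub name and the token pattern are made up for illustration. Note the limitation: the clock is only checked between matches, so a single runaway match is still not interrupted.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Hypothetical helper: walk the string token by token with /gc and \G,
# checking the elapsed time after every successful match.
sub count_words_with_deadline {
    my ($str, $timeout) = @_;
    my $deadline = time + $timeout;
    my $count    = 0;

    pos($str) = 0;                          # start scanning at the beginning
    while ($str =~ /\G\s*(\w+)/gc) {        # each match resumes at pos($str)
        $count++;
        die "stopped after $timeout sec\n" if time > $deadline;
    }
    return $count;
}

my $n = eval { count_words_with_deadline('just another perl hacker', 5) };
print defined $n ? "matched $n words\n" : "gave up: $@";
```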
|
use strict;
use warnings;
my $start;
my $diff;
my $timeout;
sub tst {
    $timeout = shift;
    my $str = "a" x 10000;
    $str .= "b";
    $start = time;
    $str =~ /^ (( a*
        (?{
            $diff = time - $start;
            die "stopped after $diff sec"
                if $diff >= $timeout;
        })
    )*)*
    $/x;
}
tst(10);
output:
Complex regular subexpression recursion limit (32766) exceeded at timeout_regex.pl line 15.
stopped after 10 sec at (re_eval 1) line 3.
Cheers Rolf
( addicted to the Perl Programming Language)
update
NB: this approach considerably slows down the regex when done in the innermost group-loop, outer loops OTOH might be checked too seldom. YMMV
|
return causes an error, and lexical variables that "won't stay shared" are problematic too.
Get a newer perl (the latest, if you can); this part of the regex engine has been upgraded.
In the meantime, use local our $diff; ...
|
Well, while the docs look promising, alarm didn't interrupt regexes for me!
Signals are executed only after the regex is completed. (5.10 / Linux)
Found this thread "Timeout alarm for regex", so this always used to be problematic!
(see also Deferred Signals (Safe Signals))
No idea if it's still the case for newer Perl versions.
Cheers Rolf
( addicted to the Perl Programming Language)
Re: Stop runaway regex
by mr_mischief (Monsignor) on May 29, 2014 at 19:48 UTC
As LanX had guessed, non-deferred signals can do what you want. It's not that pretty, but it works.
#!/usr/bin/perl
use strict;
use warnings;
$|++;
my $s = 0;
use POSIX qw(SIGALRM);
# Install the handler through POSIX::sigaction so the signal is delivered
# immediately instead of being deferred until the regex engine returns.
POSIX::sigaction(SIGALRM, POSIX::SigAction->new( sub { warn "skipping $s (took too long)\n"; die } )) || die "Error setting SIGALRM handler: $!\n";
my $str = 'ffvadsvefwdvewrfvt4vketwrhjkbveqwkjhfkghjlfghjkufghjkfhjkfjkgfghfkhjfkhjgfhjgfhgfhkgfhkgfhkgfhkgfkhjgfkjgfkghjfkhjgfhjgfkhjgfhjkfk' x 40960;
$str .= 'hjkbklklhbjklercvqewrqereqrfqeerv;;;jnrveervnlknrvlerlvnerlnvelrvnervlkenvlervojubnertvffff;kn;kff;kn;fk;k;;kmnff;knmf;nff;mnkf;;k;;' x 40960;
my $str2 = $str x 8;
my $str3 = 'furrfu';
my $re = qr/(f((\w?)(\w*?))?)+/;
print time . "\n\n";
for ( $str, $str2, $str3 ) {
    $s++;
    my $res;
    alarm 2;        # allow each substitution at most two seconds
    eval {
        $res = $_ =~ s/$re/ ^_^ /g;
        alarm 0;    # finished in time: cancel the pending alarm
    };
    print "$s made $res\n" unless $@;
}
print "\n" . time . "\n";
exit;
This puts out something similar to the following, given the system is slow enough to take more than two seconds on that second string but not on the first (or third -- gosh, let's hope not!).
1401392795
1 made 1310721
skipping 2 (took too long)
3 made 2
1401392798
This is perl 5, version 16, subversion 2 (v5.16.2) built for darwin-thread-multi-2level
(with 3 registered patches, see perl -V for more detail)
Update: Per davido's advice in the thread, I'll point out that the above is tested but not thoroughly so. One might hope that the die() and ending the eval would be enough unrolling of state that no segfaults or other wonkiness would happen when the regular expression engine is interrupted and reinvoked. If not, the forking model does make a lot of sense. One might also put code for handling one regex at a time into a separate script and hand off to that with something like IPC::Open3 which will handle parts of the child management and inter-process communication for you.
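The IPC::Open3 hand-off mentioned in the update could be sketched like this. It is an illustration under assumptions, not tested against hostile input: run_command_with_timeout is a hypothetical helper, the one-line child script stands in for the "separate script", and the child's STDERR is simply merged into its STDOUT.

```perl
use strict;
use warnings;
use IPC::Open3;

# Hypothetical helper: run @cmd as a separate process and return its
# output, or undef if it had to be killed after $timeout seconds.
sub run_command_with_timeout {
    my ($timeout, @cmd) = @_;

    # An undef third argument sends the child's STDERR to the same
    # handle as its STDOUT.
    my $pid = open3(my $in, my $out, undef, @cmd);
    close $in;                              # nothing to send on STDIN

    my $output;
    my $ok = eval {
        local $SIG{ALRM} = sub { die "timeout\n" };
        alarm $timeout;
        local $/;                           # slurp all of the child's output
        $output = <$out>;
        waitpid($pid, 0);
        alarm 0;
        1;
    };
    unless ($ok) {
        alarm 0;
        die $@ unless $@ eq "timeout\n";    # rethrow anything unexpected
        kill 'KILL', $pid;
        waitpid($pid, 0);                   # reap the killed child
        return;
    }
    return $output;
}

# Stand-in for the separate script: the pattern and the string arrive
# as arguments, so only the child ever runs the regex.
my $script = 'my ($re, $s) = @ARGV; print $s =~ /$re/ ? "match" : "no match";';
my $answer = run_command_with_timeout(2, $^X, '-e', $script, '(f\w*)+', 'furrfu');
print defined $answer ? "$answer\n" : "timed out\n";
```

Passing the command as a list avoids the shell, which matters when the pattern comes from a user.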
I was unable to get a segfault testing the exact code I posted on either 5.10.1 on CentOS or 5.16.2 on OSX. I only tested it with the strings and regex listed in the code, though.
Re: Stop runaway regex
by DrHyde (Prior) on May 30, 2014 at 11:07 UTC
There's an example in CPAN::ParseDistribution::Unix. It uses Parallel::ForkManager to run the potentially time-consuming code in a separate process, communicate its results back to the parent process, and time it out if it takes too long. This code only works reliably on Unix. If you can find a way of making it work on legacy platforms then I'd love to know.
Re: Stop runaway regex
by Anonymous Monk on May 29, 2014 at 15:35 UTC
> modifying it means I need to retest it on *all* the inputs.
This should be trivial to do. One file full of input strings and one full of their expected results.
while (my $input = <$testInputsFH>)
{
    chomp $input;
    my $result = doBigRegexOn($input);
    my $expected = <$expectedResultFH>;
    chomp $expected;    # drop the newline so the comparison can succeed
    warn "Test failed; got $result instead of $expected on line $.\n" unless $result eq $expected;
}
Re: Stop runaway regex
by taint (Chaplain) on Jun 02, 2014 at 16:11 UTC
Warning: what follows is purely conceptual.
Perhaps I'm just too simple minded;
But wouldn't it be possible to create a file, as a pseudo pid file, and then check its mtime or atime for use as a timer? Wrap the time-sensitive process in a for/while loop, all the while checking the time since the file was created; if that time has exceeded the limit, issue an exit.
Using system (*NIX). Part of it might look like:
my $arg1 = ('-XP . -type l -cmin');
my $arg2 = ('+15');
my $arg3 = ('xargs rm');
system("/usr/bin/find $arg1 $arg2 | $arg3");
Utilizing the above, one could even make the loop depend upon the existence of the "pseudo pid" file. No?
Like I said; conceptual. But I used a similar approach for something else "time sensitive/related" (which is where the code above was taken from), and it works great.
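In pure Perl, the pseudo pid file idea might be sketched as below. The names time_exceeded, @work and the 15-second limit are assumptions for illustration. As with any check made between loop iterations, this cannot interrupt a single regex match that is already running.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# True once $file's mtime is more than $limit seconds in the past,
# or the file has vanished (which the loop also treats as "stop").
sub time_exceeded {
    my ($file, $limit) = @_;
    my $mtime = (stat $file)[9];
    return 1 unless defined $mtime;
    return (time - $mtime) > $limit ? 1 : 0;
}

my ($fh, $marker) = tempfile(UNLINK => 1);  # the "pseudo pid" file
close $fh;

my @work = map { "item $_" } 1 .. 5;        # stand-in for the real inputs
my $done = 0;
for my $item (@work) {
    last if time_exceeded($marker, 15);     # check the timer between steps
    $done++ if $item =~ /\d/;               # stand-in for the time-sensitive step
}
print "processed $done of ", scalar(@work), " items\n";
```

Removing the marker file from another process stops the loop too, which covers the "depend upon the existence of the file" variation.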
--Chris
”λɐp ʇɑəɹ⅁ ɐ əʌɐɥ puɐ ʻꜱdləɥ ꜱᴉɥʇ ədoH