PerlMonks  

Stop runaway regex

by yiannis2014 (Initiate)
on May 29, 2014 at 15:25 UTC (perlquestion, id://1087822)

yiannis2014 has asked for the wisdom of the Perl Monks concerning the following question:

Hi, is there a way to stop a runaway regular expression?

I am not interested in suggestions on how to modify it. I know it can be modified so it doesn't break, etc, but I am running a single regex against thousands of inputs, so modifying it means I need to retest it on *all* the inputs.

So the exact question is: is there some form of timer that I can use to terminate a regex that takes longer than X seconds to complete?

Replies are listed 'Best First'.
Re: Stop runaway regex
by davido (Cardinal) on May 30, 2014 at 04:46 UTC

    This question was also posted on StackOverflow, and I took some time answering it there based on experience I acquired when an application I wrote needed to accept user regexes, and needed to not succumb to DOS attacks. Here's a recap:

    • alarm is inadequate; it cannot interrupt the running regexp engine.

    • Sys::SigAction provides a function called timeout_call, which is capable of interrupting the regexp engine while it's running. However, the RE engine was not designed for this possibility. It can be left in an unstable state, which can (and often enough will) lead to segfaults (tested on various versions of Perl). This is usually undesirable. Presumably POSIX::SigAction will share the same weakness, as the weakness is really the fact that the RE engine isn't designed to be interrupted.

    • If your regular expression will work with the RE2 engine, you are in luck, because it guarantees linear-time searches. There is a CPAN module that interfaces with the RE2 engine, as a drop-in replacement for Perl's engine: re::engine::RE2. Here's the big catch though: The "linear-time" guarantee comes at the cost of many of the powerful regex semantics we've come to expect with Perl's elaborate RE engine. For example, RE2 has no backreferences, nor lookahead or lookbehind assertions. If you need those, this won't work for you. But if you can live with its limitations, it is a fantastic option (assuming you've got a recent enough Perl to use it).

    • The best solution that provides the full semantic power of Perl's regular expressions, while also providing the ability to safely time out, is the fork/alarm/wait idiom. Fork a worker, set an alarm, wait, and if the alarm expires, shut down the worker. No need to worry about the process becoming unstable; you're done with it anyway.
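
    The fork/alarm/wait idiom from the last bullet could be sketched roughly as follows. This is a minimal sketch for a Unix-like system, not code from the StackOverflow answer; the helper name match_with_timeout and its return convention are made up for illustration.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical helper (not from the thread): match $str against $re in a
# forked child and give up after $timeout seconds.
# Returns 1 on match, 0 on no match, undef on timeout or abnormal child exit.
sub match_with_timeout {
    my ($re, $str, $timeout) = @_;

    my $pid = fork();
    die "fork failed: $!\n" unless defined $pid;

    if ($pid == 0) {
        # Child: run the possibly runaway match and report through the exit
        # code (0 = match, 1 = no match, 2 = the match itself died).
        my $code = eval { $str =~ $re ? 0 : 1 };
        # POSIX::_exit skips END blocks and destructors inherited from the parent.
        POSIX::_exit(defined $code ? $code : 2);
    }

    # Parent: wait for the child, but only for $timeout seconds.
    my $finished = eval {
        local $SIG{ALRM} = sub { die "timeout\n" };
        alarm $timeout;
        waitpid($pid, 0);
        alarm 0;
        1;
    };

    unless ($finished) {
        my $err = $@;
        alarm 0;
        kill 'KILL', $pid;    # the child's state no longer matters
        waitpid($pid, 0);     # reap it so no zombie is left behind
        die $err unless $err eq "timeout\n";
        return undef;
    }

    return undef if $? & 127;             # child was killed by a signal
    my $code = $? >> 8;
    return $code == 0 ? 1 : $code == 1 ? 0 : undef;
}

my $result = match_with_timeout(qr/^(\w+)@(\w+)$/, 'user@host', 5);
print defined $result ? "matched: $result\n" : "timed out\n";
```

    Only a yes/no answer comes back through the exit code here. If you need captures or a substituted string, the child would have to send them back over a pipe.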


    Dave

        Done. (I added a mention to the bottom of my response to the SO question.)


        Dave

Re: Stop runaway regex # alarm , /g , (?{code})
by LanX (Saint) on May 29, 2014 at 15:37 UTC
    > is there some form of timer that I can use to terminate a regex that takes longer than X seconds to complete

    No! But you might want to experiment with alarm and stop the whole program with die².

    > I know it can be modified so it doesn't break, etc,

    while I don't like the way you are asking, I'll give you two general hints:

    • you might wanna restructure your regex with /g-modifier and pos such that it works within a loop. Then you are able to execute Perl code to measure the runtime.

    • as a (more experimental) variation of the former: Regexes can embed perlcode with (?{code}) syntax, you may want to experiment with this and die after a timespan.¹
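
    The first hint (a /g loop with a runtime check between matches) might look something like this. It is only a sketch: the helper name count_matches_with_deadline is hypothetical, and it helps only when each individual match is quick, because the clock is checked between matches and never during one.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Hypothetical helper (not from the thread): count the matches of $re in $str,
# dying if the loop as a whole runs longer than $timeout seconds.
sub count_matches_with_deadline {
    my ($re, $str, $timeout) = @_;
    my $deadline = time + $timeout;
    my $count    = 0;

    # In scalar context, /g resumes at pos($str) on every iteration.
    while ($str =~ /$re/g) {
        $count++;
        die sprintf("stopped at pos %d\n", pos $str) if time > $deadline;
    }
    return $count;
}

my $n = eval { count_matches_with_deadline(qr/\d+/, '1 22 333', 5) };
print defined $n ? "found $n matches\n" : "gave up: $@";
```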

    Cheers Rolf

    ( addicted to the Perl Programming Language)

    ¹) I said die cause I've never seen any return-like exit from a regex ...which doesn't mean it's impossible² (?). ( But I am not interested to test it for you :)

    update

    ²) well you can always catch die from within eval { BLOCK }

      > But I am not interested to test it for you :)

      Well curiosity won!

      return causes an error and "won't stay shared" lexical variables are problematic too.

      Though this works and can be extended:

      use strict;
      use warnings;

      my $start;
      my $diff;
      my $timeout;

      sub tst {
          $timeout=shift;
          my $str = "a"x10000;
          $str .= "b";
          $start = time;

          $str =~ /^
                   (( a*
                      (?{
                          $diff= time-$start;
                          die "stopped after $diff sec" if $diff >=$timeout;
                      })
                   )*)*
                   $/x;
      }

      tst(10);

      output:

      Complex regular subexpression recursion limit (32766) exceeded at timeout_regex.pl line 15.
      stopped after 10 sec at (re_eval 1) line 3.

      Cheers Rolf

      ( addicted to the Perl Programming Language)

      update

      NB: this approach considerably slows down the regex when done in the innermost group-loop, outer loops OTOH might be checked too seldom. YMMV

        return causes an error and "won't stay shared" lexical variables are problematic too.

        Get a newer perl (the latest); this part has been upgraded.

        In the meantime, local our $diff; ...

      Well, while the docs look promising, alarm didn't interrupt regexes for me!

      Signals are executed only after the regex is completed. (5.10 / Linux)

      Found this thread "Timeout alarm for regex", so this always used to be problematic!

      (see also Deferred Signals (Safe Signals))

      No idea if it's still the case for newer Perl versions.

      Cheers Rolf

      ( addicted to the Perl Programming Language)

Re: Stop runaway regex
by mr_mischief (Monsignor) on May 29, 2014 at 19:48 UTC

    As LanX had guessed, non-deferred signals can do what you want. It's not that pretty, but it works.

    #!/usr/bin/perl
    use strict;
    use warnings;
    $|++;

    my $s = 0;

    use POSIX qw(SIGALRM);
    POSIX::sigaction(SIGALRM, POSIX::SigAction->new( sub { warn "skipping $s (took too long)\n"; die } ))
        || die "Error setting SIGALRM handler: $!\n";

    my $str = 'ffvadsvefwdvewrfvt4vketwrhjkbveqwkjhfkghjlfghjkufghjkfhjkfjkgfghfkhjfkhjgfhjgfhgfhkgfhkgfhkgfhkgfkhjgfkjgfkghjfkhjgfhjgfkhjgfhjkfk' x 40960;
    $str .= 'hjkbklklhbjklercvqewrqereqrfqeerv;;;jnrveervnlknrvlerlvnerlnvelrvnervlkenvlervojubnertvffff;kn;kff;kn;fk;k;;kmnff;knmf;nff;mnkf;;k;;' x 40960;
    my $str2 = $str x 8;
    my $str3 = 'furrfu';

    my $re = qr/(f((\w?)(\w*?))?)+/;

    print time . "\n\n";
    for ( $str, $str2, $str3 ) {
        $s++;
        my $res;
        alarm 2;
        eval {
            $res = $_ =~ s/$re/ ^_^ /g;
        };
        print "$s made $res\n" unless $@;
    }
    print "\n" . time . "\n";
    exit;

    This puts out something similar to the following, given the system is slow enough to exceed the two-second alarm on that second string but not on the first (or third -- gosh, let's hope not!).

    1401392795

    1 made 1310721
    skipping 2 (took too long)
    3 made 2

    1401392798

    This is perl 5, version 16, subversion 2 (v5.16.2) built for darwin-thread-multi-2level (with 3 registered patches, see perl -V for more detail)

    Update: Per davido's advice in the thread, I'll point out that the above is tested but not thoroughly so. One might hope that the die() and ending the eval would be enough unrolling of state that no segfaults or other wonkiness would happen when the regular expression engine is interrupted and reinvoked. If not, the forking model does make a lot of sense. One might also put code for handling one regex at a time into a separate script and hand off to that with something like IPC::Open3 which will handle parts of the child management and inter-process communication for you.

      FYI:

      Even w/o deferred signals I was able to create segfaults with alarm (5.10).

      Just by putting simple (?{code}) into the regex to catch the alarm-signal.

      Cheers Rolf

      ( addicted to the Perl Programming Language)

        I was unable to get a segfault testing the exact code I posted on either 5.10.1 on CentOS or 5.16.2 on OS X. I only tested it with the strings and regex listed in the code, though.

Re: Stop runaway regex
by DrHyde (Prior) on May 30, 2014 at 11:07 UTC
    There's an example in CPAN::ParseDistribution::Unix. It uses Parallel::ForkManager to run the potentially time-consuming code in a separate process, communicate its results back to the parent process, and time it out if it takes too long. This code only works reliably on Unix. If you can find a way of making it work on legacy platforms then I'd love to know.
Re: Stop runaway regex
by Anonymous Monk on May 29, 2014 at 15:35 UTC
    modifying it means I need to retest it on *all* the inputs.

    This should be trivial to do. One file full of input strings and one full of their expected results.

    while (my $input = <$testInputsFH>) {
        my $result = doBigRegexOn($input);
        my $expected = <$expectedResultFH>;
        warn "Test failed; got $result instead of $expected on line $.\n"
            unless $result eq $expected;
    }
Re: Stop runaway regex
by taint (Chaplain) on Jun 02, 2014 at 16:11 UTC
    Warning: what follows is purely conceptual.

    Perhaps I'm just too simple minded;
    But wouldn't it be possible to create a file as a pseudo pid file, then check its mtime or atime for use as a timer? Wrap the time-sensitive process in a for/while loop, checking on every pass how much time has gone by since the file was created. If that exceeds the limit, issue an exit?

    Using system (*NIX). Part of it might look like:

    my $arg1 = ('-XP . -type l -cmin');
    my $arg2 = ('+15');
    my $arg3 = ('xargs rm');
    system("/usr/bin/find $arg1 $arg2 | $arg3");

    Utilizing the above, one could even make the loop depend upon the existence of the "pseudo pid" file. No?

    Like I said; conceptual. But I used a similar approach for something else "time sensitive/related" (which is where the code above was taken from), and it works great.
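
    A rough sketch of the marker-file idea in plain Perl, assuming the work can be cut into short steps. The helper run_with_marker and its return values are hypothetical. Note that the check only happens between steps, so a single regex that runs away inside one step is not interrupted by this.

```perl
use strict;
use warnings;
use File::Temp ();

# Hypothetical helper (not from the thread): call $step->() repeatedly until it
# returns false, the marker file is older than $limit seconds, or the marker
# file disappears. Returns 'done', 'timeout', or 'cancelled'.
sub run_with_marker {
    my ($limit, $step) = @_;

    my $marker = File::Temp->new;     # the "pseudo pid" file; deleted on scope exit
    my $path   = $marker->filename;

    while (1) {
        my @st = stat($path);
        return 'cancelled' unless @st;                 # file removed: stop looping
        return 'timeout' if time - $st[9] >= $limit;   # $st[9] is the mtime
        return 'done' unless $step->();
    }
}

my @queue  = ('foo1', 'bar22', 'baz');
my $status = run_with_marker(10, sub {
    my $item = shift @queue;
    return 0 unless defined $item;
    print "$item has digits\n" if $item =~ /\d/;
    return 1;
});
print "status: $status\n";
```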

    --Chris

    ”λɐp ʇɑəɹ⅁ ɐ əʌɐɥ puɐ ʻꜱdləɥ ꜱᴉɥʇ ədoH
