http://qs321.pair.com?node_id=11109587

Random_Walk has asked for the wisdom of the Perl Monks concerning the following question:

Greetings wise and ancient monks of the order of Perl

When a regex is run it may take a few steps to match a string, it may back track a little, or it may enter the world of catastrophic backtracking.

We have an application where users can create regex patterns to run against a logfile Sometimes the user created patterns stray into the zone of catastrophe, and the servers can spend more CPU trying to match patterns in the logfile than doing real work.

I would like a way to pre-validate user's regex against a few hundred sample lines, and ensure that they never enter the world of ridiculous step count. Is there a way to get the regex engine to tell me how many steps it took?

Cheers,
R.

Pereant, qui ante nos nostra dixerunt!

Replies are listed 'Best First'.
Re: regex step counting
by LanX (Saint) on Dec 03, 2019 at 13:25 UTC
    Pragmatic approach: you could use re qw/debug/ and count the lines (or do more intelligent grepping)

    see re

    EDIT: But be warned that executing unfiltered regexes from user input is as dangerous as executing plain Perl code.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery FootballPerl is like chess, only without the dice

    UPDATE

    DB<3> use re "debug"; print 'abcdc' =~ /(.).*\1/ Compiling REx "(.).*\1" Final program: 1: OPEN1 (3) 3: REG_ANY (4) 4: CLOSE1 (6) 6: STAR (8) 7: REG_ANY (0) 8: REF1 (10) 10: END (0) minlen 1 Matching REx "(.).*\1" against "abcdc" 0 <> <abcdc> | 0| 1:OPEN1(3) 0 <> <abcdc> | 0| 3:REG_ANY(4) 1 <a> <bcdc> | 0| 4:CLOSE1(6) 1 <a> <bcdc> | 0| 6:STAR(8) | 0| REG_ANY can match 4 times out of 2 +147483647... 5 <abcdc> <> | 1| 8:REF1: "a"(10) | 1| failed... 4 <abcd> <c> | 1| 8:REF1: "a"(10) | 1| failed... 3 <abc> <dc> | 1| 8:REF1: "a"(10) | 1| failed... 2 <ab> <cdc> | 1| 8:REF1: "a"(10) | 1| failed... 1 <a> <bcdc> | 1| 8:REF1: "a"(10) | 1| failed... | 0| failed... 1 <a> <bcdc> | 0| 1:OPEN1(3) 1 <a> <bcdc> | 0| 3:REG_ANY(4) 2 <ab> <cdc> | 0| 4:CLOSE1(6) 2 <ab> <cdc> | 0| 6:STAR(8) | 0| REG_ANY can match 3 times out of 2 +147483647... 5 <abcdc> <> | 1| 8:REF1: "b"(10) | 1| failed... 4 <abcd> <c> | 1| 8:REF1: "b"(10) | 1| failed... 3 <abc> <dc> | 1| 8:REF1: "b"(10) | 1| failed... 2 <ab> <cdc> | 1| 8:REF1: "b"(10) | 1| failed... | 0| failed... 2 <ab> <cdc> | 0| 1:OPEN1(3) 2 <ab> <cdc> | 0| 3:REG_ANY(4) 3 <abc> <dc> | 0| 4:CLOSE1(6) 3 <abc> <dc> | 0| 6:STAR(8) | 0| REG_ANY can match 2 times out of 2 +147483647... 5 <abcdc> <> | 1| 8:REF1: "c"(10) | 1| failed... 4 <abcd> <c> | 1| 8:REF1: "c"(10) 5 <abcdc> <> | 1| 10:END(0) Match successful! cFreeing REx: "(.).*\1" DB<4>

      Thanks, I like this answer.

      While talk of the halting problem further down is interesting, I have a real world situation now where users are putting out bad regex on production servers. The regex they have access ltoo is actualy not Perl, but a very limited subset of Java's regex. It has .* though and they manage to shoot themselves in the foot with abundant use of .* I don't really need 100% accurate count of how many steps they are taking, just an idea if they are in the region of up to a few hundred steps per line, or way beyond it. If they breach some arbitrary limit we will ask them to come to us for regex guidance. So counting the lines from a re debug will be close enough for my purpose.

      Cheers,
      R.

      Pereant, qui ante nos nostra dixerunt!
        > not Perl, but a very limited subset of Java's regex

        I was afraid you'd say this, different engines do different guessing and optimizations.

        Something very fast in Perl can be a nightmare elsewhere, and vice versa.

        > I don't really need 100% accurate count of how many steps they are taking, just an idea if they are in the region of up to a few hundred steps per line

        It really depends on how limited your grammar is.

        The point why I mentioned the halting problem is, that the proof shows a way to construct a pathological case for every approach.

        In other words, if your tests are limited it's possible to construct a troublesome regex which will pass them. (Provided the grammar is rich enough)°

        Hence no guaranties whatsoever!

        Good luck! :)

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        Wikisyntax for the Monastery FootballPerl is like chess, only without the dice

        °) actually it's pretty trivially done, if the regex has no length restriction.

      But be warned that executing unfiltered regexes from user input is as dangerous as executing plain Perl code.

      nonsense

      no re 'eval';

Re: regex step counting
by Fletch (Bishop) on Dec 03, 2019 at 13:56 UTC
Re: regex step counting (Halting problem)
by LanX (Saint) on Dec 03, 2019 at 14:29 UTC
    > and ensure that they never enter the world of ridiculous step count.

    Sidenote: Are you aware of the Halting problem?

    In computability theory, the halting problem is the problem of determining, from a description of an arbitrary computer program and an input, whether the program will finish running, or continue to run forever.

    Alan Turing proved in 1936 that a general algorithm to solve the halting problem for all possible program-input pairs cannot exist.

    update

    though there is an ongoing discussion if Perl RegExes (without embedded (?{Perl code}) ) are Turing complete and which features should be stripped off to make them "uncomplete".

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery FootballPerl is like chess, only without the dice

      You don't need to solve the halting problem to enforce and/or check that a regex stops after a time limit or step limit.

      If you're launching the regex engine with user-supplied input, I'm not sure whether you can single-step through the regexp, but most likely you can do something like what Regexp::Debugger does and limit the amount of steps there.

      Alternatively, declare an upper time limit which you allow for user-supplied regular expressions.

      The problematic part of the halting problem only comes into play for stuff that runs very long but not infinitely long. For the OP, these two cases fall into the same category.

        I never said that he has to solve the halting problem.

        His approach is to test against a set of input strings.°

        I wouldn't be surprised if a finit input set can't cover all cases for arbitrary regexes.

        This also highly depends on the allowed RegEx grammar, like embedded Perl code (at the extreme).

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        Wikisyntax for the Monastery FootballPerl is like chess, only without the dice

        °) "a few hundred sample lines"