Beefy Boxes and Bandwidth Generously Provided by pair Networks
laziness, impatience, and hubris
 
PerlMonks  

Benchmarks aren't everything

by tilly (Archbishop)
on Oct 24, 2005 at 08:01 UTC ( [id://502408]=perlmeditation: print w/replies, xml ) Need Help??

While it is nice to win benchmarks, sometimes other things are more important. Such as not halting people's programs.

I was reminded of this recently when I ran across a benchmark showing that Perl's regex engine was a lot slower than Java's. The benchmark was at http://www.tbray.org/ongoing/When/200x/2004/08/22/PJre. I sent Tim Bray (the author) a note explaining why Perl's engine was slower and why this is a Good Thing™, and he said that if I wrote it up somewhere, then he would point to it.

This is that write up.

The short version of the story is that Perl has accepted a significant performance penalty to reduce the odds that programs will accidentally get "stuck" and never finish a match. Whenever people write benchmarks, they only test the common case and so see the significant performance penalty. They don't see the bug reports that the change avoids.

The slightly more detailed version is that starting with Perl 5.6, Perl's RE engine avoids many of the exponential slowdowns that Mastering Regular Expressions talks about, but at the cost of tracking a lot of extra data. This fixes a lot of hard to find bugs at the cost of losing performance benchmarks.

The long version follows.

A lot of what follows will be very familiar for people who understand Mastering Regular Expressions. After all one of the biggest reasons for writing that book was to help more people understand the problem that I'm about to describe, and understand how to avoid it.

Suppose we have a pattern, like /^(\s*foo\s*)*$/ and we want to try to match it to a string like "foo foo fo". How do we do that?

Well there turn out to be a couple of strategies, but the one which virtually everyone uses (read the book for what the other one is, and why people use this) can be described as follows: do a recursive search for ways to try to match the pattern to the string. That is, the sets of combinations of choices that one might make in trying to match the pattern to the string (should this .* include the next character?) form a tree that you can do a recursive search through, looking for a match.

An example will help show what I mean, so let's try to match the above RE to a string that it succeeds with, "foo foo". I'll underline the matched part at each step, and put parens to indicate what choices it has made:

# It is unambiguous how to match to here:
(foo foo
  # Let's try putting the space in the first group:
  (foo )foo
    # Proceeding on we get to this match:
    (foo )(foo)</code>
There is another possible match, but the above is the one we will find, as may be verified by looking in $1.

Now let's try an example that fails, namely "foo foo fo":

# It is unambiguous how to match to here:
(foo foo fo
  # Let's try putting the space in the first group:

  (foo )foo fo
    # It is unambiguous how to match to here:
    (foo )(foo fo
      # Let's try putting the space in the second group:
      (foo )(foo )fo
        # FAIL
        (foo )(foo )(fo
      # Perhaps don't put the space in the second group?
      (foo )(foo) fo
        # FAIL
        (foo )(foo )(fo
  # Perhaps don't put the space in the first group?
  (foo) foo fo
    # It is unambiguous how to match to here:
    (foo)( foo fo
      # Let's try putting the space in the second group:
      (foo)( foo )fo
        # FAIL
        (foo)( foo )(fo
      # Perhaps don't put the space in the second group?
      (foo)( foo) fo
        # FAIL
        (foo)( foo)( fo
# Oops, all options failed, it doesn't match.
A careful observer may note a couple of things. First of all this clearly works, the algorithm is not very hard to follow and clearly must find an answer. Secondly it seems to take a lot less work to find a match than it does to prove that there is no match.

A very clever observer might notice that the first observation is wrong, and the second is worse than it seems in this example.

How many times do you reach the end and FAIL if you try to match "foo foo foo fo"? Well the brute force solution will try every combination of ways to put the first space in the first group or not, the second in the second group or not, and the third in the third group or not. So we go from 4 to 8.

If you try to match "foo foo foo foo fo" then you add one more choice, so you go from 8 to 16.

If you try to match the string ("foo " x 100) . "fo" then it will FAIL 2**100 times, which is longer than the expected lifetime of our planet and definitely longer than the expected lifetime of your computer, let alone your program.

Which is what the really clever observer noticed. That failing takes more work, sometimes so much so that it can't actually finish!

Now in some sense this is unavoidable. As http://perl.plover.com/NPC/ notes, figuring out when a Perl regular expression matches a string is an NP-hard problem. Therefore there must be exponential failure modes. Life sucks. Sorry.

But what Ilya Zakharevich realized is that while it is true that there must be exponential slowdowns, it is not true that we must make them be so easy to hit. Very few actual bug reports use backreferences, or any other feature that requires matches to be slow. So why can't they be sped up?

Now to be fair, he was not the first with this revelation. Most decent RE engines have logic in place to avoid some common instances of this problem. But his solution was novel.

What is the problem we are trying to solve? The problem is that we wind up revisiting sets of choices which can't possibly make a difference about where we match. For instance in the example above, who cares which grouping each space goes in? It obviously won't affect whether you match. What we would like is some way to say, "I've tried this reasoning before, and it didn't work that time so don't try it again."

Ilya's answer for how to do this was quite simple (though as that link proves, the implementation was not quite so simple nor was the initial try bug free). His idea is that you could track what combinations (position in REx, position in string) you had visited. If you encounter a combination that you have been to before, then you'll get the answer that you got before, which must have been failure because you're here again! OK, you can't always do this, there are constructs (eg backreferences) that ruin your reasoning (because what you put into a grouping might affect whether you match), but this certainly covers most common REs.

Let's try this optimized version on the failure case above. Here is what happens:

# It is unambiguous how to match to here:
(foo foo fo
  # Let's try putting the space in the first group:
  (foo )foo fo
    # It is unambiguous how to match to here:
    (foo )(foo fo
      # Let's try putting the space in the second group:
      (foo )(foo )fo
        # FAIL
        (foo )(foo )(fo
      # Perhaps don't put the space in the second group?
      (foo )(foo) fo
        # FAIL (been here before)
        (foo )(foo )(fo
  # Perhaps don't put the space in the first group?
  (foo) foo fo
    # FAIL (been here before)
    (foo)( foo fo
# Oops, all options failed, it doesn't match.
The amount of reasoning just went down, and we only have to FAIL 3 times, not 4. For the string "foo foo foo fo" we FAIL 4 times instead of 8. And "foo " x 100) . "fo" will fail 101 instead of 2**100 times, so will actually finish!

The cost, unfortunately, is that we need to have a vector of bits in which we track what combinations we've been to. Just allocating that space and flipping bits is significant overhead on the success case. So we get rid of bug reports but lose benchmarks.

UPDATE: blakem is correct, I was missing a * in my regex. That's what I get for posting after midnight after a long day. Fixed.

Replies are listed 'Best First'.
Re: Benchmarks aren't everything
by blakem (Monsignor) on Oct 24, 2005 at 08:45 UTC
    I think you're missing a char in your example regex... shouldn't the regex look something like: /^(\s*foo\s*)*$/

    -Blake

      Thanks, I was. Fixed.
Re: Benchmarks aren't everything
by demerphq (Chancellor) on Oct 24, 2005 at 17:51 UTC

    Everytime I've looked at that regex ive thought the same thing: what a strange regex to use for a comparison benchmark.

    One optimisation that is missing from pre 5.9 Perl is that there is nothing to convert /x|y|z/ into the far more efficient /[xyz]/.* If javas regex engine does contain a compile time optimisation of this sort its almost guaranteed to run faster than released version of Perl. On 5.9 and later it would be handled somewhat better as a TRIE which would be much faster than before, but still not as efficient as if it had been a char class. (Any C programmers out there interested in a neat hack, converting TRIE regops with only one row in their table to proper char class nodes would be cool).

    This factor coupled with the fact that he doesnt actually show what strings he matched against make me somewhat dubious of the value of his assertions.

    Without looking at the optimisations built into the Java regex engine (which no doubt they do not release) its hard to say why Perl would be slower than Java. I personally wouldnt necessarily assume that it was the backtracking logic that was at fault without a better understanding of the test cases that he used for his timings.

    * Update: IOW its probably not fair to penalize a system for performing one task inefficiently when there is an equivelent construct that is efficient, unless in the context of an overall comparison.

    ---
    $world=~s/war/peace/g

      As a follow up here is a benchmark of the regex as posted against the equivelent regex that uses classes and not alternation, note the drammatic difference in run time. ($rex1 is the regex without alternations, $rex2 is the regex as posted, with newlines inserted to make it moderately easier to read)

      which has the following output:

      Benchmark: running rex1, rex2 for at least 1 CPU seconds... rex1: 1 wallclock secs ( 1.16 usr + 0.00 sys = 1.16 CPU) @ 51 +.04/s (n=59) rex2: 1 wallclock secs ( 1.08 usr + 0.00 sys = 1.08 CPU) @ 12 +.99/s (n=14) Rate rex2 rex1 rex2 13.0/s -- -74% rex1 50.6/s 290% -- 6291 6291

      So by removing the alternation we see a 290% speedup. What was the run time difference in the benchmark being referenced? Well apparently Java took 244 seconds and Perl took 527, which afaict (my maths arent so hot) is %115 faster. Now perhaps Java also would benefit from this modest optimisation of his code, but perhaps not, perhaps the Java implementation does this internally and in fact would not see an improvement. Either way the case does not seem to be proved at all.

      BTW, for my benchmarks i just used the source for a page of Perlmonks. Use whatever XML/HTML you like.

      ---
      $world=~s/war/peace/g

Re: Benchmarks aren't everything
by perrin (Chancellor) on Oct 24, 2005 at 14:27 UTC
    I wrote something about this benchmark too, at the time.
      All of your comments are valid as well. However, modulo the missing Unicode support, Perl 5.005_03 would have performed similarly to, or better than, Java.

      I find it interesting that cl-ppcre was suggested in Dom2's response. I know for a fact that they were also proud of beating Perl by around a factor of 2, but when I checked they did not handle this catastrophic case. I pointed out to the main author that they did not, and that Perl was about twice as fast until it did. There was some back and forth about possible solutions, and I don't know whether anything was ever implemented on their end. (I suspect not.)

Re: Benchmarks aren't everything
by Ovid (Cardinal) on Oct 24, 2005 at 18:05 UTC

    It seems Tim Bray made a classic benchmarking mistake. He compared dissimilar things and thinks the result is meaningful. On the other hand, it would be very difficult for most observers to have any idea of how Perl and Java regular expressions are really implemented, so I can hardly fault him for this. I likely would have come to the same conclusion despite my knowing both languages.

    Cheers,
    Ovid

    New address of my CGI Course.

      Im not so convinced that its not appropriate to compare the two. I just think its inappropriate to do such a limited comparison, without showing your data, and then derive such grandiose claims from it.

      Id like to a see a comprehensive comparison of the various regex engines out there. One that includes all the features that all of the engines can handle, so you tell which engine does what.

      ---
      $world=~s/war/peace/g

Re: Benchmarks aren't everything
by zby (Vicar) on Oct 24, 2005 at 12:07 UTC
      Thinking about it further, I think it is possible to implement this solution with less overhead. If your goal is merely to prevent disasters, then the right optimization is one which is fast in the common case and makes the bad case run, not one which tries to make the bad case fast as well.

      An example of how one might do that is have a flag in every RE node saying whether we are tracking positions in the string that we've been to. If we are, then have a linked list that you scan to find whether your current position has been reached yet. If you're not, then do nothing. Then insert code that once every large number of steps sets this bit in various nodes. (For instance the check might happen when you fail back to a second alternative - do that 10,000 times and then set the bit on your current node.)

      On fast regular expressions all you do is check a flag. On slow ones, you run a lot slower (a linked list is a lot of overhead), but at least you still run.

      Yes, but the implementation is rather more efficient than a naive implementation of dynamic programming would be likely to be.
Re: Benchmarks aren't everything
by pg (Canon) on Oct 24, 2005 at 16:46 UTC

    I agree with what you have said within its own context. It is more important to make the stuff correct than to make it faster. No problem here.

    But your write up cannot be used against Tim Bray's, as you did nothing to compare Perl's regexp against Java's as the original author did. What you did was to compare Perl's with Perl's, the Perl one that was faster but not quite right with another Perl one that is slower but right.

    What if the Java one is not only faster but also correct? This is the key question you need to answer. If both are right, faster is obviously better.

      Given the history of regular expression implementations, it would be very surprising if Java actually did handle this correctly. Until Perl implemented a solution, the best approach that I'd heard of anyone using for this problem was to first do a DFA match to see if it matched at all, then an NFA match to figure out what the match was.

      I don't have a Java compiler handy. But, for instance, this was not handled in Perl 5.005_03, and as of the last time that I checked, was not handled in Ruby, pcre (which is widely used in other applications), or pp-ccre.

      If someone wishes to run a test Java program to see whether it has this optimization, I'd be interested to know the result.

      I tried matching "^(\\s*foo\\s*)*$" against 39 "foo "s followed by a "fo" and java.util.regex.Pattern.matches does indeed go off into recursive space never to re-emerge.

      Personally I think it would be acceptable for Java to throw an exception after some bounded amount of effort in matching. Especially given that Perl's new algorithm can't catch all the pathological cases.

Re: Benchmarks aren't everything
by schweini (Friar) on Oct 24, 2005 at 20:19 UTC
    Impressive write-up!

    But it got me wondering whether there is any way to disable all that funky checking the perl regexp engine does?
    In most cases, I use rather basic regexps, so I think it would be great if there'd be an option like:
    local $regexp_checking = undef
    or something like that.
    Additionaly, is there any way to use different regexp-engines? I seem to recall that in some perldoc (the Camel book, maybe?) it mentioned something about the possibility to roll one's own regexp engine, but I never noticed anyone actually doing that.
    Anyhow, wouldn't it be nice if there was a way do easily choose between different engines?
      Anyhow, wouldn't it be nice if there was a way do easily choose between different engines?
      Yeah, but since there isn't an alternative engine available, and I've never seen anyone attempting to write an alternative regexp engine for Perl, it doesn't make sense to put effort into making switching regexp engines easy.

      Write an alternative engine, and I'll bet someone will make a pragma to switch between them. That happened with 'sort' as well. After making merge sort available for Perl, the sort pragma was created.

      Perl --((8:>*

        Write an alternative engine, and I'll bet someone will make a pragma to switch between them.

        Actually afaiui the latter issue is the problem. We already have a framework for "swapping out" the normal regex engine via the use re 'debug'; facility. The problem is that the framework more or less expects the swapped in engine ot behave just the same as the one swapped out. IE, it must set the same globals, and behave in the same quirky way. If this problem could be cleaned up providing an alternative engine would probably be feasable.

        ---
        $world=~s/war/peace/g

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlmeditation [id://502408]
Approved by Corion
Front-paged by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others surveying the Monastery: (4)
As of 2024-03-28 13:24 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found