http://qs321.pair.com?node_id=382673


in reply to Re: The Deceiver
in thread Why does a Perl 5.6 regex run a lot slower on Perl 5.8?

Thank you for your quick answer, and thanks again for taking the time to explain. Although I agree that the two regular expressions have different meanings, the real question here is why Perl 5.6.1 is 500-1,000 times faster than Perl 5.8.0 on the same regular expression -- this is my real query. Am I to assume that Perl 5.6.1 did not properly parse certain regular expressions and Perl 5.8.0 now does? I just tried your regular expressions and they yielded the same results under both versions. How unstable is my previous code, if new versions can make it obsolete in performance, as if encouraging not to upgrade.

Replies are listed 'Best First'.
Re^3: The Deceiver
by itub (Priest) on Aug 13, 2004 at 14:41 UTC
    #reg.pl $s = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxyyyRRRRyyyy\n" x 500; $n = 0; $n++ while ($s =~ /(.*?)RRRR\1/sg); print "$n matches\n";
    
    time ~/bin/perl5.8.0 reg.pl 
    500 matches
    
    real    0m4.836s
    user    0m4.800s
    sys     0m0.010s
    
    time ~/bin/perl5.6.1 reg.pl 
    0 matches
    
    real    0m0.020s
    user    0m0.020s
    sys     0m0.000s
    
    So, in fact, you are complaining that a bug got fixed. The problem is that these are extremely inefficient regular expressions because they involve a lot of backtracking. I recommend reading Mastering Regular Expressions for a detailed explanation.
      That's very good to hear. What's not good is that code that relied on nothing like backreferencing regexps got squashed in the upgrade process. Like japhy guessed, the regexp failed to match, but wouldn't it make sense even for a /(.*)TEXT\1/ regexp to first look for /TEXT/ and then worry about getting the appropriate group match (be it greedy or reluctant)? This slowing down is a terrible shock some people might get (including me) when moving old code to new code. But on another note, I do agree I'm a long way from mastering regular expressions.
        In a perfect world, code should be correct and fast. Perl's code used to be wrong and fast. Now it is correct and slow. Correct and slow is generally better than wrong and fast. In time it is likely to speed up again and we'll all be happy.

        Unfortunately you began with the rude shock of seeing an amazing slowdown. Therefore while in other circumstances you might agree that you want the right answer, anything below the speed which you were accustomed to is bad.

        On the specific optimization that you offer, you're right and wrong. You're right that you can optimize that one regular expression that way and it would be good for that regular expression. But it wouldn't speed up the one that you did want to run. Furthermore adding a check for that special case would slow down the compilation of every other regular expression out there (including the one that you wanted to run). Furthermore you've just added a code path that has potential bugs which might not get caught.

        This is not to say that you never want to speed up special cases - of course you do and the regular expression engine has a lot of special tricks. But you have to balance out what is sped up by any one trick against how it slows other people down and causes opportunities for bugs to lurk.

        That said, I'd like to point out why the optimization that you point out would not solve your problem. It would tell how to solve a particular expression that you weren't running. The one that you tried to run is different enough that the optimization would probably not run. What you actually would have benefited from is an optimization that says, "Check that there are no backreferences within the RE, then turn on the old special case optimizations." Which might or might not work out to be worthwhile. (And I do not wonder that japhy just chose to turn the optimization off rather than put a test that is that complicated in.)

        If I remove the backreference by changing the regex in my example to $n++ while ($s =~ /(.*?)RRRR/sg);, I get the following:
        time ~/bin/perl5.8.0 reg.pl
        500 matches
        
        real    0m0.018s
        user    0m0.010s
        sys     0m0.010s
        
        time ~/bin/perl5.6.1 reg.pl
        1 matches
        
        real    0m0.015s
        user    0m0.010s
        sys     0m0.000s
        

        So at least in this case Perl 5.8.0 doesn't have a speed problem. I don't know exactly what's going on in your code though.

Re^3: The Deceiver
by japhy (Canon) on Aug 13, 2004 at 14:08 UTC
    I'm not entirely sure why the regexes were so much slower, unless they just never could actually match. In that circumstance, /.*FAIL/ would be a lot slower than /^.*FAIL/.
    _____________________________________________________
    Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
    How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart

      And since perldeveloper removes the "whatever"s from the string the regexp will fail in the last iteration. So I would not be surprised if most of the wasted time was in the last iteration :-)

      Jenda
      Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live.
         -- Rick Osborne