PerlMonks  

Why does a Perl 5.6 regex run a lot slower on Perl 5.8?

by perldeveloper (Scribe)
on Aug 13, 2004 at 12:59 UTC ( [id://382646]=perlmeditation )

I'll cut to the chase: the same Perl code runs a lot slower under Perl 5.8.0 (and Perl 5.8.5). What does "a lot" mean? In the case I'm presenting here, it means about five hundred times slower. Since I cannot believe that any build options can make Perl run 500 times slower or faster, one of the following must hold:
  • I'm using non-standard code (like attributes, prototypes etc.)
  • Perl 5.8.x is not backward compatible, at least when it comes to running the same code "about as fast as the previous version does"

Since my code resembles Chapter 3 of a Perl textbook (regular expressions and reading/writing files), the second one must be true: Perl 5.8.x is not backward compatible. I would expect that for obscure or long-deprecated functionality, but not for regular expressions. Regular expressions are the main reason I chose Perl; if that breaks down, I might as well forget about Perl altogether and stick to Java and the ubiquitous Python (which is already the preferred choice over Perl in web development).

Out of decency towards the Perl community, I feel obliged to spend some time before jumping to conclusions, and to examine my tests on three versions of Perl: 5.6.1, 5.8.0 (shipped with RedHat 9), and 5.8.5, a very light hand-made build, compiled for performance with no extra specialized functionality. However, I have neither the time nor the resources to run the test on another operating system. RedHat 9 is nonetheless a standard Linux distribution, and this outrageous behavior is most probably common to many others, if not all.

The following tables outline debugging information obtained running perl -d:DProf and then dprofpp tmon.out.

Perl 5.6.1
Total Elapsed Time = 0.080048 Seconds
  User+System Time = 0.080048 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 87.4   0.070  0.070     40   0.0018 0.0018  main::extract
 12.4   0.010  0.010      1   0.0100 0.0100  warnings::BEGIN
 0.00   0.000  0.010      2   0.0000 0.0050  main::BEGIN
 0.00   0.000  0.000      1   0.0000 0.0000  warnings::import
 0.00   0.000  0.000      1   0.0000 0.0000  strict::import
 0.00   0.000  0.000      1   0.0000 0.0000  strict::bits
 0.00   0.000  0.000      1   0.0000 0.0000  Exporter::import
 0.00   0.000  0.000      1   0.0000 0.0000  warnings::bits

Perl 5.8.0
Total Elapsed Time = 123.5199 Seconds
  User+System Time = 39.62993 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 97.1   38.49 38.520     40   0.9622 0.9630  main::extract
 0.05   0.020  0.020      1   0.0200 0.0200  utf8::SWASHNEW
 0.03   0.010  0.010      1   0.0100 0.0100  utf8::AUTOLOAD
 0.00       - -0.000      1        -      -  utf8::SWASHGET
 0.00       - -0.000      1        -      -  Exporter::import
 0.00       - -0.000      1        -      -  warnings::unimport
 0.00       - -0.000      2        -      -  warnings::import
 0.00       - -0.000      1        -      -  warnings::BEGIN
 0.00       - -0.000      2        -      -  strict::unimport
 0.00       - -0.000      4        -      -  strict::bits
 0.00       - -0.000      2        -      -  strict::import
 0.00       - -0.000      3        -      -  main::BEGIN
 0.00       - -0.000      5        -      -  utf8::BEGIN

Perl 5.8.5
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 98.4   0.630  0.630     40   0.0157 0.0157  main::extract
 1.56   0.010  0.010      1   0.0100 0.0100  warnings::BEGIN
 0.00       - -0.000      1        -      -  warnings::import
 0.00       - -0.000      1        -      -  strict::import
 0.00       - -0.000      1        -      -  strict::bits
 0.00       -  0.010      2        - 0.0050  main::BEGIN

The main::extract subroutine takes about 9 times longer under Perl 5.8.5, and about 550 times longer under Perl 5.8.0, compared to Perl 5.6.1. The program as a whole took 1,543 times longer to finish under Perl 5.8.0 than it did under Perl 5.6.1. You may be wondering what the Perl program is:

use strict;
use warnings;

open (FILE, "a.txt");
my $text = "";
while (<FILE>) {
    $text .= $_;
}
close (FILE);

while (my ($one, $two) = extract ($text)) {
    $text = $one . $two;
}

sub extract {
    my ($text) = @_;
    if ($text =~ /(.*?)whatever(.*)/is) {
        return ($1, $2);
    }
    return ();
}

As you can see, this code slurps a file and removes all occurrences of a certain word (`whatever'). If you're wondering why Perl 5.8.0 took 2 minutes, it's not because the file was large: its size was exactly 11,221 bytes (about 11 KB).

When the /.*? regular expression is changed to /^.*? (an explicit version of the same regexp), and instead of a 10,000 byte file, a 5,000,000 byte file is used, here are the debugging results for the main::extract subroutine:

Perl 5.6.1
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 88.1   0.670  0.670      1   0.6700 0.6700  main::extract

Perl 5.8.0
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 95.0   2.490  2.510      1   2.4900 2.5100  main::extract

Perl 5.8.5
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 19.5   0.080  0.080      1   0.0800 0.0800  main::extract

It's obvious that the little hat (^) balanced out the differences between the three releases (although 3.7 times longer with Perl 5.8.0 is reason enough NOT to upgrade). Perl 5.8.5, in its current build, was faster than Perl 5.6.1. The remaining differences come from the different versions and different build parameters. To be more exact, here are the configuration summaries for the three releases:

Perl 5.6.1 Configuration Summary
usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
useperlio=undef d_sfio=undef uselargefiles=undef usesocks=undef
use64bitint=undef use64bitall=undef uselongdouble=undef

Perl 5.8.0 Configuration Summary
usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
useperlio= d_sfio=undef uselargefiles=define usesocks=undef
use64bitint=undef use64bitall=undef uselongdouble=undef

Perl 5.8.5 Configuration Summary
usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
useperlio=undef d_sfio=undef uselargefiles=undef usesocks=undef
use64bitint=undef use64bitall=undef uselongdouble=undef
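
The same flags can also be read from inside a script, which makes it easier to compare builds side by side. This is only a minimal sketch, assuming the core Config module; the @flags list and the %report hash are names chosen here for illustration and are not part of the original test.

```perl
use strict;
use warnings;
use Config;    # core module exposing the build-time configuration

# Flags taken from the summaries above; the selection is illustrative.
my @flags = qw(usethreads useithreads usemultiplicity useperlio uselargefiles);

my %report;
for my $flag (@flags) {
    # Unset options come back as undef, so show them the way perl -V does.
    my $value = $Config{$flag};
    $report{$flag} = defined $value ? $value : 'undef';
}

printf "perl %vd\n", $^V;
printf "%-16s = %s\n", $_, $report{$_} for @flags;
```

Running this under each interpreter in turn gives one short listing per build to diff.
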
The conclusion is that all regular expressions written like this:
$text =~ /(.*?)<whatever>/
take a thousand times longer on 5.8.0. The same expressions written as
$text =~ /^(.*?)<whatever>/
which obviously mean the same thing (look for the first occurrence of <whatever> and save the text preceding it in the corresponding variables) have the same performance implications across these two versions.

In my honest opinion, this is not an issue of bad code and good code; it is an issue of good Perl and bad Perl. I discovered this strange behavior using only standard regular expressions while moving from 5.6 to 5.8, which are consecutive versions. If the changes are so dramatic when upgrading to the next version, what is one to expect of Perl in other respects?


I can tell you one thing: if IBM had written Perl, this would have never happened. Maybe there aren't enough alpha and beta testers, maybe developers don't have the time to write enough warning messages. What's certain is that Perl is not seen as a product, and the members of the community it attempts to serve are not being looked upon as customers. And that's the very difference between open source and closed source software. What good is it being free, if it is deceiving its users about the problems it claims to solve?

Replies are listed 'Best First'.
Re: The Deceiver
by japhy (Canon) on Aug 13, 2004 at 13:16 UTC
    I am the person to blame. I made the change in the regex engine that is causing the problem you're facing. Let me explain:
    The conclusion is that all regular expressions written like this: $text =~ /(.*?)<whatever>/ take a thousand times longer on 5.8.0. The same expressions written as $text =~ /^(.*?)<whatever>/ which obviously mean the same thing (look for the first occurrence of <whatever> and save the text preceding it in the corresponding variables) have the same performance implications across these two versions.
    Sadly, that is not true, and that is exactly what I had to change in the source of perl. You say that /(.*)X/ and /^(.*)X/ mean the same thing, but that is a half-truth. Consider this case: "xxyyyRyyy" =~ /(.*)R\1/. If, as you state, the leading ^ is implied, the regex fails, because "xxyyy" cannot be found after the "R" as my regex requires. Only by not anchoring that regex can it ever match ($1 is "yyy").
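
The backreference case can be checked directly. The following is a small sketch rather than anything from the perl source; the variable names $free and $anchored are made up for the illustration.

```perl
use strict;
use warnings;

my $s = "xxyyyRyyy";

# Unanchored: the engine may start the match at any offset, so it can
# skip the leading "xx" and capture "yyy" before the R.
my ($free) = $s =~ /(.*)R\1/;

# Anchored: the capture has to start at offset 0, so it would be "xxyyy",
# and "xxyyy" does not follow the R. The match fails.
my ($anchored) = $s =~ /^(.*)R\1/;

print "unanchored: ", (defined $free     ? "'$free'"     : "no match"), "\n";
print "anchored:   ", (defined $anchored ? "'$anchored'" : "no match"), "\n";
```

The unanchored pattern captures "yyy", while the anchored one reports no match, which is why the engine cannot silently add the ^ whenever a backreference might be involved.
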

    There is no "easy" way to fix this problem in the source of perl; you have to explicitly state the anchor yourself. The reason is that perl has no way of knowing whether or not you'll end up using what you captured as a backreference, so anchoring has an unknown effect. The problem is not only when the .* is captured, either; any capturing in the regex causes a problem.

    (The case of "abc\ndef1" =~ /.*\d/ is already handled by the engine so as not to fail. It would fail if the regex were treated as /^.*\d/, but the engine makes it (?m:^) if necessary.)
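
A sketch of the newline case, with capturing parentheses added only so the matched text can be inspected (the original patterns have none); the variable names are illustrative.

```perl
use strict;
use warnings;

my $s = "abc\ndef1";

# Without /s, "." cannot cross the newline, so a successful match has to
# start on the second line.
my ($plain) = $s =~ /(.*\d)/;

# A plain ^ only matches at offset 0, and "abc" is not followed by a digit.
my ($start) = $s =~ /^(.*\d)/;

# A multi-line ^ also matches just after the newline.
my ($mline) = $s =~ /(?m:^)(.*\d)/;

print "plain:   ", (defined $plain ? $plain : "no match"), "\n";
print "start:   ", (defined $start ? $start : "no match"), "\n";
print "(?m:^):  ", (defined $mline ? $mline : "no match"), "\n";
```

Both the unanchored pattern and the (?m:^) form find "def1"; only the plain ^ version fails.
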

    _____________________________________________________
    Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
    How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart
      Thank you for your quick answer, and thanks again for taking the time to explain. Although I agree that the two regular expressions have different meanings, the real question here is why Perl 5.6.1 is 500-1,000 times faster than Perl 5.8.0 on the same regular expression -- this is my real query. Am I to assume that Perl 5.6.1 did not properly parse certain regular expressions and Perl 5.8.0 now does? I just tried your regular expressions and they yielded the same results under both versions. How unstable is my previous code, if new versions can make it obsolete in performance, as if encouraging people not to upgrade?
        #reg.pl
        $s = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxyyyRRRRyyyy\n" x 500;
        $n = 0;
        $n++ while ($s =~ /(.*?)RRRR\1/sg);
        print "$n matches\n";
        
        time ~/bin/perl5.8.0 reg.pl 
        500 matches
        
        real    0m4.836s
        user    0m4.800s
        sys     0m0.010s
        
        time ~/bin/perl5.6.1 reg.pl 
        0 matches
        
        real    0m0.020s
        user    0m0.020s
        sys     0m0.000s
        
        So, in fact, you are complaining that a bug got fixed. The problem is that these are extremely inefficient regular expressions because they involve a lot of backtracking. I recommend reading Mastering Regular Expressions for a detailed explanation.
        I'm not entirely sure why the regexes were so much slower, unless they just never could actually match. In that circumstance, /.*FAIL/ would be a lot slower than /^.*FAIL/.
        _____________________________________________________
        Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
        How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart
Re: The Deceiver
by diotalevi (Canon) on Aug 13, 2004 at 13:33 UTC

    OK, so japhy told you why that is slow. Here's a way to make your code fast regardless: don't even bother with the capturing.

    Capturing is always slow because it has to make a copy of the source string. $1, internally, is just substr( $safe_copy_of_match, $-[1], $+[1] - $-[1] ). So the largest speed hit (that I'm aware of) is the memory operation of making a safe duplicate of the data that was just matched. COW (copy on write) may mitigate this if/when it ever gets into perl.
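
The relationship between $1 and the offset arrays can be seen with a short sketch. It only checks that the offsets in @- and @+ reproduce the captures; it says nothing about how perl copies the string internally. The sample text and the names $one and $two are made up here.

```perl
use strict;
use warnings;

my $text = "before WHATEVER after";
my ($one, $two, $same);

if ($text =~ /(.*?)whatever(.*)/is) {
    # $-[N] and $+[N] hold the start and end offsets of capture group N.
    $one  = substr($text, $-[1], $+[1] - $-[1]);
    $two  = substr($text, $-[2], $+[2] - $-[2]);
    $same = ($one eq $1 && $two eq $2) ? 1 : 0;
}

print "one='$one' two='$two' same=$same\n";
```
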

    Likely to be fastest. This was my second thought.

    # assumes $whatever is already lowercase
    my $whatever_index = index lc $text, $whatever;
    return if $whatever_index < 0;    # not found
    return( substr( $text, 0, $whatever_index ),
            substr( $text, $whatever_index + length $whatever ) );

    This may be the fastest. It was my third thought.

    my $whatever_index = index lc $text, $whatever;
    my $whatever_length = length $whatever;
    return unpack "a" . $whatever_index . "x" . $whatever_length . "a*", $text;

    This was my first thought. Use a plain regex to *locate* the thing in the string and then just substr() the equivalent of the captures out. This happens to be simplest to look at so it wins on the visual-complexity scale. This is a great general technique to avoid capturing on regexes and as such is a great post-benchmarking optimization.

    if ( $text =~ /whatever/i ) {
        return( substr( $text, 0, $-[0] ), substr( $text, $+[0] ) );
    }
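
Since the ranking above is a guess, one way to check it on a given build is the core Benchmark module. The sketch below is an assumption-laden test rig, not a measurement from this thread: the sub names, the 10 KB sample string and the iteration count are all invented for the illustration, and the unpack variant is left out for brevity.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical input: about 10 KB of filler with the target word in the middle.
my $whatever = 'whatever';
my $text     = ('x' x 5_000) . 'WhatEver' . ('y' x 5_000);

# Capturing version, as in the original extract().
sub by_capture {
    return $text =~ /(.*?)whatever(.*)/is ? ($1, $2) : ();
}

# index/substr version; lc() makes the search case-insensitive.
sub by_index {
    my $i = index lc $text, $whatever;
    return if $i < 0;
    return (substr($text, 0, $i), substr($text, $i + length $whatever));
}

# Locate with a non-capturing regex, then slice using @- and @+.
sub by_offsets {
    return unless $text =~ /whatever/i;
    return (substr($text, 0, $-[0]), substr($text, $+[0]));
}

# All three must return the same pair before any timing is worth reading.
my @expected = by_capture();
for my $sub (\&by_index, \&by_offsets) {
    my @got = $sub->();
    die "results differ" unless "@got" eq "@expected";
}

# A fixed iteration count keeps the run short; the rates depend entirely
# on the perl build and the machine.
cmpthese(5_000, {
    capture => \&by_capture,
    index   => \&by_index,
    offsets => \&by_offsets,
});
```
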
Re: The Deceiver
by perrin (Chancellor) on Aug 13, 2004 at 17:02 UTC
    So quick to blame the Perl community...

    Perl 5.8.0 is slow on your system because Red Hat compiled it with threads and debugging turned on (which you didn't do in your 5.8.5 compile) and because they set the locale to use unicode and folded in a bunch of patches for unicode that were not in the official 5.8.0 release. This has been written about extensively. See the Red Hat bugzilla for more details. This was fixed in 5.8.1. The remaining slowdown of 3.7 is probably due to the regex change that japhy mentioned.

Slapping People For Help
by chromatic (Archbishop) on Aug 13, 2004 at 17:29 UTC
    I can tell you one thing: if IBM had written Perl, this would have never happened. Maybe there aren't enough alpha and beta testers, maybe developers don't have the time to write enough warning messages. What's certain is that Perl is not seen as a product, and the members of the community it attempts to serve are not being looked upon as customers. And that's the very difference between open source and closed source software. What good is it being free, if it is deceiving its users about the problems it claims to solve?

    Do you often find that insinuating that people are ignorant, malicious, sloppy, or stupid makes them likely to help you?

      Maybe I was trying to convey too many ideas and feelings out of context (on one hand I'm carefully explaining the problem and seeking professional advice, on the other hand I'm criticizing the ones responsible). In the right context, anything can be made to sound the way it was intended, and I do apologize if this particular bit sounded too harsh.

        I understand the frustration. It's sometimes difficult to remember that dozens of people have put thousands of hours into a project given away freely for other people to use when you find an apparent bug, but it's very wise to keep that in mind.

        Your description of the problem was very good, though.

      I think the point is that you don't need to worry about that if you're talking to a commercial vendor. It's a shift in attitude that anyone moving to F/OSS, hopefully, will get used to.

        I've talked to proprietary vendors before. Maybe some don't cause you to worry, but those I can think of did not inspire me with confidence.

        Barnraising your IT might be an interesting read.

        Make sure to read the comments on sentiments about commercial vendors and contracts.

        Makeshifts last the longest.

Re: The Deceiver
by TrekNoid (Pilgrim) on Aug 13, 2004 at 20:49 UTC
    I can tell you one thing: if IBM had written Perl, this would have never happened.

    Not sure I agree with *that*... As an old mainframe programmer, I can tell you that when COBOL went from COBOL to VS COBOL II back in the late 80s, we practically had to recompile our entire mainframe library.

    And it wasn't like it was obscure stuff... They did away with the EXAMINE statement, which was a staple of COBOL development.

    It did away with the ON statement... and would no longer accept LABEL RECORDS...

    Worst of all, the TRANSFORM statement vanished.

    They had (supposedly) good reasons for making those kinds of fundamental changes, but that didn't alter the fact that COBOL, arguably the de facto programming standard of the time, was fundamentally changed long after it was a mature product.

    So, don't be so sure that IBM wouldn't have done the same thing... they've done it before :)

    Trek

Re: The Deceiver
by sleepingsquirrel (Chaplain) on Aug 13, 2004 at 15:10 UTC
    I must be overlooking something, but why wouldn't this code work to remove all instances of "whatever"...
    open (FILE, "a.txt");
    $/=undef;
    $txt = <FILE>;
    $txt =~ s/whatever//sig;


    -- All code is 100% tested and functional unless otherwise noted.
      Finding the things around 'whatever' is different than just removing 'whatever'.
        My code was just a literal translation of the OP's remark...
        "As you can see, this code slurps a file and removes all occurrences of a certain word (`whatever')."


        -- All code is 100% tested and functional unless otherwise noted.
        Take a look at how that extract() routine is used a little more closely...
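
For readers comparing the two approaches, here is a sketch that runs the extract() loop and a single s///gi pass over the same inputs. The helper names remove_by_loop and remove_by_subst and the sample strings are invented for the illustration.

```perl
use strict;
use warnings;

# Same logic as extract() in the original post.
sub extract {
    my ($text) = @_;
    return $text =~ /(.*?)whatever(.*)/is ? ($1, $2) : ();
}

# Repeatedly cut out the first match until none is left.
sub remove_by_loop {
    my ($text) = @_;
    while (my ($one, $two) = extract($text)) {
        $text = $one . $two;
    }
    return $text;
}

# One left-to-right pass with s///g.
sub remove_by_subst {
    my ($text) = @_;
    $text =~ s/whatever//gi;
    return $text;
}

for my $input ("a whatever b WHATEVER c", "whatwhateverever") {
    printf "%-26s loop='%s' subst='%s'\n",
        "'$input'", remove_by_loop($input), remove_by_subst($input);
}
```

On ordinary text the two give the same result. They differ when a removal joins two fragments into a new occurrence: the loop rescans the joined string and removes that one too, while the single s///g pass leaves it in place.
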
      I was trying to make a point about the fact that this code runs incredibly slower on the prepackaged RH9 Perl 5.8.0 compared to the prepackaged (Mandrake 8 I believe) Perl 5.6.1. The example above was whipped up especially for this experiment, after a period of tracking down the exact pieces of code which were slowing down my original Perl programs.

      Only after noticing that the =~ /(.*?) constructs were leading to never-ending pauses in the Perl 5.8.0 code did I realize that adding a ^ anchor would eliminate the inherent ambiguity (the /s switch was on). That's how I made this short example, in which I added the extract subroutine so I could get clear results in the DProf profiler and make direct comparisons between the Perl versions. I was astounded to see that the slowdown ratio was not between 1.0 and 2.0 (meaning a tad slower), but somewhere between 500.0 and 1,000.0, which explains why buying new hardware was definitely more expensive than having somebody replace all /(.*?) regexps with /^(.*?) :).
Re: The Deceiver
by jryan (Vicar) on Aug 13, 2004 at 19:08 UTC
    This is an issue of good Perl and bad Perl

    You are correct. .*? is perhaps one of the least efficient singular regex constructs available. Why are you matching text you are not keeping, anyways? Are you unaware that there is an entirely separate construct (s/whatever//) made for removing text?

    Have you not read the extensive perlre documentation for the product that you are using? Just because something is free doesn't mean that you automatically know how to use it right-out-of-the-box.

    Also, if IBM had written Perl, it would probably take over a minute to start while it loaded its built in WSADIE plugins for J2EE development.

Re: Why does a Perl 5.6 regex run a lot slower on Perl 5.8?
by kscaldef (Pilgrim) on Aug 15, 2004 at 04:11 UTC

    I am going to guess that you have a UTF-8 LANG set in your environment.

    A consequence of this in 5.8.0, but not in 5.6.1 or 5.8.5, is that your file is implicitly opened as UTF-8. This may seem minor, but because you included the /i modifier, it probably slowed it down a lot, since case insensitivity in Unicode is a lot more complicated. You could test this by modifying your environment and rerunning, or by explicitly opening the file as latin-1, or by removing the /i.
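
One way to take the locale out of the picture is to name the I/O layer explicitly when opening the file. The sketch below assumes a perl with PerlIO layers (5.8 or later) and uses :raw as a stand-in for "treat the file as plain bytes"; the temporary file and its contents are invented for the demonstration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small sample file containing one non-ASCII character
# (e-acute, stored as the two UTF-8 bytes C3 A9).
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "caf\xC3\xA9 Whatever\n";
close $out or die "close failed: $!";

# Read it back as raw bytes, regardless of LANG or default layers.
open my $raw_fh, '<:raw', $path or die "open failed: $!";
my $bytes = do { local $/; <$raw_fh> };
close $raw_fh;

# Read it again, this time explicitly decoding UTF-8.
open my $utf_fh, '<:encoding(UTF-8)', $path or die "open failed: $!";
my $chars = do { local $/; <$utf_fh> };
close $utf_fh;

printf "bytes: %d, characters: %d\n", length $bytes, length $chars;
```

The raw read sees one more unit than the decoded read, which shows the two strings really are handled differently even though the file is the same. Rerunning the original test with the variable overridden on the command line (for example LANG=C) is the other check suggested above.
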

    You seem to discount the speed up you saw between 5.6.1 and 5.8.5 with your second regex version. I don't think this is really fair. I suspect that the regex engine really is faster in the later versions, when they are actually doing the same thing.

    The problem really seems to be that due to some subtleties in how certain things work in different versions of perl, the regex engine is not doing the same things in each of your cases. Since you are so willing to criticize the Perl community, I will gladly turn around and criticize you. This is not particularly obscure information. It's pretty well explained in perldelta, perlunicode, and other man pages. You apparently made the decision to upgrade perl versions without taking the time to research what changed. 5.6 to 5.8 is not a minor change: there are significant changes between the two which you would have been well advised to consider before making the switch.

    Furthermore, did you even stop to wonder why there were additional functions being called in one case and not the others? Don't you think this ought to have been a clue that things were not as simple as you would like to think?

      I am content about the one thing that I find relevant: my Perl codebase was successfully ported to Perl 5.8.5. My blaming the Perl community and other people's blaming me have both proved beside the point, and their only merit is artistic at most. As always seems to be the case, those people who had only technical points to make were the most useful.

Node Type: perlmeditation [id://382646]
Approved by Limbic~Region
Front-paged by Plankton