Beefy Boxes and Bandwidth Generously Provided by pair Networks
Come for the quick hacks, stay for the epiphanies.
 
PerlMonks  

Re: regex gotcha moving from 5.8.8 to 5.30.0?

by mordibity (Acolyte)
on Feb 10, 2021 at 18:20 UTC ( [id://11128192]=note: print w/replies, xml ) Need Help??


in reply to regex gotcha moving from 5.8.8 to 5.30.0?

Thank you all for your thoughts!

I guess being coy isn't really helpful, I was just trying to avoid distractions. This is (really) part of a parser I've written for structural Verilog netlists in an IC design environment. The whole parser does much more work and handles many more nuances of the input format, but I've been happy enough (for years) with the performance without trying to speed it up further (eg sw1's whitespace handling tips). For example, a typical report on a typical netlist input file (200+Mb) would run in ~10 seconds. But then CAD upgraded our central Perl (from 5.8.8 to 5.30) and the same code, on the same input took 1 hour 38 minutes! (With the same output results, so it's purely a performance issue, not a correctness issue.) So then I set about cutting it down to a very small testcase with the 5 sec to 105 sec delta.

AnomalousMonk: it's just plain ascii text; I tried throwing "aa" on it (for 5.30) and didn't change things.

kschwab: below I've put a better fake-data generator to get a consistent 2.5x slowdown (still not as bad as the 10x on the real data, but hopefully more representative of the problem. The majority of the names would be unique strings, etc, etc). And also here's the un-foo-ified cut-down parser, too.

SBECK: thank you! I guess I'll start looking at 5.20-related deltas. But I don't know that I'll be able to infer anything on my own; suggestions welcome from all on how to proceed (file a github issue, etc?)

faker & parser are used like this:

$ ./makev.pl 10000 > test.10k.v $ time perl_5.8.8 -w testv.pl test.10k.v LAST MODULE (Perl 5.008008): je0dhj perl_5.8.8 -w testv.pl test.10k.v 0.01s user 0.01s system 0% cpu 11.3 +52 total $ time perl_5.30 -w testv.pl test.10k.v LAST MODULE (Perl 5.030000): je0dhj perl_5.30 -w testv.pl test.10k.v 0.01s user 0.02s system 0% cpu 26.29 +1 total

makev.pl:

!/usr/bin/perl -w use Text::Wrap; use strict; my $num = shift or die "num?\n"; my @chars = ( "a" .. "z", 0 .. 9 ); for my $i (0 .. $num) { my $name = join("", @chars[ map { rand @chars } ( 1 .. 2+int(rand( +8)) )]); my @io = map { "p${name}${_}" } (0..int(rand(100))); my @hier = map { "m${name}${_}" } (0..int(rand(20))); my @leaf = map { "s${name}${_}" } (0..int(rand(200))); print "module ${name} ( ", wrap('', ' ', join(", ", @io)), "\n);\ +n"; print " inout $_;\n" foreach @io; print " wire $_;\n" foreach @io; for my $leaf (@leaf) { my @conn = map { ".P${name}${_} (n${name}${_})" } (0..int(rand +(5))); print "$leaf u_$leaf ( ", wrap('', ' ', join(", ", @conn)), " +\n);\n"; } for my $hier (@hier) { my @conn = map { ".p${name}${_} (n${name}${_})" } (0..int(rand +(100))); print "$hier u_$hier ( ", wrap('', ' ', join(", ", @conn)), " +\n);\n"; } print "endmodule\n\n"; }

testv.pl:

use strict; use File::Slurp; my $file = shift or die "file?\n"; my $text = read_file($file); parse_v($text); sub parse_v { my $text = shift; my $name; { last if $text =~ /\G \s* \Z/gcmsx; if ($text =~ /\G \s* ^ \s* module \s+ (\S+?) \s* \( \s* (. +*?) \s* \) \s* ;/gcmsx) { $name = $1 } elsif ($text =~ /\G \s* ^ \s* endmodule /gcmsx) { } elsif ($text =~ /\G \s* ^ \s* \S+ \s+ .*? \s* ;/gcmsx) { } else { die "ERROR: unknown syntax\n" } redo; } print "LAST MODULE (Perl $]): $name\n"; }

Replies are listed 'Best First'.
Re^2: regex gotcha moving from 5.8.8 to 5.30.0?
by Tux (Canon) on Feb 11, 2021 at 14:37 UTC

    Thanks for the test case. Having every perl ever released available threaded and unthreaded, I get the folowing results when I only run on last minor of every major release (see below: left column is sorted on elapsed time, right column is sorted on perl version)

    It shows that performance for your test-case was really bad in 5.19, 5.20, 5.21, and 5.22, but it was much better in 5.23 again.

    It might be a worrying indicator that 5.31 and 5.32 are slower again then 5.30 and below.

    The last release that runs your test comparable to 5.8.8 is 5.18.4 (or 5.16.3)

    $ perl-all --time --last testv.pl test.2k.v Running perl-all testv.pl test.2k.v On 57 versions of perl ranging from perl5.6.2 to tperl5.33.6 : : rank elapsed cuser csys pass perl | perl + rank elapsed ==== ========== ======= ======= ==== ================= | ============= +==== ==== ========== 1 1.53014 1.440 0.090 PASS base/tperl5.6.2 | /usr/bin/perl + 4 1.821 2 1.54001 1.450 0.080 PASS base/perl5.6.2 | base/perl5.6. +2 2 1.540 3 1.77906 1.690 0.100 PASS base/perl5.8.9 | base/tperl5.6 +.2 1 1.530 4 1.82076 1.730 0.090 PASS /usr/bin/perl | base/perl5.7. +3 26 2.641 5 1.87299 1.790 0.080 PASS base/perl5.9.5 | base/tperl5.7 +.3 27 2.852 6 1.94761 1.850 0.090 PASS base/tperl5.8.9 | base/perl5.8. +9 3 1.779 7 1.96104 1.880 0.080 PASS base/tperl5.9.5 | base/tperl5.8 +.9 6 1.948 8 2.04763 1.960 0.090 PASS base/perl5.13.11 | base/perl5.9. +5 5 1.873 9 2.07248 1.990 0.090 PASS base/perl5.12.5 | base/tperl5.9 +.5 7 1.961 10 2.08193 1.990 0.100 PASS base/perl5.14.4 | base/perl5.10 +.1 15 2.171 11 2.08614 1.990 0.090 PASS base/perl5.15.9 | base/tperl5.1 +0.1 19 2.236 12 2.09114 2.010 0.090 PASS base/tperl5.15.9 | base/perl5.11 +.5 21 2.266 13 2.10513 2.020 0.080 PASS base/perl5.16.3 | base/tperl5.1 +1.5 16 2.176 14 2.11247 2.020 0.090 PASS base/tperl5.16.3 | base/perl5.12 +.5 9 2.072 15 2.17058 2.090 0.090 PASS base/perl5.10.1 | base/tperl5.1 +2.5 18 2.216 16 2.17647 2.080 0.100 PASS base/tperl5.11.5 | base/perl5.13 +.11 8 2.048 17 2.19196 2.100 0.090 PASS base/tperl5.14.4 | base/tperl5.1 +3.11 20 2.245 18 2.21637 2.130 0.090 PASS base/tperl5.12.5 | base/perl5.14 +.4 10 2.082 19 2.23559 2.120 0.110 PASS base/tperl5.10.1 | base/tperl5.1 +4.4 17 2.192 20 2.24530 2.150 0.090 PASS base/tperl5.13.11 | base/perl5.15 +.9 11 2.086 21 2.26649 2.180 0.080 PASS base/perl5.11.5 | base/tperl5.1 +5.9 12 2.091 22 2.44358 2.360 0.090 PASS base/tperl5.17.11 | base/perl5.16 +.3 13 2.105 23 2.45182 2.370 0.090 PASS base/perl5.17.11 | base/tperl5.1 +6.3 14 2.112 24 2.47311 2.380 0.100 PASS base/perl5.18.4 | base/perl5.17 +.11 23 2.452 25 2.56640 2.470 0.090 PASS base/tperl5.18.4 | base/tperl5.1 +7.11 22 2.444 26 2.64125 2.540 0.100 PASS base/perl5.7.3 | base/perl5.18 +.4 24 2.473 27 2.85200 2.750 0.100 PASS base/tperl5.7.3 | base/tperl5.1 +8.4 25 2.566 28 4.81823 4.780 0.030 PASS base/perl5.30.3 | base/perl5.19 +.11 54 12.516 29 4.84806 4.800 0.040 PASS base/perl5.28.3 | base/tperl5.1 +9.11 55 12.527 30 4.87266 4.830 0.040 PASS base/perl5.29.10 | base/perl5.20 +.3 57 12.621 31 4.87333 4.840 0.040 PASS base/tperl5.29.10 | base/tperl5.2 +0.3 50 12.173 32 4.90558 4.870 0.030 PASS base/perl5.26.3 | base/perl5.21 +.11 53 12.495 33 4.91229 4.880 0.040 PASS base/perl5.27.11 | base/tperl5.2 +1.11 56 12.529 34 4.93008 4.880 0.050 PASS base/tperl5.26.3 | base/perl5.22 +.4 52 12.443 35 4.94565 4.900 0.040 PASS base/tperl5.30.3 | base/tperl5.2 +2.4 51 12.416 36 4.96723 4.930 0.030 PASS base/perl5.23.9 | base/perl5.23 +.9 36 4.967 37 4.97168 4.940 0.040 PASS base/perl5.24.4 | base/tperl5.2 +3.9 41 5.148 38 5.02614 4.980 0.040 PASS base/tperl5.25.12 | base/perl5.24 +.4 37 4.972 39 5.05600 5.010 0.040 PASS base/tperl5.28.3 | base/tperl5.2 +4.4 43 5.201 40 5.12577 5.080 0.040 PASS base/tperl5.27.11 | base/perl5.25 +.12 42 5.160 41 5.14769 5.110 0.030 PASS base/tperl5.23.9 | base/tperl5.2 +5.12 38 5.026 42 5.16021 5.110 0.040 PASS base/perl5.25.12 | base/perl5.26 +.3 32 4.906 43 5.20146 5.150 0.040 PASS base/tperl5.24.4 | base/tperl5.2 +6.3 34 4.930 44 8.68506 8.640 0.040 PASS base/perl5.32.1 | base/perl5.27 +.11 33 4.912 45 8.84718 8.800 0.040 PASS base/perl5.31.11 | base/tperl5.2 +7.11 40 5.126 46 9.22812 9.200 0.030 PASS base/tperl5.31.11 | base/perl5.28 +.3 29 4.848 47 9.39026 9.360 0.030 PASS base/tperl5.32.1 | base/tperl5.2 +8.3 39 5.056 48 9.56254 9.530 0.040 PASS base/perl5.33.6 | base/perl5.29 +.10 30 4.873 49 10.05364 10.010 0.040 PASS base/tperl5.33.6 | base/tperl5.2 +9.10 31 4.873 50 12.17332 12.130 0.040 PASS base/tperl5.20.3 | base/perl5.30 +.3 28 4.818 51 12.41640 12.380 0.030 PASS base/tperl5.22.4 | base/tperl5.3 +0.3 35 4.946 52 12.44274 12.400 0.030 PASS base/perl5.22.4 | base/perl5.31 +.11 45 8.847 53 12.49524 12.450 0.040 PASS base/perl5.21.11 | base/tperl5.3 +1.11 46 9.228 54 12.51614 12.470 0.040 PASS base/perl5.19.11 | base/perl5.32 +.1 44 8.685 55 12.52714 12.480 0.040 PASS base/tperl5.19.11 | base/tperl5.3 +2.1 47 9.390 56 12.52865 12.480 0.050 PASS base/tperl5.21.11 | base/perl5.33 +.6 48 9.563 57 12.62089 12.570 0.040 PASS base/perl5.20.3 | base/tperl5.3 +3.6 49 10.054

    Enjoy, Have FUN! H.Merijn

      That's fascinating, thank you! It looks like through blind luck I've been frozen-in-time with a version whose performance happened to complement the way I was trying to tackle the parsing problem...

      For my side, I suppose I can stick with 5.8.8, or jump to 5.30+ by possibly changing my regexes as sw1 and SBECK have suggested to avoid "\s* ^ \s*" and compromising a little on the strictness of the parser.

      But for the sake of Perl, however, is there a recommended escalation path to alert the developers about this regression / add it to the test suite / etc? Or is it niche enough not to be interesting?

        Just a FYI: perl developers are already aware. (but I suppose you could still create an issue for it; less chance for it to be forgotten) It's also been discussed a bit on the #p5p IRC channel.
        You can try perlbug - 'how to submit bug reports on Perl'. Or ask someone else to create a ticket.
        Also you can add a link to this perlmonks forum discussion. I think it is a regression of at least of medium severity.
Re^2: regex gotcha moving from 5.8.8 to 5.30.0?
by SBECK (Chaplain) on Feb 11, 2021 at 13:13 UTC
    So I did some more testing to tweak the regexps to see if I could narrow it down, and I found a new set of regexps that run the same speed from 5.8 all the way to 5.30 AND they are something like 10% faster than the original (on my host, your regexps take 10-12 secs, and mine take 9-11).

    All of your regexps match up to (but not including) the newline. Based on the comment that '^' might be causing problems, I just added a '\s*' to the end of each regexp so basically each one of them grabs the newline and any additional whitespace. The following runs quickly for all versions:

    sub parse_foo { my ($text) = @_; my $name; { last if $text =~ /\G \Z/gcmsx; if ($text =~ /\G begfoo \s+ (\S+?) \s* \( \s* (.*?) \s* \) + \s* ; \s*/gcmsx) { $name = $1 } elsif ($text =~ /\G endfoo \s* /gcmsx) { } elsif ($text =~ /\G \S+ \s+ .*? \s* ; \s*/gcmsx) { } else { die "ERROR: unknown syntax\n" } redo; } print "LAST FOO: $name\n"; }

      Very cool, thanks! It's a seductive solution, being even faster than the original regexes, but I'm having pangs about "correctness" of the format... By letting each sub-regex consume its trailing newline, I can no longer enforce that the main keywords are the first token on any given line, and input like this (all smushed together on one line) isn't flagged as illegal/unknown syntax:

      begfoo a ( a, b, c); endfoo begfoo b ( d, e, f ); input d; foo inst1 (a,b,c); endfoo

      In other words, way too liberal in what I accept! :-) The commercial tools would reject that instantly. But, for my reporting and analysis purposes, it's harmless, and it would let me move to 5.30 and pick up the other benefits of a more modern Perl... Hmm.

      I did spend some time experimenting/trying to write the sub-regexes to avoid the possibly-poisonous "\s* ^ \s*" to instead all begin with "\G ^" by either having each sub-regex consume their respective newline OR consuming them all in a separate sub-regex (like sw1 suggested in their "# march through any white space"), but I couldn't get it to work. I think it may be a catch-22 scenario: if the newline is present/next in the string, "\G ^" won't match it, since it matches after a newline. But if the newline has been consumed, "\G ^" also won't match it, since it's not there...)

        Tried a whole bunch of things, not all worked, but currently at about 20X faster on 231MB fake file (perl v5.32.0).

        #!/usr/bin/perl use strict; # https://perlmonks.org/?node_id=11128141 use warnings; use Time::HiRes qw( time ); my $string = do { local(@ARGV, $/) = '50k.foo'; <> }; my $start = time; parse_v( $string ); printf "seconds %.3f for length %d file\n", time - $start, length $str +ing; sub parse_v { local $_ = shift; my $name; while( 1 ) { if(/\G (?: (?!endmodule\b|module\b) \S+ \s [^;]* ; | (?<!\N) endmodule \b) \s* /gcx) { } elsif(/\G (?<!\N) module \s+ (\S+?) \s* \( [^)]* \) \s* ; \s* /gcx +) { $name = $1 } else { /\G \z/gcx ? last : die "ERROR: unknown syntax at @{[pos($_ +)]}\n" } } print "LAST MODULE (Perl $]): $name\n"; }

        For double negative fans, (?<!\N) means "not preceded by not a newline".

        Hm. May some other variants help? Or are they way slower?....
        "\G \s*? ^ \s*" # non-greedy
        "\G (?= \s* ^ ) \s*" # look-ahead
        Upd. And do they reproduce regression?

        Upd. May that factoring out of "\s* ^ \s*" help?
        { last if $text =~ /\G \s* \Z/gcmsx; if ($text =~ /\G \s* ^ \s*/gcmsx) { if ($text =~ /\G module \s+ (\S+?) \s* \( \s* (.*?) \s +* \) \s* ;/gcmsx) { $name = $1 } elsif ($text =~ /\G endmodule /gcmsx) { } elsif ($text =~ /\G \S+ \s+ .*? \s* ;/gcmsx) { } else { die "ERROR: unknown syntax\n" } } else { die "ERROR: unknown syntax\n" } redo; }
Re^2: regex gotcha moving from 5.8.8 to 5.30.0?
by choroba (Cardinal) on Feb 10, 2021 at 20:05 UTC
    What about reading the file in chunks? "endmodule" seems like a good candidate for the separator.
    #!/usr/bin/perl use warnings; use strict; use feature qw{ say }; local $/ = 'endmodule'; my $name; while (<>) { if (/^ \s* module \s+ (\S+?) \s* \( \s* .*? \s* \) \s* ;/xsm) { $name = $1; say $name; } } say "LAST MODULE (Perl $]): $name";
    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re^2: regex gotcha moving from 5.8.8 to 5.30.0?
by haukex (Archbishop) on Feb 12, 2021 at 08:09 UTC
    below I've put a better fake-data generator to get a consistent 2.5x slowdown (still not as bad as the 10x on the real data, but hopefully more representative of the problem.

    IMHO, it would be best if you could figure out what's causing the 10x slowdown* and generate a fake file with that data. Perhaps anonymizing the original data would be an option?

    * Update: By which I mean: which part of the original data is causing the slowdown.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://11128192]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others examining the Monastery: (4)
As of 2024-04-18 03:41 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found