Re: Parser Performance Question
by kcott (Archbishop) on Oct 05, 2017 at 00:48 UTC
|
G'day songmaster,
As LanX alluded to in his update,
I suspect the /o modifier may be the issue:
it certainly stood out as I read through the code fragments you provided.
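As a minimal sketch of why /o can bite (the word list and target string are made up for illustration): with /o the pattern is interpolated and compiled only the first time it is reached, so later changes to the variable are silently ignored.

```perl
use strict;
use warnings;

my (@hits_with_o, @hits_without_o);

for my $word (qw(bar foo)) {
    # With /o the pattern is frozen as /bar/ on the first pass,
    # so the second pass still tests /bar/ even though $word is 'foo'.
    push @hits_with_o, $word if 'bar' =~ /$word/o;

    # Without /o the current value of $word is used every time.
    push @hits_without_o, $word if 'bar' =~ /$word/;
}

print "with /o:    @hits_with_o\n";
print "without /o: @hits_without_o\n";
```

Here 'foo' is wrongly reported as a hit by the /o version, because the pattern it actually tests is still /bar/.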
In "perlre: Modifiers" you'll see:
"o - pretend to optimize your code, but actually introduce bugs"
That provides a link to further information in
"perlop: Regexp Quote-Like Operators"
but the fragment identifier
(#s%2fPATTERN%2fREPLACEMENT%2fmsixpodualngcer) is wrong.
The closest to that is probably
#s/_PATTERN_/_REPLACEMENT_/msixpodualngcer; however, the one with the most information about /o,
and probably more appropriate given the code you've shown,
is #m/_PATTERN_/msixpodualngc,
which culminates in:
"The bottom line is that using /o is almost never a good idea."
I probably would have created all of those regexes at compile time,
and I would have used my instead of our variables.
A dispatch table with actions based on matches may also be appropriate.
You don't show sufficient code to make any direct modification recommendations.
The following script simply suggests a technique you could adapt to your needs.
#!/usr/bin/env perl
use strict;
use warnings;
my %capture;
BEGIN {
    my $RXone = qr{(?x: 1 )};
    my $RXthree = qr{(?x: 3 )};
    my $RXnum = qr{(?x: $RXone | $RXthree )};
    my $RXstr = qr{(?x: ( [a-z]+ $RXnum ) )};
    %capture = (
        menu => {
            regexp => qr{(?x: ^ menu \s+ $RXstr $ )},
            action => sub { parse_menu(@_) },
        },
        driver => {
            regexp => qr{(?x: ^ driver \s+ $RXstr $ )},
            action => sub { parse_driver(@_) },
        },
    );
}
my @capture_keys = keys %capture;
while (<DATA>) {
    for my $capture_key (@capture_keys) {
        if (/$capture{$capture_key}{regexp}/) {
            $capture{$capture_key}{action}->($1);
            last;
        }
    }
}
sub parse_menu { print "MENU: @_\n" }
sub parse_driver { print "DRIVER: @_\n" }
__DATA__
menu menu1
driver driver1
other other1
menu menu2
driver driver2
other other2
menu menu3
driver driver3
other other3
You may have sufficient, up-front knowledge about those "capture keys"
to predefine an ordered @capture_keys rather than relying on the random list returned by keys.
Output from a sample run of that script:
MENU: menu1
DRIVER: driver1
MENU: menu3
DRIVER: driver3
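The earlier suggestion of predefining an ordered @capture_keys, rather than relying on keys, could look something like the sketch below. The patterns and input lines are simplified stand-ins chosen for illustration, not the ones from the script above.

```perl
use strict;
use warnings;

# Simplified dispatch table: each entry pairs a precompiled regex with an action.
my %capture = (
    menu   => { regexp => qr{^menu\s+(\w+)$},   action => sub { "MENU: $_[0]" } },
    driver => { regexp => qr{^driver\s+(\w+)$}, action => sub { "DRIVER: $_[0]" } },
);

# Fixed order, e.g. most frequent line type first (an assumption about the data).
my @capture_keys = qw(menu driver);

my @results;
for my $line ('menu menu1', 'driver driver1', 'other other1') {
    for my $key (@capture_keys) {
        if (my ($name) = $line =~ $capture{$key}{regexp}) {
            push @results, $capture{$key}{action}->($name);
            last;    # first matching entry wins
        }
    }
}
print "$_\n" for @results;
```

With a fixed order the patterns are always tried in the same sequence, which also makes timing runs repeatable.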
Update (minor code alteration):
My original code had $capure_key (missing "t") throughout.
I've changed that to $capture_key globally; retested; output unchanged.
| [reply] [d/l] [select] |
|
It turns out that qr// is actually the worst way to build regexes, but adding /o mostly fixes the problem.
use Benchmark qw( cmpthese );
print $], "\n";
open my $W, '<', '/usr/share/dict/words' or die;
my @words = <$W>;
close $W;
my $s = '(?<![cC])[eE][iI]';
my $re = qr/(?<![cC])[eE][iI]/;
cmpthese(-5, {
    re => sub { grep /(?<![cC])[eE][iI]/, @words },
    qr => sub { grep /$re/, @words },
    qro => sub { grep /$re/o, @words },
    s => sub { grep /$s/, @words },
    so => sub { grep /$s/o, @words },
});
__END__
5.026001
       Rate   qr    s  qro   so   re
qr   7.14/s   -- -57% -67% -68% -69%
s    16.8/s 135%   -- -22% -26% -26%
qro  21.4/s 200%  28%   --  -5%  -6%
so   22.5/s 215%  34%   5%   --  -1%
re   22.7/s 218%  36%   6%   1%   --
| [reply] [d/l] |
|
Update: sorry please ignore, misread benchmark
Well maybe you used the worst way to ask the question.
qr is meant to precompile, so why should it be used in a loop?
The OP is building a parser; his grammar doesn't change on the fly.
| [reply] |
Re: Parser Performance Question (updated /o)
by LanX (Saint) on Oct 04, 2017 at 21:59 UTC
|
Just some general advice:
It's probably an optimization or bug fix having bad side-effects.
If I were you I'd try to see if the regex-op-trees stay the same with use re 'debug';
see re for details.
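A minimal sketch of switching on regex debugging for a single match (the sample line and pattern are made up for illustration). The pragma is lexically scoped, and the trace goes to STDERR.

```perl
use strict;
use warnings;

my $line = "=head1 NAME\n";
my $matched;
{
    # Only regexes compiled inside this block are traced.
    use re 'debug';
    $matched = $line =~ m/^ ( = [a-zA-Z] .* ) \n/x ? 1 : 0;
}
print "matched: $matched\n";
```

Running the same file under two perl versions and diffing the STDERR output is one way to compare what the engine does.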
You could also have a look into perldelta 5.20, to see what changed with 5.20.
edit
And you should provide a minimal example reproducing the problem. I suppose your benchmark only reflects the "real" program, which we can't know.
update
A guess: The /o modifier is nowadays mostly useless (IIRC)!
Try to measure whether it's causing the problem. | [reply] [d/l] [select] |
Re: Parser Performance Question
by songmaster (Beadle) on Oct 05, 2017 at 09:02 UTC
|
Thanks for the replies so far. Taking out all the /o flags (which were supposed to speed up regexes back when we actually used Perl 5.6; yes, this project is that old, although this code isn't) helps a bit. Under Perl 5.24.1 the timing is now as follows (5.18.0 shown first for comparison):
woz$ perlbrew use 5.18.0
woz$ time perl -CSD registerRecordDeviceDriver.pl softIoc.dbd
real 0m0.417s
user 0m0.377s
sys 0m0.020s
woz$ perlbrew use 5.24.1
woz$ time perl -CSD registerRecordDeviceDriver.pl softIoc.dbd
real 0m7.549s
user 0m7.215s
sys 0m0.077s
So that's another 2 seconds saved, but it still takes 7 seconds longer than it does under Perl 5.18.0.
@Ken: I'm using our variables because they are actually set in another module. Given that the profiler doesn't show any significant amount of time spent in the Parser::CORE:regcomp opcode (presumably related, but I don't know the internals), I don't think pre-compiling the regexps will make any difference.
Looking through the individual regexp profiles again, I now see that there is one for detecting Perl POD which is taking up almost all of that 7 seconds:
if (m/\G ( = [a-zA-Z] .* ) \n/xgc) {
$obj->add_pod($1, parsePod());
}
Any ideas why this specific regexp is so slow in Perl >= 5.20? It's probably the only one that uses .* to match to the end of a line.
I tried adding use re "debug"; and it's outputting lots of lines like this, which given the reference to an anchored substr "=" is probably the above match:
doing 'check' fbm scan, [345261..414818] gave 345300
Found floating substr "%n" at offset 345300 (rx_origin now 345259)...
doing 'other' fbm scan, [345259..345299] gave -1
Contradicts anchored substr "="; about to retry anchored at offset 345301 (rx_origin now 345299)...
doing 'check' fbm scan, [345301..414818] gave 345317
Found floating substr "%n" at offset 345317 (rx_origin now 345299)...
doing 'other' fbm scan, [345299..345316] gave -1
Contradicts anchored substr "="; about to retry anchored at offset 345318 (rx_origin now 345316)...
Further hints on understanding what those messages mean and how to clean this up would be most appreciated.
- Andrew
| [reply] [d/l] [select] |
|
> there is one for detecting Perl POD which is taking up almost all of that 7 seconds:
POD has to start at the beginning of a line, try to anchor your = there.
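One way to read the anchoring advice is to add ^ under /m, so that = can only match at the start of a line. The sketch below uses made-up input and a catch-all branch for other lines; whether this avoids the slowdown discussed in this thread is not established here.

```perl
use strict;
use warnings;

my $text = "xxx\n=head1 NAME\nxxx\n=cut\n";
my @pod;
my $other = 0;

for ($text) {    # alias $_ so the \G matches share one pos()
    while (1) {
        # ^ with /m pins '=' to the beginning of a line.
        if (m/\G ^ ( = [a-zA-Z] .* ) \n/xmgc) {
            push @pod, $1;
        }
        elsif (m/\G .* \n/xgc) {    # any other line (illustrative catch-all)
            $other++;
        }
        else {
            last;
        }
    }
}
print "pod lines: @pod\n";
print "other lines: $other\n";
```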
It seems your regex is backtracking from the end after anchoring at \n; Perl has heuristics to decide whether to start searching from the back or from the front.
And if it's parsing the whole file rather than line by line, the run time might grow exponentially with file size (try benchmarking other file sizes).
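A rough harness for trying several input sizes might look like this. The scan sub, the block layout and the sizes are assumptions made for illustration; only the two regexes come from the thread. Time::HiRes is a core module.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Walk the whole buffer with \G matches, counting POD-like and filler lines.
sub scan {
    my ($text) = @_;
    my ($pod, $nx) = (0, 0);
    for ($text) {
        while (1) {
            if    (m/\G ( = [a-zA-Z] .* ) \n/xgc) { $pod++ }
            elsif (m/\G x+ \n/xgc)                { $nx++ }
            else                                  { last }
        }
    }
    return ($pod, $nx);
}

my $line  = 'x' x 50 . "\n";
my $block = $line x 50 . "=foo bar\n";    # 50 filler lines, then one POD-like line

my %count;
for my $n (50, 100, 200) {
    my $text = $block x $n;
    my $t0   = time;
    my ($pod, $nx) = scan($text);
    $count{$n} = [$pod, $nx];
    printf "%4d blocks: %d pod, %d other lines, %.3fs\n", $n, $pod, $nx, time - $t0;
}
```

If the time roughly doubles when the size doubles, the scan is linear; anything steeper points at repeated rescanning of the buffer.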
I have to say some of your regexes so far look broken and badly tested; it's not Perl's fault if they don't perform optimally.
And PLEASE try to provide a SSCCE including input data to facilitate us helping instead of speculating.
| [reply] |
|
I agree with LanX: we need a short, self-contained example. For example I tried with this, but it doesn't reproduce the problem:
$n = 'x' x 50 . "\n";
$p = "=foo $n";
$np = ($n x 50) . $p;
$_= $np x 100_000;
1 while m/\G ( = [a-zA-Z] .* ) \n/xgc;
In fact for me, 5.20 is 3 times faster than 5.18 with that example. Since for 5.20.0 I heavily reworked the part of the regex engine which is giving those debugging messages you show, I'd be very interested to have access to real working examples of where my changes made things go slower rather than faster.
Dave. | [reply] [d/l] |
|
Sorry for the delay in responding further to this, and thanks to everyone for their input. The fix I have committed for now was to move the .* match out into a separate regex from the = [a-zA-Z] part and this works okay, but I would prefer something slightly less ugly.
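The committed fix itself isn't shown, so the following is only a guess at its shape: match the = [a-zA-Z] prefix first, then pick up the rest of the line with a second \G match. The input and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $text = "xxx\n=foo bar\nxxx\n";
my @pod;
my $nx = 0;

for ($text) {    # alias $_ so all \G matches share one pos()
    while (1) {
        if (m/\G ( = [a-zA-Z] )/xgc) {
            my $head = $1;
            # The rest of the line is matched by a separate regex.
            m/\G ( .* ) \n/xgc or last;
            push @pod, $head . $1;
        }
        elsif (m/\G x+ \n/xgc) {
            $nx++;
        }
        else {
            last;
        }
    }
}
print "pod: @pod\n";
print "x lines: $nx\n";
```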
Here is some stand-alone code that demonstrates the regression, although it doesn't show quite as dramatic a slow-up as my original:
#!/usr/bin/env perl
$l = 'x' x 50 . "\n";
$x = $l x 50;
$p = "=foo bar\n";
$_= ($x . $p) x 500 . $x;
$nx = 0;
while (1) {
    if (m/\G ( = [a-zA-Z] .* ) \n/xgc) {
        $pod .= $1;
    }
    elsif (m/\G x+ \n/xgc) {
        # match xxx lines
        $nx++;
    }
    else {
        last;
    }
}
My results show this taking 3-4 times as long under 5.20.0 as under 5.18.0:
woz$ perlbrew use 5.18.0
woz$ time perl re.pl
real 0m0.035s
user 0m0.026s
sys 0m0.004s
woz$ perlbrew use 5.20.0
woz$ time perl re.pl
real 0m0.128s
user 0m0.120s
sys 0m0.005s
- Andrew | [reply] [d/l] [select] |
Re: Parser Performance Question
by Anonymous Monk on Oct 04, 2017 at 22:46 UTC
|
Hi, don't use the /o flag; /o has been useless since about 5.6, when qr replaced it. The docs really ought to read "o - pretend to optimize your code, but actually introduce bugs", and perl ought to reject /o outright :)
| [reply] |
Re: Parser Performance Question
by KurtZ (Friar) on Oct 05, 2017 at 12:39 UTC
|
> Would love to replace the Perl code with Python,
Good luck; Python has a compatible regexp engine.
I don't think it can solve conceptual problems in your code. :)
| [reply] |
Re: Parser Performance Question
by Anonymous Monk on Oct 05, 2017 at 13:26 UTC
|
Almost everyone seems to think that it is best to "start over" in the language that they prefer . . . until they actually start trying to do it! | [reply] |
Re: Parser Performance Question
by LanX (Saint) on Oct 05, 2017 at 11:44 UTC
|
Unrelated to your performance question, but this
> qr/ " (?: [^"] | \\" )* " /x;
is broken code.
Try to guess what it's supposed to do, and then try to test what it really does, and you'll see why testing is important.
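One reading of the hint, sketched on a made-up input: [^"] also matches a backslash, so an escaped quote ends the string early. The "fixed" pattern is a common idiom offered as an assumption, not something taken from the original code.

```perl
use strict;
use warnings;

my $broken = qr/ " (?: [^"] | \\" )* " /x;

# Exclude the backslash from the plain-character class and
# consume a backslash together with whatever follows it.
my $fixed = qr/ " (?: [^"\\] | \\. )* " /xs;

my $input = '"a\"b" rest';    # a quoted string containing an escaped quote

my ($got_broken) = $input =~ /($broken)/;
my ($got_fixed)  = $input =~ /($fixed)/;

print "broken: $got_broken\n";
print "fixed:  $got_fixed\n";
```

The broken pattern stops at the escaped quote and returns "a\" while the second one returns the whole "a\"b".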
| [reply] [d/l] |
|
Please, stop talking in riddles. Do you mean "a\\"b"?
($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord
}map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
| [reply] [d/l] [select] |
Re: Parser Performance Question
by ikegami (Patriarch) on Oct 07, 2017 at 01:32 UTC
|
Nearly every if not every "*", "+" and "?" in your code should be followed by a "+" to prevent unnecessary backtracking. This might side-step the issue?
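A small sketch of what a possessive quantifier does, using made-up strings. In the POD-style pattern the result is unchanged, but the second pair shows that a possessive quantifier can change the outcome when the pattern relies on backtracking, so each one needs checking.

```perl
use strict;
use warnings;

my $line = "=head1 NAME\n";

# Same capture either way: '.' never matches "\n", so nothing has to be given back.
my ($greedy)     = $line =~ m/^ ( = [a-zA-Z] .*  ) \n/x;
my ($possessive) = $line =~ m/^ ( = [a-zA-Z] .*+ ) \n/x;

# Caveat: a*+ keeps every 'a', so the trailing 'a' can never match.
my $greedy_ok     = 'aaa' =~ /^a*a$/  ? 1 : 0;
my $possessive_ok = 'aaa' =~ /^a*+a$/ ? 1 : 0;

print "greedy: $greedy\npossessive: $possessive\n";
print "a*a: $greedy_ok, a*+a: $possessive_ok\n";
```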
| [reply] [d/l] [select] |