Re^2: regex gotcha moving from 5.8.8 to 5.30.0?

I did some more testing, tweaking the regexps to see if I could narrow it down, and found a new set that runs at the same speed from 5.8 all the way to 5.30 AND is roughly 10% faster than the original (on my host, your regexps take 10-12 secs and mine take 9-11).

All of your regexps match up to (but not including) the newline. Based on the comment that '^' might be causing problems, I just added a '\s*' to the end of each regexp so basically each one of them grabs the newline and any additional whitespace. The following runs quickly for all versions:

sub parse_foo {
    my ($text) = @_;
    my $name;
    {
        last if $text =~ /\G \Z/gcmsx;

        if     ($text =~ /\G begfoo \s+ (\S+?) \s* \( \s* (.*?) \s* \) \s* ; \s*/gcmsx) { $name = $1 }
        elsif  ($text =~ /\G endfoo \s*        /gcmsx) { }
        elsif  ($text =~ /\G \S+ \s+  .*? \s* ; \s*/gcmsx) { }
        else { die "ERROR: unknown syntax\n" }

        redo;
    }
    print "LAST FOO: $name\n";
}

Replies are listed 'Best First'.
Re^3: regex gotcha moving from 5.8.8 to 5.30.0? by mordibity (Acolyte) on Feb 11, 2021 at 21:09 UTC
Very cool, thanks! It's a seductive solution, being even faster than the original regexes, but I'm having pangs about the "correctness" of the format... By letting each sub-regex consume its trailing newline, I can no longer enforce that the main keywords are the first token on any given line, and input like this (all smushed together on one line) isn't flagged as illegal/unknown syntax:

`begfoo a ( a, b, c); endfoo begfoo b ( d, e, f ); input d; foo inst1 (a,b,c); endfoo`

In other words, way too liberal in what I accept! :-) The commercial tools would reject that instantly. But, for my reporting and analysis purposes, it's harmless, and it would let me move to 5.30 and pick up the other benefits of a more modern Perl... Hmm.

I did spend some time trying to rewrite the sub-regexes to avoid the possibly-poisonous "\s* ^ \s*" and instead have them all begin with "\G ^", either by having each sub-regex consume its own newline OR by consuming the newlines in a separate sub-regex (like sw1 suggested in their "# march through any white space"), but I couldn't get it to work. I think it may be a catch-22 scenario: if the newline is present/next in the string, "\G ^" won't match it, since it matches after a newline. But if the newline has been consumed, "\G ^" also won't match it, since it's not there...
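For reference, here is a minimal sketch of how "\G ^" behaves under /m. The `^` anchor is zero-width: it never consumes the newline, it only checks that the current position is at the start of the string or immediately after a newline. The helper name `line_start_ok` and the sample text are made up for illustration and are not from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): pretend an earlier /gc match
# left off at offset $pos, then test whether "\G ^" matches there under /m.
sub line_start_ok {
    my ($text, $pos) = @_;
    pos($text) = $pos;
    return $text =~ /\G ^/gcmx ? 1 : 0;
}

my $text = "endfoo\n  begfoo a (b);\n";

print line_start_ok($text, 6), "\n";  # 0: the newline is still ahead of pos
print line_start_ok($text, 7), "\n";  # 1: the newline was just consumed
print line_start_ok($text, 9), "\n";  # 0: newline AND indentation consumed
```

If this sketch is right, the second half of the catch-22 does not quite hold: once a previous match has consumed the newline and nothing after it, "\G ^" succeeds. What appears to break it is a trailing greedy `\s*`, which also swallows the next line's indentation and leaves pos in the middle of the line.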
Re^4: regex gotcha moving from 5.8.8 to 5.30.0? by tybalt89 (Monsignor) on Feb 12, 2021 at 02:56 UTC
Tried a whole bunch of things, not all worked, but currently at about 20X faster on 231MB fake file (perl v5.32.0).

#!/usr/bin/perl

use strict; # https://perlmonks.org/?node_id=11128141
use warnings;
use Time::HiRes qw( time );

my $string = do { local(@ARGV, $/) = '50k.foo'; <> };

my $start = time;
parse_v( $string );
printf "seconds %.3f for length %d file\n", time - $start, length $string;

sub parse_v {
    local $_ = shift;
    my $name;
    while( 1 ) {
        if(/\G (?: (?!endmodule\b|module\b) \S+ \s [^;]* ; | (?<!\N) endmodule \b) \s* /gcx) {
        }
        elsif(/\G (?<!\N) module \s+ (\S+?) \s* \( [^)]* \) \s* ; \s* /gcx) {
            $name = $1
        }
        else {
            /\G \z/gcx ? last : die "ERROR: unknown syntax at @{[pos($_)]}\n"
        }
    }
    print "LAST MODULE (Perl $]): $name\n";
}

For double negative fans, `(?<!\N)` means "not preceded by not a newline".
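A small sketch of what `(?<!\N)` accepts, assuming Perl 5.12 or later (where `\N` means "any character except newline"). The helper name `line_initial_offsets` and the sample string are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return the offsets at which
# "endmodule" appears as the very first thing on a line.
sub line_initial_offsets {
    my ($text) = @_;
    my @offsets;
    while ($text =~ /(?<!\N) endmodule \b/gx) {
        push @offsets, $-[0];   # @- holds the start offset of the last match
    }
    return @offsets;
}

my $sample = "endmodule\nfoo endmodule\nendmodule\n";
print join(",", line_initial_offsets($sample)), "\n";   # prints 0,24
```

The lookbehind succeeds at offset 0 (nothing precedes it) and right after a newline, but not after the space in `foo endmodule`. Note that this also means an indented keyword would not match, since a space or tab counts as "not a newline".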
Re^4: regex gotcha moving from 5.8.8 to 5.30.0? by rsFalse (Chaplain) on Feb 11, 2021 at 22:40 UTC
Hm. Might some other variants help? Or are they way slower?

`"\G \s*? ^ \s*"` # non-greedy
`"\G (?= \s* ^ ) \s*"` # look-ahead

Upd. And do they reproduce the regression?

Upd. Might factoring out the `"\s* ^ \s*"` help?

{
    last if $text =~ /\G \s* \Z/gcmsx;

    if ($text =~ /\G \s* ^ \s*/gcmsx) {
        if     ($text =~ /\G module \s+ (\S+?) \s* \( \s* (.*?) \s* \) \s* ;/gcmsx) { $name = $1 }
        elsif  ($text =~ /\G endmodule /gcmsx) { }
        elsif  ($text =~ /\G \S+ \s+ .*? \s* ;/gcmsx) { }
        else { die "ERROR: unknown syntax\n" }
    }
    else { die "ERROR: unknown syntax\n" }

    redo;
}

