PerlMonks  

Re^2: regex gotcha moving from 5.8.8 to 5.30.0?

by SBECK (Chaplain)
on Feb 11, 2021 at 13:13 UTC


in reply to Re: regex gotcha moving from 5.8.8 to 5.30.0?
in thread regex gotcha moving from 5.8.8 to 5.30.0?

So I did some more testing, tweaking the regexps to see if I could narrow it down, and I found a new set of regexps that run at the same speed from 5.8 all the way to 5.30 AND are about 10% faster than the original (on my host, your regexps take 10-12 secs and mine take 9-11).

All of your regexps match up to (but not including) the newline. Based on the comment that '^' might be causing problems, I just added a '\s*' to the end of each regexp so basically each one of them grabs the newline and any additional whitespace. The following runs quickly for all versions:

    sub parse_foo {
        my ($text) = @_;
        my $name;
        {
            last if $text =~ /\G \Z/gcmsx;
            if ($text =~ /\G begfoo \s+ (\S+?) \s* \( \s* (.*?) \s* \) \s* ; \s*/gcmsx) {
                $name = $1
            }
            elsif ($text =~ /\G endfoo \s* /gcmsx) {
            }
            elsif ($text =~ /\G \S+ \s+ .*? \s* ; \s*/gcmsx) {
            }
            else {
                die "ERROR: unknown syntax\n"
            }
            redo;
        }
        print "LAST FOO: $name\n";
    }

Replies are listed 'Best First'.
Re^3: regex gotcha moving from 5.8.8 to 5.30.0?
by mordibity (Acolyte) on Feb 11, 2021 at 21:09 UTC

    Very cool, thanks! It's a seductive solution, being even faster than the original regexes, but I'm having pangs about "correctness" of the format... By letting each sub-regex consume its trailing newline, I can no longer enforce that the main keywords are the first token on any given line, and input like this (all smushed together on one line) isn't flagged as illegal/unknown syntax:

    begfoo a ( a, b, c); endfoo begfoo b ( d, e, f ); input d; foo inst1 (a,b,c); endfoo

    In other words, way too liberal in what I accept! :-) The commercial tools would reject that instantly. But, for my reporting and analysis purposes, it's harmless, and it would let me move to 5.30 and pick up the other benefits of a more modern Perl... Hmm.
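To make the concern concrete, here is a minimal sketch that feeds the one-line input above to the newline-consuming regexes. The helper name `last_foo_name` is hypothetical (it is `parse_foo` changed to return the name rather than print it), and the trailing newline on the input is an assumption.

```perl
use strict;
use warnings;

# Hypothetical variant of parse_foo: same regexes, but returns the last
# name seen instead of printing it, so the result can be inspected.
sub last_foo_name {
    my ($text) = @_;
    my $name;
    {
        last if $text =~ /\G \Z/gcmsx;
        if ($text =~ /\G begfoo \s+ (\S+?) \s* \( \s* (.*?) \s* \) \s* ; \s*/gcmsx) {
            $name = $1;
        }
        elsif ($text =~ /\G endfoo \s* /gcmsx) { }
        elsif ($text =~ /\G \S+ \s+ .*? \s* ; \s*/gcmsx) { }
        else { die "ERROR: unknown syntax\n" }
        redo;
    }
    return $name;
}

# Everything on a single line; the trailing newline is an assumption.
my $one_line = "begfoo a ( a, b, c); endfoo begfoo b ( d, e, f ); "
             . "input d; foo inst1 (a,b,c); endfoo\n";

# No die here: the input is accepted even though the keywords
# are not at the start of a line.
my $result = last_foo_name($one_line);
print "LAST FOO: $result\n";
```

The parse completes without hitting the "unknown syntax" branch, which is the over-liberal behaviour described above.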

    I did spend some time trying to rewrite the sub-regexes to avoid the possibly-poisonous "\s* ^ \s*" so that they all begin with "\G ^" instead, either by having each sub-regex consume its own newline OR by consuming them all in a separate sub-regex (like sw1 suggested in their "# march through any white space"), but I couldn't get it to work. I think it may be a catch-22 scenario: if the newline is present/next in the string, "\G ^" won't match it, since it matches after a newline. But if the newline has been consumed, "\G ^" also won't match it, since it's not there...
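One way to probe where "\G ^" can and cannot succeed is a small experiment on a toy string. This is only a sketch: the input and the helper `bol_at` are made up for illustration. It suggests that "\G ^" fails while the newline is still ahead, succeeds when exactly the newline has been consumed, and fails again once the next line's leading whitespace has also been consumed.

```perl
use strict;
use warnings;

# Toy input (an assumption): two statements, the second line indented.
# Offsets: 0 'a', 1 ';', 2 "\n", 3 ' ', 4 ' ', 5 'b', 6 ';', 7 "\n"
my $text = "a;\n  b;\n";

# Hypothetical helper: returns 1 if "\G ^" (with /m) matches at $offset.
sub bol_at {
    my ($str, $offset) = @_;
    pos($str) = $offset;                 # where \G will anchor
    return $str =~ /\G ^/gcmx ? 1 : 0;
}

my $before_newline = bol_at($text, 2);   # newline is the next character
my $after_newline  = bol_at($text, 3);   # exactly the newline was consumed
my $after_indent   = bol_at($text, 5);   # newline and indentation consumed

print "before newline: $before_newline\n";
print "after newline:  $after_newline\n";
print "after indent:   $after_indent\n";
```

If this holds for the real input, the difficulty may come less from the newline itself than from a trailing "\s*" that also swallows blank lines and indentation, leaving pos() past the start of the line.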

      Tried a whole bunch of things, not all of which worked, but I'm currently at about 20X faster on a 231MB fake file (perl v5.32.0).

      #!/usr/bin/perl

      use strict; # https://perlmonks.org/?node_id=11128141
      use warnings;
      use Time::HiRes qw( time );

      my $string = do { local(@ARGV, $/) = '50k.foo'; <> };

      my $start = time;
      parse_v( $string );
      printf "seconds %.3f for length %d file\n", time - $start, length $string;

      sub parse_v
        {
        local $_ = shift;
        my $name;
        while( 1 )
          {
          if(/\G (?: (?!endmodule\b|module\b) \S+ \s [^;]* ; | (?<!\N) endmodule \b) \s* /gcx)
            {
            }
          elsif(/\G (?<!\N) module \s+ (\S+?) \s* \( [^)]* \) \s* ; \s* /gcx)
            {
            $name = $1
            }
          else
            {
            /\G \z/gcx ? last : die "ERROR: unknown syntax at @{[pos($_)]}\n"
            }
          }
        print "LAST MODULE (Perl $]): $name\n";
        }

      For double negative fans, (?<!\N) means "not preceded by not a newline".
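As a rough illustration of that double negative, the sketch below lists the offsets where (?<!\N) succeeds in a toy string, then uses it to test whether a keyword sits at the start of a line. Both helper names and the sample strings are hypothetical; \N needs Perl 5.12 or later.

```perl
use strict;
use warnings;

# Collect every offset at which the zero-width assertion (?<!\N) succeeds,
# i.e. positions NOT preceded by a non-newline character.
sub line_start_offsets {
    my ($str) = @_;
    my @offsets;
    push @offsets, pos($str) while $str =~ /(?<!\N)/g;
    return @offsets;
}

my @offsets = line_start_offsets("ab\ncd");
print "line starts: @offsets\n";

# Hypothetical helper: is "endmodule" at $offset AND at the start of a line?
sub endmodule_at_line_start {
    my ($str, $offset) = @_;
    pos($str) = $offset;
    return $str =~ /\G (?<!\N) endmodule \b/gcx ? 1 : 0;
}

my $own_line = endmodule_at_line_start("x;\nendmodule\n", 3);  # after a newline
my $mid_line = endmodule_at_line_start("x; endmodule\n",  3);  # after a space

print "own line: $own_line, mid line: $mid_line\n";
```

Because the assertion looks backwards from the current position, it can follow \G directly without needing "^" or /m, which is how the code above keeps the "keyword must start a line" rule.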

      Hm. Might some other variants help? Or are they much slower?
      "\G \s*? ^ \s*" # non-greedy
      "\G (?= \s* ^ ) \s*" # look-ahead
      Update: and do they reproduce the regression?
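Before timing these variants, it may be worth confirming that they consume the same text as the original prefix. The sketch below is not a benchmark; it only compares where each pattern leaves pos() on a toy string. The input, the hash of patterns, and the helper `end_pos` are assumptions for illustration.

```perl
use strict;
use warnings;

# The original prefix and the two proposed variants, as compiled patterns.
my %variant = (
    greedy     => qr/\G \s* ^ \s*/msx,
    non_greedy => qr/\G \s*? ^ \s*/msx,
    lookahead  => qr/\G (?= \s* ^ ) \s*/msx,
);

# Hypothetical helper: match $re starting at $start and return the
# resulting pos(), or undef if the pattern does not match there.
sub end_pos {
    my ($re, $str, $start) = @_;
    pos($str) = $start;
    return $str =~ /$re/gc ? pos($str) : undef;
}

# Toy input (an assumption): trailing spaces, a blank line, then indentation.
my $text = "x;  \n\n   y;";

my %end;
for my $label (sort keys %variant) {
    $end{$label} = end_pos($variant{$label}, $text, 2);  # start just after "x;"
    printf "%-10s ends at %s\n", $label, $end{$label} // 'no match';
}
```

If all three report the same offset on representative input, any timing difference between them can be attributed to how the regex engine searches rather than to a change in what is accepted.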

      Update: might factoring out "\s* ^ \s*" help?
      {
          last if $text =~ /\G \s* \Z/gcmsx;
          if ($text =~ /\G \s* ^ \s*/gcmsx) {
              if ($text =~ /\G module \s+ (\S+?) \s* \( \s* (.*?) \s* \) \s* ;/gcmsx) {
                  $name = $1
              }
              elsif ($text =~ /\G endmodule /gcmsx) {
              }
              elsif ($text =~ /\G \S+ \s+ .*? \s* ;/gcmsx) {
              }
              else {
                  die "ERROR: unknown syntax\n"
              }
          }
          else {
              die "ERROR: unknown syntax\n"
          }
          redo;
      }
