Beefy Boxes and Bandwidth Generously Provided by pair Networks
Syntactic Confectionery Delight
 
PerlMonks  

Slow Regex - How to Optimize

by noslenj123 (Scribe)
on Aug 30, 2005 at 21:41 UTC ( [id://487957]=perlquestion: print w/replies, xml ) Need Help??

noslenj123 has asked for the wisdom of the Perl Monks concerning the following question:

I am parsing some c++ files and at one point I need to extract any subroutine calls in a subroutine.

My current strategy is:

foreach my $line ( @sub_code ) { foreach my $sub ( keys %SUBS ) { if ( $line =~ /[^a-zA-Z]$sub[^a-zA-Z]*\(/ ) { push( @subs, $ke +y ) } } }
Which extracts the subroutine calls okay but it ends up being called 15,427 times and take 64+ seconds.

So is there any way to tune that regex or perhaps I should find a regex strategy to capture all subroutine calls at once for a subroutine block?

Can you nudge a disciple in the right direction?

Replies are listed 'Best First'.
Re: Slow Regex - How to Optimize
by dave_the_m (Monsignor) on Aug 30, 2005 at 21:50 UTC
    You code causes the regex to be recompiled each time round the inner loop. A slight rearrangement should improve things:
    foreach my $sub ( keys %SUBS ) { my $re = qr/[^a-zA-Z]$sub[^a-zA-Z]*\(/; foreach my $line ( @sub_code ) { if ( $line =~ $re ) { push( @subs, $key ) } } }

    Dave.

      That's not true. m/.../ only recompiles if the contents of any interpolated variable has changed. Demo:

      use re 'debug'; foreach (qw( abc abc def )) { print("==============================\n"); print("$_\n"); print("\n"); /$_/; }

      Update: Oops, I didn't notice you reversed the loops in addition to moving the regexp.

      great post.. thanks a lot :) it really solved an issue here. reduced the runtime of a usecase from 30 minutes to ~10 seconds. hats off! :)
Re: Slow Regex - How to Optimize
by GrandFather (Saint) on Aug 30, 2005 at 23:32 UTC

    Benchmark results for the curious

    Rate Original dave_the_m rnahi borisz ikegami + anon Original 17.7/s -- -97% -99% -99% -100% + -100% dave_the_m 603/s 3313% -- -76% -82% -87% + -87% rnahi 2502/s 14054% 315% -- -26% -46% + -47% borisz 3392/s 19088% 462% 36% -- -27% + -28% ikegami 4642/s 26158% 669% 86% 37% -- + -2% anon 4736/s 26693% 685% 89% 40% 2% + --

    Perl is Huffman encoded by design.

      Two notes:

      1) borisz's solution should be even faster on Perl 5.10 with due to a new regexp engine optimization.

      2) My solution and anonymonk's assume that %SUBS keys aren't regexps.

Re: Slow Regex - How to Optimize
by rnahi (Curate) on Aug 30, 2005 at 21:56 UTC

    I would use a different approach, i.e. make a single string from your @sub_code and apply the search for each key just once.

    #untested my $code = join("", @sub_code); foreach my $sub ( keys %SUBS ) { while ( $code =~ /\b$sub\b\(/g ) { push( @subs, $sub ) ; } }

    Notice that your code has a subtle bug. If the same routine is used twice in oone line, you'll get it only once. E.g.:  sqrt(x) + sqrt(y)

    BTW, what is that $key in your code?

      Notice that your code has a subtle bug.
      Even if he is running the code through CPP first (to strip out comments, expand macros, rejoin lines), he still has plenty of corner cases to worry about (strings, fucntion pointers, etc)...
      #define p printf int mai\ n/*this is a comment: main()*/(int argc, char **argv) { int (*f)(int, char **) = &main; p("hello world: main()\n"); if(argc>0) f((argc-1),argv); }
        Oh, and don't forget that [^a-zA-Z] matches "(" and that C identifiers can have digits and underscores...
        _foo2bar((4),2);
Re: Slow Regex - How to Optimize
by ikegami (Patriarch) on Aug 30, 2005 at 22:10 UTC

    Maybe using a generic regexp, then checking against %SUBS would be better? That eliminates a nested loop and multiple regexp compilations.

    foreach my $line ( @sub_code ) { if ( $line =~ /(\w+)\(/ ) { if ( exists $SUBS{$1} ) { push( @subs, $1 ); } } }

    Note: I assumed %SUBS keys are subroutine names, not regexps.

Re: Slow Regex - How to Optimize
by noslenj123 (Scribe) on Aug 30, 2005 at 22:50 UTC
    As usual, I didn't explain the concept of what I'm trying to accomplish along with the problem.

    I gather a list of created subroutines by parsing the .h files. That creates %SUBS.

    For each $sub in %SUBS I need to gather all the subroutines that the $sub calls, but only if is exists in %SUBS.

    From all your inputs I have learned how to work with the code as a string and apply the regex in a while loop which is so far much faster. I'm trying something like:

    while ( $data =~ /\b([a-zA-Z]+)\b\s*\(/g ) { print "$1\n"; }
    Thanks for the direction! I'll post results and look for more advice.
Re: Slow Regex - How to Optimize
by borisz (Canon) on Aug 30, 2005 at 22:09 UTC
    my $str = join '|', sort { length $b <=> length $a } keys %SUBS; my $re = qr/[^a-zA-Z]($str)[^a-zA-Z]*\(/; /$re/ and push @subs, $1 for ( @sub_code );
    Boris
Re: Slow Regex - How to Optimize
by InfiniteSilence (Curate) on Aug 30, 2005 at 22:22 UTC
    If I am not mistaken, this can be rewritten as:
    #!/usr/bin/perl -w use strict; my ($wholefile); my (@keys, %SUBS); $SUBS{'b'} = 1; $SUBS{'c'} = 1; local $/; open(H,qq|$ARGV[0]|) or die "USAGE: wierd2.pl <filename>"; #sloppy exa +mple $wholefile .= <H>; close(H); foreach (keys %SUBS) { while($wholefile=~m/\b$_\b/g){ push @keys, $_; } } 1;
    Of course, I have no idea how:
  • You are getting the %SUBS hash populated in the first place. I thought that was the purpose of the script?
  • Why you would want to do this

    Celebrate Intellectual Diversity

Re: Slow Regex - How to Optimize
by noslenj123 (Scribe) on Aug 30, 2005 at 23:05 UTC
    Okay guys! You rock!

    I recoded it and now instead of that process taking 64+ seconds, it now take .064 seconds. And I learned some more perl to boot! :-) Tx all!

Re: Slow Regex - How to Optimize
by Anonymous Monk on Aug 30, 2005 at 22:24 UTC
    I'd try a non-greedy quantifier...
    if ( $line =~ /[^a-zA-Z]$sub[^a-zA-Z]*?\(/ ) {
    ...and maybe a more generic subroutine finder...
    foreach my $line ( @sub_code ) { if ($line =~ /[^a-zA-Z]([a-zA-Z_]+[a-zA-Z_0-9]*)[^a-zA-Z]*\(/ and exists $SUBS{$1}) { push @subs, $key } }

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://487957]
Approved by kutsu
Front-paged by GrandFather
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others admiring the Monastery: (4)
As of 2024-03-29 06:49 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found