Beefy Boxes and Bandwidth Generously Provided by pair Networks
Come for the quick hacks, stay for the epiphanies.
 
PerlMonks  

RE (tilly) 4: SAS log scanner

by tilly (Archbishop)
on Sep 03, 2000 at 00:41 UTC ( [id://30896]=note: print w/replies, xml ) Need Help??


in reply to RE: RE: RE (tilly) 1: SAS log scanner
in thread SAS log scanner

For an input problem, you could just use the input script again. That is what I did. Of course then you are really testing the time to compile code...

In any case there is overhead to entering a block, and to a hash lookup. Therefore I believe that tye's approach really is faster than your loop. But I went and tested it and found that between the three approaches the difference was all in the compilation. So I went for a longer file (a bit over 4 MB) and found that I was a bit under 27 seconds, tye just over 35, and yours 2 min, 22 seconds. I think tye's is faster than yours, and mine was not so bad after all. :-)

That said, perhaps I am not so ashamed to show the sophisticated method of building the RE, that coerces the RE engine into an optimization it should know about but does not (yet):

use strict; my $match = &ret_match_any( "0 OBS", "AT LEAST", "EXTRANEOUS", "CARTESIAN", "CLOSING", "CONVERT", "DIVISION BY ZERO", "DOES NOT EXIST", "DUE TO LOOPING", "END OF MACRO", "ENDING EXECUTION", "ERROR", "ERRORABEND", "ERRORCHECK=STRICT", "EXCEED", "HANGING", "HAS 0 OBSERVATIONS", "ILLEGAL", "INCOMPLETE", "INVALID", "LOST CARD", "MATHEMAT", "MERGE STATEMENT", "MISSING", "MULTIPLE", "NOT FOUND", "NOT RESOLVED", "OBS=0", "REFERENCE", "REPEAT", "SAS CAMPUS DRIVE", "SAS SET OPTION OBS=0", "SAS WENT", "SHIFTED", "STOP", "TOO SMALL", "UNBALANCED", "UNCLOSED", "UNREF", "UNRESOLVED", "WARNING" ); while(<>) { if ($_ =~ $match) { print "line $., file $ARGV, problem $1\n$_\n"; } } # Takes a list of strings and returns an RE that matches any. sub ret_match_any { my $match_str = &trie_strs(map quotemeta, @_); return qr /($match_str)/; } # Takes a list of escaped strings and returns a single string # suitable for building a match in an efficient way. # Works recursively by grouping strings that share one character sub trie_strs { unless (@_) { return (); } my %rest; foreach my $str (@_) { if (length($str)) { my $chr = substr($str, 0, 1); if ("\\" eq $chr) { $chr = substr($str, 0, 2); push @{$rest{$chr}}, substr($str, 2); } else { push @{$rest{$chr}}, substr($str, 1); } } else { $rest{''} = ['']; } } my @to_join; foreach my $chr (keys %rest) { my $list_ref = $rest{$chr}; if (1 < @$list_ref) { push @to_join, $chr . &trie_strs(@$list_ref); } else { push @to_join, $chr . $list_ref->[0]; } } if (1 < @to_join) { return '(?:' . (join '|', @to_join) . ')'; } else { return $to_join[0]; } }

Replies are listed 'Best First'.
RE: RE (tilly) 4: SAS log scanner
by takshaka (Friar) on Sep 03, 2000 at 10:45 UTC
    Let me guess...you tested this on 5.005, not 5.6?

    I was so flabbergasted by those results that I just had to test myself, and it turns out that there is a huge discrepency between the same code running on 5.005_3 vs 5.6

    Here are my results. tye's neat coderef thing wasn't returning valid output ($1 was always undef since the one we wanted was localized to a different block), so I modified that slightly to get correct results. Also, all of the techniques tested assume only one match per line.

    #!/usr/bin/perl -w use strict; use Benchmark; use vars qw(@problem); @problem = ( "0 OBS", "AT LEAST", "EXTRANEOUS", "CARTESIAN", "CLOSING", "CONVERT", "DIVISION BY ZERO", "DOES NOT EXIST", "DUE TO LOOPING", "END OF MACRO", "ENDING EXECUTION", "ERROR", "ERRORABEND", "ERRORCHECK=STRICT", "EXCEED", "HANGING", "HAS 0 OBSERVATIONS", "ILLEGAL", "INCOMPLETE", "INVALID", "LOST CARD", "MATHEMAT", "MERGE STATEMENT", "MISSING", "MULTIPLE", "NOT FOUND", "NOT RESOLVED", "OBS=0", "REFERENCE", "REPEAT", "SAS CAMPUS DRIVE", "SAS SET OPTION OBS=0", "SAS WENT", "SHIFTED", "STOP", "TOO SMALL", "UNBALANCED", "UNCLOSED", "UNREF", "UNRESOLVED", "WARNING" ); open FOO, ">/dev/null" or die $!; timethese (5, {NO_REGEX => \&no_regex, CODE_REGEX => \&code_regex, BIG_REGEX => \&big_regex, MANY_REGEXES => \&many_regexes }); sub no_regex { local @ARGV = @ARGV; while(<>) { my $up= uc $_; foreach my $p ( @problem ) { if( 0 <= index($up,$p) ) { print FOO "line $.: problem: $p\n$_\n"; last; } } } } sub big_regex { local @ARGV = @ARGV; my $match = ret_match_any(@problem); while(<>) { if ($_ =~ $match) { print FOO "line $.: problem: $1\n$_\n"; } } } sub ret_match_any { # same as tilly's original } sub trie_strs { # same as tilly's original } sub many_regexes { local @ARGV = @ARGV; local @problem = map {qr/(\Q$_\E)/i} @problem; while (<>) { for my $p (@problem) { print FOO "line $.: problem: $1\n$_\n" and last if /$p/; } } } sub code_regex { local @ARGV = @ARGV; my $code= "sub { /(" . join ")/i || /(", map {"\Q$_\E"} @problem; $code .= ')/i and return $1}'; my $match= eval $code; die "$@" unless ref($match) && UNIVERSAL::isa($match,"CODE"); while(<>) { if( my $p = &$match() ) { print FOO "line $.: problem: $p\n$_\n"; } } } __END__ 5.6: chh@scallop test> perl matchtest sample.txt Benchmark: timing 5 iterations of BIG_REGEX, CODE_REGEX, MANY_REGEXES, + NO_REGEX... BIG_REGEX: 90 wallclock secs (89.58 usr + 0.28 sys = 89.86 CPU) @ 0 +.06/s (n=5) CODE_REGEX: 53 wallclock secs (53.13 usr + 0.36 sys = 53.49 CPU) @ 0 +.09/s (n=5) MANY_REGEXES: 60 wallclock secs (59.27 usr + 0.28 sys = 59.55 CPU) @ + 0.08/s (n=5) NO_REGEX: 44 wallclock secs (43.33 usr + 0.30 sys = 43.63 CPU) @ 0 +.11/s (n=5) 5.005_3: Benchmark: timing 5 iterations of BIG_REGEX, CODE_REGEX, MANY_REGEXES, + NO_REGEX... BIG_REGEX: 79 wallclock secs (77.08 usr + 0.61 sys = 77.69 CPU) CODE_REGEX: 357 wallclock secs (354.97 usr + 0.64 sys = 355.61 CPU) MANY_REGEXES: 363 wallclock secs (361.99 usr + 0.73 sys = 362.72 CPU) NO_REGEX: 43 wallclock secs (42.84 usr + 0.19 sys = 43.03 CPU) The 10MB test file was generated thusly: my @chars = map chr($_), 32..127; open SAMPLE, ">sample.txt" or die $!; while (-s SAMPLE < 1024*1024*10) { my $line = join '', map { $chars[rand @chars] } 1..100; substr $line, rand(100), 0, $problem[rand @problem]; print SAMPLE "$line\n"; } close SAMPLE;
    Update: I just noticed that the stub above was using a different name for the @problem array, which makes it look as if I was generating all non-matching lines when I was really generating exactly one match per line.

      With all of the benchmarks flying around, I don't see anyone testing the original suggestion of /this|that|etc/ without the trie fanciness.

      nada: 0.39 CPU @ 2.56/s (n=1) trie: 3.73 CPU @ 0.27/s (n=1) or: 6.04 CPU @ 0.17/s (n=1) alt_re: 8.52 CPU @ 0.12/s (n=1) alt_sub: 8.62 CPU @ 0.12/s (n=1) index: 9.17 CPU @ 0.11/s (n=1) list: 15.38 CPU @ 0.07/s (n=1)

      Note that "or" is faster than either "alt" version, which was my original point. I was just curious is that was still true since I couldn't find the documentation that stated that any more.

      "nada" just reads the data. "trie" is tilly's stuff. "or" is my sub with lots of little regexen. "alt_re" is one big regex with lots of alternates compiled with the qr// operator. "alt_sub" is the same but wrapped in a subroutine and compiled with eval. "index" is my looping and using index(). "list" is looping using lots of little regexec compiled with qr//.

      I think "index" and "list" lose here because of all of the looping for non-matching lines.

              - tye (but my friends call me "Tye")
      I noticed the same problem in tye's first approach but just ignored it rather than fixing. And yes, I was on 5.005_03. So I re-ran your benchmark with my data on my laptop and got:
      bash-2.01$ perl scanall.pl scantst Benchmark: timing 5 iterations of BIG_REGEX, CODE_REGEX, MANY_REGEXES, + NO_REGEX... BIG_REGEX: 138 wallclock secs (133.78 usr + 0.96 sys = 134.74 CPU) CODE_REGEX: 532 wallclock secs (517.61 usr + 2.68 sys = 520.29 CPU) MANY_REGEXES: 644 wallclock secs (626.63 usr + 2.96 sys = 629.59 CPU) NO_REGEX: 176 wallclock secs (170.91 usr + 1.41 sys = 172.32 CPU) bash-2.01$ p56 scanall.pl scantst Benchmark: timing 5 iterations of BIG_REGEX, CODE_REGEX, MANY_REGEXES, + NO_REGEX... BIG_REGEX: 167 wallclock secs (160.56 usr + 2.00 sys = 162.56 CPU) @ + 0.03/s (n=5) CODE_REGEX: 171 wallclock secs (166.17 usr + 1.49 sys = 167.66 CPU) @ + 0.03/s (n=5) MANY_REGEXES: 241 wallclock secs (232.62 usr + 2.10 sys = 234.72 CPU) + @ 0.02/s (n=5) NO_REGEX: 175 wallclock secs (169.77 usr + 1.34 sys = 171.11 CPU) @ + 0.03/s (n=5) bash-2.01$
      My dataset is around 4MB. My laptop is an old laptop, 64 MB RAM, 200 MHz CPU, running Linux. The biggest single difference though is that my data set is a version of the script catted against itself until I got up to around 4 MB. So most lines match some error.

      Therefore your numbers mainly test how quickly it can discard a line, mine how quickly it can recognize that there is one and locate it. My regex, being so complex, gives virtually no chance for optimizations to throwing away lines. It also hits Ilya's new tests for some very slow REs and so slowed down moving forward, while the other tests sped up quite considerably. (Basically the big RE is testing for signs of excessive backtracking. Well the RE is designed specifically with avoiding backtracking in mind, but loses time on the testing.)

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://30896]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others chanting in the Monastery: (6)
As of 2024-04-25 08:04 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found