Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl-Sensitive Sunglasses
 
PerlMonks  

Re: RegExps, Prematch and Postmatch without efficiency penalty

by gmax (Abbot)
on Sep 14, 2003 at 10:48 UTC ( #291375=note: print w/replies, xml ) Need Help??


in reply to RegExps, Prematch and Postmatch without efficiency penalty

Now I haven't benchmarked this assertion,

Why such a hurry?

but for now I'll risk injury and insult by taking the authors' word that unpack is faster.

Hold on. Here it comes. :)

It looks like substr is faster than unpack ...

#!/usr/bin/perl -w use strict; use Benchmark qw(timethese); my $repeat = 10; my $text = ('abc' x $repeat) . 'gotcha' . ('xyz' x $repeat); my ($pre,$match,$post); print "OS: $^O - Perl: $]\n"; timethese( 100000, { 'unpack' => sub { if ($text =~ /gotcha/) { $pre = prematch($text); $post = postmatch($text); $match = match($text); } }, 'substr' => sub { if ($text =~ /gotcha/) { $pre = substr_prematch($text); $post = substr_postmatch($text); $match = substr_match($text); } }, } ); if ($text =~ /gotcha/) { print "unpack\n"; print "prematch :", prematch($text), "\n"; print "match :", match($text), "\n"; print "postmatch :", postmatch($text), "\n"; print "substring\n"; print "prematch :", substr_prematch($text), "\n"; print "match :", substr_match($text), "\n"; print "postmatch :", substr_postmatch($text), "\n"; } sub prematch { return unpack "a$-[0]", $_[0]; } sub postmatch { return unpack "x$+[0] a*", $_[0]; } sub match { my $len = $+[0] - $-[0]; unpack "x$-[0] a$len", $_[0]; } sub substr_match { substr( $_[0], $-[0], $+[0] - $-[0] ) } sub substr_prematch { substr( $_[0], 0, $-[0] ) } sub substr_postmatch { substr( $_[0], $+[0] ) } __END__ (output edited to fit the page better) OS: cygwin - Perl: 5.008 Benchmark: timing 100000 iterations of substr, unpack... substr: 2 wallclock secs ( 1.92 usr + 0.00 sys = 1.92 CPU) unpack: 11 wallclock secs (11.89 usr + 0.00 sys = 11.89 CPU) unpack prematch :abcabcabcabcabcabcabcabcabcabc match :gotcha postmatch :xyzxyzxyzxyzxyzxyzxyzxyzxyzxyz substring prematch :abcabcabcabcabcabcabcabcabcabc match :gotcha postmatch :xyzxyzxyzxyzxyzxyzxyzxyzxyzxyz OS: MSWin32 - Perl: 5.006001 Benchmark: timing 100000 iterations of substr, unpack... substr: 3 wallclock secs ( 1.93 usr + 0.00 sys = 1.93 CPU) unpack: 4 wallclock secs ( 3.87 usr + 0.00 sys = 3.87 CPU) unpack OS: linux - Perl: 5.006001 Benchmark: timing 100000 iterations of substr, unpack... substr: 2 wallclock secs ( 2.26 usr + 0.00 sys = 2.26 CPU) unpack: 4 wallclock secs ( 3.90 usr + 0.00 sys = 3.90 CPU) unpack

Update (1) And the reason is the overhead of interpolating $-[0] and $+[0] in the unpack parameter. If you run the benchmark with unpack "a30" and unpack "a36" it will be faster than substr but useless in general. So The Perl Cookbook was right after all. However, the devil is in the detail ... ;)

Update (2) The turning point is when the string is at least 5000 30,000 chars. With large strings, unpack becomes faster as advertised (Perl 5.6.1). Set $repeat to 5000, put an exit after timethese, and see the results. (Actually it is 5,000 groups of characters, thus creating a 30,000 chars string.)

Update (3) The original subs can be made slightly faster using sprintf instead of a simple interpolation. Using this version, the turning point, where the unpack version becomes faster than the substr implementation, is down to 3500 chars groups (= 21,000 chars).

sub prematch { unpack sprintf("a%d",$-[0]), $_[0]; } sub postmatch { unpack sprintf( "x%d a*", $+[0]) , $_[0]; } sub match { unpack sprintf ("x%d a%d", $-[0], $+[0] - $-[0] ), $_[0]; }
 _  _ _  _  
(_|| | |(_|><
 _|   

Replies are listed 'Best First'.
Re: Re: RegExps, Prematch and Postmatch without efficiency penalty
by Not_a_Number (Prior) on Sep 14, 2003 at 15:20 UTC

    Just out of interest, I thought I'd compare the gain from using constructs such as these compared with $`, $& and $'. I am no benchmark expert, but basically I added this, inside gmax's timethese() loop:

    'vanilla' => sub { if ($text =~ /gotcha/) { $pre = $`; $post = $'; $match = $&; } }, }

    and this inside the if ($text =~ /gotcha/) {} loop:

    print "vanilla\n"; print "prematch :", $`, "\n"; print "match :", $&, "\n"; print "postmatch :", $', "\n";

    The (slightly edited) output, for 500,000 iterations, is clear:

    Benchmark: timing 500000 iterations of substr, unpack, vanilla... substr: 5 wallclock secs <snip> @ 106678.05/s unpack: 8 wallclock secs <snip> @ 54241.70/s vanilla: 0 wallclock secs <snip> @ 492125.98/s

    So, on this basis alone, the much decried $`, $& and $' appear to be respectively almost five and nine times faster than the proposed 'substr' and 'unpack' subs.

    Of course, the real problem, in a programme of any length, does not lie here, but in the fact that "Any occurrence (of $`, $& and $') in your program causes all matches to save the searched string for possible future reference". So I tried modifying each timethese() loop by adding a fairly large number of subsequent match attempts:

    my ($i, $j, $k); for ('a' .. 'z', 'A' .. 'Z') { $i++ if $text2 =~ /$_/; } $j++ if $text3 =~ /quux/; $k++ if $text4 =~ /H/;

    (having, of course, approproately defined strings $text2, $text3 and $text4) and I was surprised to see the following output:

    substr: 25 wallclock secs ... unpack: 26 wallclock secs ... vanilla: 26 wallclock secs ...

    (Full code here:)

    Does this mean:

    1 The much decried $`, $& and $' need to be rehabilitated?

    2 There's a benchmarking problem?

    3 I've missed something (Most likely reply :-)?

    thanks

    dave

      You have a benchmark problem.

      You see, the most vile aspect about $`, $& and $' is the fact that they affect every single regexp in your script or any module used. So if you benchmark it against other approaches that also use regular expressions, you'll slow them down, too. Conclusion: if you want to compare approaches, you should do benchmark them in separate scripts.

      You should also check out how speed compares between matches, especially on extremely long strings, with and without one of ($`, $&, $') being mentioned anywhere in your script. Because it's there that it matters, not where you actually make use of them, but everywhere else.

        ...you should do benchmark them in separate scripts.

        This is similar to the problem I had trying to benchmark the memory footprint of threaded applications. See Benchmark::Thread::Size for an example of how I embedded scripts within scripts to get a clean memory benchmark. I think a similar (probably simpler) approach could be taken to benchmark alternate approaches to $`, $& and $'.

        Liz

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://291375]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others chanting in the Monastery: (5)
As of 2022-12-07 17:19 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found

    Notices?