Beefy Boxes and Bandwidth Generously Provided by pair Networks
Keep It Simple, Stupid
 
PerlMonks  

Re^2: Regex and question of design

by Anonymous Monk
on Apr 14, 2005 at 11:28 UTC ( [id://447707]=note: print w/replies, xml ) Need Help??


in reply to Re: Regex and question of design
in thread Regex and question of design

which gives you the efficiency of performing a single match,
A single match is not always more efficient. A bunch of simple matches can be more efficient than a single, more complicated, match. And that's because of the Perl optimizer. Here's an example:
#!/usr/bin/perl use strict; use warnings; use Regexp::Assemble; use Perl6::Slurp; use Benchmark qw /cmpthese/; use Test::More tests => 1; our @data = slurp '/usr/share/dict/words'; our(@a, @b); cmpthese(-10, { regex => '@a = grep {/qu/ || /x/} @data', ra => 'my $re = Regexp::Assemble->new->add("qu")->add("x")->re; @b = grep {/$re/} @data', }); is_deeply(\@a, \@b); __END__ 1..1 Rate ra regex ra 4.71/s -- -65% regex 13.5/s 186% -- ok 1

Replies are listed 'Best First'.
Re^3: Regex and question of design
by grinder (Bishop) on Apr 14, 2005 at 12:28 UTC
    A bunch of simple matches can be more efficient than a single, more complicated, match.

    True, but I must take you up on two issues.

    Firstly, you are paying for the cost of the construction of the assembled pattern each time through the loop. In practise one would do this only once per run. Hoisting that out of the benchmarked code would make the figures more accurate.

    Secondly, I wouldn't bother with such an approach for two patterns. It only starts to come into its own for a larger number. Where the sweet spot lies, I don't know... my educated guess is more than 10, less than 20.

    But even when you have as few as ten patterns, you have to start worrying about putting /foobar/ before /foo/. Failing to do so will result in 'foobar' never being matched ('foo' will succeed instead). If you have /bin/, /bat/, /bar/, /bong/, ... it is rather wasteful to match against all four and still have it fail just because the target string happens to be is 'bone'. That is what I meant when I talked of efficiency.

    - another intruder with the mooring in the heart of the Perl

      Firstly, you are paying for the cost of the construction of the assembled pattern each time through the loop. In practise one would do this only once per run. Hoisting that out of the benchmarked code would make the figures more accurate.
      Nope. Then the benchmark will be flawed. Constructing the pattern should be inside the benchmarked code. But note that constructing the assembled pattern is done outside of the grep - it's done once for each set of regexes to be matched. Just like it happens in practise.
      If you have /bin/, /bat/, /bar/, /bong/, ... it is rather wasteful to match against all four and still have it fail just because the target string happens to be is 'bone'.
      Indeed, if I use /bin/, /bat/, /bar/, /bong/, I get the following results from the benchmark:
      Rate regex ra regex 9.90/s -- -30% ra 14.2/s 43% --
      due to all 4 patterns starting with the same substring. But changing 'bong' to 'pong', it already flips the other way:
      Rate ra regex ra 5.25/s -- -49% regex 10.2/s 95% --
      So even with 3 out of 4 strings starting with the same character, the assembled pattern loses. If we go /bin/, /hat/, /car/, /pong/, we get:
      Rate ra regex ra 4.45/s -- -56% regex 10.1/s 126% --
      Increasing it to ten simple regexes, I get:
      Rate ra regex ra 2.53/s -- -48% regex 4.89/s 93% --
      And going to 20, I get:
      Rate ra regex ra 1.74/s -- -35% regex 2.69/s 54% --
      (words used: bin, hat, car, pong, zap, digit, foo, umbrella, apple, cherry, red, blue, white, green, yellow, orange, brown, purple, violet, black). This suggests that eventually, using a long bunch of simple regexes is slower than a long complicated one, but that it takes quite a lot for it to be faster.

      Now, I'm not claiming that a bunch of simpler regexes are always faster, not at all. All I'm saying is that the trade-off isn't as clear cut as you presented it.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://447707]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others taking refuge in the Monastery: (7)
As of 2024-04-19 09:55 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found