Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl-Sensitive Sunglasses
 
PerlMonks  

Re: Regex and question of design

by grinder (Bishop)
on Apr 14, 2005 at 10:43 UTC ( [id://447698]=note: print w/replies, xml ) Need Help??


in reply to Regex and question of design

I want to apply a set of regex on each line of this file

At the risk of tiring readers with yet another plug for my module, I'll point out that it offers a "tracked pattern" mode, whereby you can assemble your array of regexps into a single pattern, which gives you the efficiency of performing a single match, and the convenience of being able to determine which, of the original expressions, was the one that matched. The code would go something like:

use strict; use Regexp::Assemble; my $re = Regexp::Assemble->new( track => 1 ); open IN, shift || 'patterns' or die "open pattern file: $!\n"; while( <IN> ) { chomp; $re->add($_); } close IN; # read from, e.g., STDIN while( <> ) { chomp; if( defined( my $match = $re->match($_)) ) { print " $_ matched by $match\n"; } }

What is not obvious, is that behind the scenes, the $re->match() is performing a single match. It is not looping over the entire list of patterns. It has to be this way (rather than using the more intuitive if( /$re/ ) {...}) because of current broken behaviour in the regular expression engine (see bug #32840 for details).

Printing out the which particular pattern caused the match is not particularly helpful. What you really want to do is use the result as a hash key, either to look up a "human-readable" result string or a callback function, whatever suits your needs. If your patterns have captures (e.g. /x=(\d+)/), they are available for use.

- another intruder with the mooring in the heart of the Perl

Replies are listed 'Best First'.
Re^2: Regex and question of design
by Anonymous Monk on Apr 14, 2005 at 11:28 UTC
    which gives you the efficiency of performing a single match,
    A single match is not always more efficient. A bunch of simple matches can be more efficient than a single, more complicated, match. And that's because of the Perl optimizer. Here's an example:
    #!/usr/bin/perl use strict; use warnings; use Regexp::Assemble; use Perl6::Slurp; use Benchmark qw /cmpthese/; use Test::More tests => 1; our @data = slurp '/usr/share/dict/words'; our(@a, @b); cmpthese(-10, { regex => '@a = grep {/qu/ || /x/} @data', ra => 'my $re = Regexp::Assemble->new->add("qu")->add("x")->re; @b = grep {/$re/} @data', }); is_deeply(\@a, \@b); __END__ 1..1 Rate ra regex ra 4.71/s -- -65% regex 13.5/s 186% -- ok 1
      A bunch of simple matches can be more efficient than a single, more complicated, match.

      True, but I must take you up on two issues.

      Firstly, you are paying for the cost of the construction of the assembled pattern each time through the loop. In practise one would do this only once per run. Hoisting that out of the benchmarked code would make the figures more accurate.

      Secondly, I wouldn't bother with such an approach for two patterns. It only starts to come into its own for a larger number. Where the sweet spot lies, I don't know... my educated guess is more than 10, less than 20.

      But even when you have as few as ten patterns, you have to start worrying about putting /foobar/ before /foo/. Failing to do so will result in 'foobar' never being matched ('foo' will succeed instead). If you have /bin/, /bat/, /bar/, /bong/, ... it is rather wasteful to match against all four and still have it fail just because the target string happens to be is 'bone'. That is what I meant when I talked of efficiency.

      - another intruder with the mooring in the heart of the Perl

        Firstly, you are paying for the cost of the construction of the assembled pattern each time through the loop. In practise one would do this only once per run. Hoisting that out of the benchmarked code would make the figures more accurate.
        Nope. Then the benchmark will be flawed. Constructing the pattern should be inside the benchmarked code. But note that constructing the assembled pattern is done outside of the grep - it's done once for each set of regexes to be matched. Just like it happens in practise.
        If you have /bin/, /bat/, /bar/, /bong/, ... it is rather wasteful to match against all four and still have it fail just because the target string happens to be is 'bone'.
        Indeed, if I use /bin/, /bat/, /bar/, /bong/, I get the following results from the benchmark:
        Rate regex ra regex 9.90/s -- -30% ra 14.2/s 43% --
        due to all 4 patterns starting with the same substring. But changing 'bong' to 'pong', it already flips the other way:
        Rate ra regex ra 5.25/s -- -49% regex 10.2/s 95% --
        So even with 3 out of 4 strings starting with the same character, the assembled pattern loses. If we go /bin/, /hat/, /car/, /pong/, we get:
        Rate ra regex ra 4.45/s -- -56% regex 10.1/s 126% --
        Increasing it to ten simple regexes, I get:
        Rate ra regex ra 2.53/s -- -48% regex 4.89/s 93% --
        And going to 20, I get:
        Rate ra regex ra 1.74/s -- -35% regex 2.69/s 54% --
        (words used: bin, hat, car, pong, zap, digit, foo, umbrella, apple, cherry, red, blue, white, green, yellow, orange, brown, purple, violet, black). This suggests that eventually, using a long bunch of simple regexes is slower than a long complicated one, but that it takes quite a lot for it to be faster.

        Now, I'm not claiming that a bunch of simpler regexes are always faster, not at all. All I'm saying is that the trade-off isn't as clear cut as you presented it.

Re^2: Regex and question of design
by Random_Walk (Prior) on Apr 14, 2005 at 12:20 UTC

    grinder++, What a great module.

    When you use tracking to see which regex matched does it still compile to one regex internaly ? The docs say:

    track(0|1)
    Turns tracking on or off. When this attribute is enabled, additional housekeeping information is inserted into the assembled expression using ({...} embedded code constructs. This provides the necessary information to determine which, of the original patterns added, was the one that caused the match.
    $re->track( 1 ); if( $target =~ /$re/ ) { print "$target matched by ", $re->matched, "\n"; }
    Note that when this functionality is enabled, no reduction is performed and no character classes are generated. In other words, brag|tag is not reduced down to (?:br|t)ag<code> and dig|dim is not reduced to <code>di[gm].

    so I infer it it not as optimised as a non tracking version but still better than the looping solution ?

    If there are two regexen in the list that match does it return the first, last, all or an indeterminate selection of the above ?

    Cheers,
    R.

    Pereant, qui ante nos nostra dixerunt!
      When you use tracking to see which regex matched does it still compile to one regex internaly

      Yes.

      not as optimised as a non tracking version but still better than the looping solution ?

      It has the same behaviour insofar as at any point during a tracked pattern match, the "degrees of freedom" the engine has available to try is the same as for an ordinary assembled pattern. It's just that the tracked pattern is stuffed full of (?{...}) zero-width eval assertions.

      If there are two regexen in the list that match does it return the first, last, all or an indeterminate selection of the above ?

      For a given target, the same path through the pattern will always be followed. In that regard it is perfectly deterministic, it is just that it is sometimes hard to determine in advance what that will be. It sort of makes sense if you squint hard enough. Consider:

      #! /usr/local/bin/perl -w use strict; use Regexp::Assemble; my $re = Regexp::Assemble->new(track => 1) # remember to double up your backslashes ->add( '^X\\d+' ) ->add( '^X\\d\\d*' ) ->add( '^\\s*X\\d\\d*' ) ->add( '^X\\d\\d' ) ->add( '^X\\d' ) ; while( <DATA> ) { chomp; print $re->matched, " matched <$_>\n" if $re->match($_); } __DATA__ XY1 X234 X56 X4 Z0 X77

      This produces:

      ^X\d\d* matched <X234> ^X\d\d* matched <X56> ^X\d\d* matched <X4> ^\s*X\d\d* matched < X77>

      But I would stress above all that the list of patterns in this case is in need of reformulation anyway. Hmm, in fact, this is very interesting. With a suitably exhaustive population of target test stings, you could use this approach to weed out "can't happen" patterns.

      Thank-you for asking this question! I might recycle some of the ideas in this thread into examples for the module distribution.

      - another intruder with the mooring in the heart of the Perl

Re^2: Regex and question of design
by Anonymous Monk on Apr 14, 2005 at 11:21 UTC
    Very interesting (I'm just finishing to read your README). I think that I'll program two scripts, one with my original idea and one with your module. So I can learn and understand more ;-)

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://447698]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others admiring the Monastery: (3)
As of 2024-04-20 05:47 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found