Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister
 
PerlMonks  

Count multiple pattern matches

by johnnywang (Priest)
on Dec 07, 2004 at 01:19 UTC ( [id://412798]=perlquestion: print w/replies, xml ) Need Help??

johnnywang has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a collection of patterns (actually just keywords). For a given input string, I'd like to find out the number of matches for each of the keywords. A simple solution is as follows:
use strict; my @keywords=qw/foo bar 12345 abcd/; my $string = "foobarfoo1234523423412345abcdefsadfabc"; my %result; foreach my $k (@keywords){ my @matches = ($string =~ /$k/ig ); $result{$k} = scalar(@matches); } foreach my $s (keys %result){ print "$s=>$result{$s}\n"; } __END__ bar=>1 12345=>2 foo=>2 abcd=>1
Now my problem is that the array @keywords can be big (~ 10000), and the string can be pretty long. The above method goes through the string many times (same as @keywords), I'd like to make it more efficient by going through it, say once. One step in that direction is probably concat a long expression:
$pattern = join("|", @keywords);
but how do I get the count? Are there other ways (e.g., construct explicitly some kind of DFA) to do this? Thanks.

20041208 Edit by castaway: Changed title from 'Count mutliple pattern matches'

Replies are listed 'Best First'.
Re: Count multiple pattern matches
by Zaxo (Archbishop) on Dec 07, 2004 at 02:04 UTC

    One way is almost what you have, but with some changes,

    use strict; my (@keywords, %keyword)=qw/foo bar 12345 abcd/; my ($string, %result) = "foobarfoo1234523423412345abcdefsadfabc";
    Precompile the regexen, @keyword{@keywords} =  map {qr/\Q$_\E/} @keywords; Now get the count directly without any named temporary,
    $result{$_} = () = $string =~ $keyword{$_} for (keys %keyword);
    That's no big change over what you have, but uses some idiomatic optimizations.

    Another way is to count within the big regex you mention. You can do that with a code construction in the re,

    my @regexen = map { qr/(?:\Q$_\E(?{$result{$_}++}))/ } @keywords; my $re = do { local $" = '|'; qr/@regexen/; };
    I've used my favorite tricky way of getting alternation into an array there (qr// is an interpolating quote operator).
    $string =~ /$re/g; print "$_: $result{$_}\n" for @keywords;
    There, the regex engine should only evaluate the code part if the text has matched, and then restart the regex at pos for the next match. Untested,

    Update: Perl qr// dosn't seem to like running the (?{$result{$_}++}) bit. I'm not sure why. Anybody know?

    A third way is to munch through the string with index for each word you want to match.

    It may be worthwhile to study your text before running the regex matches on it. Benchmark your different approaches, chances are that each will be best for some cases.

    After Compline,
    Zaxo

      I've had a look at your pretty solution (with qr). However, it currently does not work, for several reasons.

      The first in the one you mention : it does not compile, because of qr and (?{...}) block. That is explained in perlre, about the (?{...}) block :

      For reasons of security, this construct is forbidden if the regular expression involves run-time interpolation of variables, unless the perilous use re 'eval' pragma has been used (see re), or the variables contain results of qr// operator (see perlop/"qr/STRING/imosx").

      In short, that is to prevent a external variable containing such a block to be executed in the regex engine. The solution is described in the doc : use re 'eval'

      However, some other thing does not work yet : when the regex qr/(?:(\Q$_\E))(?{$result{$_}++})/ is matched, that does not set the %result hash entry, because at execution time, $_ is not what you expect. If you look at the regex created :

      print $re; __DATA__ (?-xism:(?-xism:(?:(foo))(?{$result{$_}++}))|(?-xism:(?:(bar))(?{$resu +lt{$_}++}) )|(?-xism:(?:(12345))(?{$result{$_}++}))|(?-xism:(?:(abcd))(?{$result{ +$_}++})))

      you'll see that $_ variable is not replaced by the keyword values. To my understanding, it is because (?{...}) contains code that will be interpreted later, so perl does not interpret in at that time, leaving it unmodified.

      So, to do what we want here, I use (...) to catch a execution time the match and set the correct hash entry. There is a final problem : the name of the variable.

      my $data = "foofdsfdsbar"; my @matches = $data =~ m/(foo)(?{print "($1,$2,$+)"})|(bar)(?{print "($1,$2,$+)" +})/g; __DATA__ (foo,,foo)(,bar,bar)

      That is to say : $+ is the only one that surely contains the last pattern matched (BTW, I know about performance penalty when using it, but I won't care for now ;)

      So, let's code it !

      #!/usr/bin/perl use strict; use re 'eval'; my @keywords = qw/foo bar 12345 abcd/; my ($string) = "foobarfoo1234523423412345abcdefsadfabc"; our %result; my @regexen = map { qr/(?:(\Q$_\E))(?{$result{$+}++})/ } @keywords; my $re = do { local $" = '|'; qr/@regexen/; }; my @match = $string =~ /$re/g; print "$_: $result{$_}\n" for @keywords;

      --
      zejames
Re: Count multiple pattern matches
by tilly (Archbishop) on Dec 07, 2004 at 02:50 UTC
    Use a capturing subgroup.
    use strict; my @keywords=qw/foo bar 12345 abcd/; my $string = "foobarfoo1234523423412345abcdefsadfabc"; my %result; my $pattern = join "|", @keywords; while ($string =~ /($pattern)/ig) { $result{$1}++; } foreach my $s (keys %result){ print "$s=>$result{$s}\n"; }
    As an efficiency note I'm not sure whether the pattern above keeps on getting recompiled. You may wish to benchmark a few variations. It is also possible that it is faster to match many times than it is to do one big complex match - I've certainly seen that happen in the past. And you should definitely look at Regexp::PreSuf.

    As a bug note, if your strings are "foo", "bar", and "foobar", the original approach will correctly spot one of each in "foobar". This approach will not. (Depending on order it might spot "foobar" or it might spot "foo" and "bar".) This gets even more complicated if you're looking at things like "bazfoo" and "foobar".

Re: Count multiple pattern matches
by bobf (Monsignor) on Dec 07, 2004 at 05:58 UTC

    If the 'pretty long' string is 'long enough' and you have a lot of keywords, you might be better off using a non-regex solution. The example below creates an index of $string, so finding exact matches is fast. Using very short substrings for the index will not be very efficient, though, so if your keywords are only a few characters a regex might be better.

    This example uses a snippet of DNA sequence for $string and an arbitrary substring length of 3 (but that could be set based on the length of the shortest element in @keywords instead). The location of all the matches are stored in %matches, but you could speed things up by just incrementing a counter if you didn't need them for anything later.

    use strict; use warnings; my $string = 'ccaaactcagtggggtgaatggggcttctctgtgctctgatagcttccctaccctt +tcccttctccagctcccgtcccttctgactgtgagcagccccctcctctccactgttcccctcctgttg +tcagaggagggcccagctgaggcagggactggaccaccggctggggtgtccctaggggtcttgggtggc +tggcagtagtggagcctggggctgagaggggaagcaaaataagattgtcctccaacttagccatcctca +ggcctgctggggctatttaactggctgggcctgcatggcgacagggcccctacagcctccctgggaaca +aggggtgaagggttcagggggaagggggtcacagagtgatggagaaacctcttgagaacaaactaggct +ccctcatgctggagtccaaggctgagtacctcccttctctgaaacagagcaacaaccccactcccaccc +cgagtctgtc'; my @keywords = qw( ttc gctg ccaac ggggct ccc ); my $substrlen = 3; # or use length of shortest element in @keywords # create an index of all substrings of length $string my %substrings; for( my $i = 0; $i <= length($string) - $substrlen; $i++ ) { my $substring = substr( $string, $i, $substrlen ); push( @{ $substrings{$substring}{hits} }, $i ); } # search the index for all elements of @keywords my %matches; foreach my $keyword ( @keywords ) { my $subkey = substr( $keyword, 0, $substrlen ); if( exists $substrings{$subkey} ) { foreach my $hit ( @{ $substrings{$subkey}{hits} } ) { if( $keyword eq substr($string, $hit, length($keyword)) ) { push( @{ $matches{$keyword} }, $hit ); } } } } foreach my $keyword ( keys %matches ) { print "Found $keyword ", scalar @{ $matches{$keyword} }; printf " time%s at position%s: ", scalar @{ $matches{$keyword} } > 1 ? 's' : '', scalar @{ $matches{$keyword} } > 1 ? 's' : ''; print join( ', ', @{ $matches{$keyword} } ), "\n"; } ** OUTPUT ** Found gctg 8 times at positions: 140, 164, 192, 214, 268, 286, 408, 42 +1 Found ttc 8 times at positions: 25, 44, 55, 60, 78, 111, 344, 435 Found ccaac 1 time at position: 245 Found ccc 22 times at positions: 46, 51, 57, 70, 75, 94, 95, 96, 113, +114, 136, 174, 309, 310, 321, 401, 432, 456, 457, 463, 467, 468 Found ggggct 3 times at positions: 20, 211, 271

    I just thought a different approach might be interesting. YMMV. I'm sure there is a break even point between creating a regex with a lot of alternation and the time spent creating the index. I'd recommend benchmarking a few methods with some of your actual data to see what works best.

Re: Count multiple pattern matches
by Mr. Muskrat (Canon) on Dec 07, 2004 at 02:39 UTC

    It's probably not the best way to do it but this is what sprang to mind when I saw this post. It's also very similar to what you already have.

    use strict; my @keywords = qw/foo bar 12345 abcd/; my $string = "foobarfoo1234523423412345abcdefsadfabc"; my $re = join('|',@keywords); my %result; $result{$1}++ while ($string =~ /($re)/ig); foreach my $s (keys %result){ print "$s=>$result{$s}\n"; }

Re: Count multiple pattern matches
by Light Elf (Initiate) on Dec 07, 2004 at 10:18 UTC
    removing previous keywords from a string should make your algoritm faster.
    ... foreach my $k (@keywords){ my $matches_count = ( ($string =~ s/$k//ig ) ); ...

      This won't work correctly if keywords overlap.

      For example, consider the string foops. This actually matches both foo and oops. But if you remove foo in the first iteration, the remaining string (ps) won't match the keyword oops, thus reducing the match count for oops from 1 (the correct result) to 0.

Re: Count multiple pattern matches
by kappa (Chaplain) on Dec 07, 2004 at 15:01 UTC
    Try Regexp::List. It converts qw/foobar fooxar foozap fooza/ to qr/foo(?:[bx]ar|zap?)/ using Trie structure so that the resulting regex is much more efficient than naive /foobar|fooxar|.../. And tillys approach to counting each word by capturing the match sounds great.
    --kap
Re: Count multiple pattern matches
by melora (Scribe) on Dec 07, 2004 at 14:12 UTC
    (sorry, just not feeling up to coding at the moment)

    How about this idea: put the keywords into a hash, with a zero as the hash value for each keyword. Then for each token in the input string, increment the hash value for that token (check to see whether it's in the hash first?). The non-zero hash values are then counts of how many times each keyword appears in the input string. That leaves the hunting for the tokens to the hash implementation... but you still have to look for the non-zero hash elements. Something like foreach keyword in the hash, if the value is greater than zero, spit out the keyword and the value (then set it to zero again, if this is all inside a loop)... ? The hunting-through-an-array thing is one of my old C habits that I'm trying to break.

    Please don't scream, just gently correct me if I'm being stupid.
Re: Count multiple pattern matches
by pijll (Beadle) on Dec 07, 2004 at 16:39 UTC
    Have you tried to insert a study $string; outside the loop, before you do the first regex match? This should speed up the matches, especially if $string is very long.
    perldoc -f study
    for more information.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://412798]
Approved by kvale
Front-paged by sgifford
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others cooling their heels in the Monastery: (5)
As of 2024-03-28 10:32 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found