Beefy Boxes and Bandwidth Generously Provided by pair Networks
The stupid question is the question not asked
 
PerlMonks  

Matching against list of patterns

by Eyck (Priest)
on Sep 16, 2004 at 11:50 UTC ( [id://391416]=perlquestion: print w/replies, xml ) Need Help??

Eyck has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I find myself writting another app this month that requires matching against list of patterns, what I did the last time is:

#... while ($line=<>) { #... foreach $user (keys(%$users)) { my $pattern=$user->{'Pattern'}; if ($line=~/$pattern/i) { print STDERR "Great, we found pattern $pattern!\n"; } }; };

This is straitghtforwad and it works.

The problem is that it grows roughly N**2, (because there is relation between number of lines to parse, and amount of possible patterns).

UPDATE: I keep patterns in hashref because the whole point of this excercise is to figure out WHICH regexp matched. ORing patterns together might be fine if I knew how to figure out which patter from such ORed pattern matched exactly.

What do people do with such problems? Is it possible to optimise this? ( for example, group patterns into eee...groups based on common prefix or something common, or maybe preprocess those lines, split sentences into functional parts .. )

This seems like rather common task, are there any smart (or at least working) solutions?

Replies are listed 'Best First'.
Re: Matching against list of patterns
by tachyon (Chancellor) on Sep 16, 2004 at 12:25 UTC

    The typical method is to dynamically build an alternation RE and let the RE engine do all the optimisation and hard work in lovingly hand optimised C. We use qr// so the RE is compiled once and avoid vast numbers of hash lookups, functionally useless variable assignments*, and RE compiles that you are doing above. As a general rule loops are great territory for optimisation due to the repeated execution of the loop code.

    my @terms = qw( foo bar baz fool foolish ); my $re = join '|', map{quotemeta} sort {length($b)<=>length($a)}@terms +; $re = qr/($re)/; while(<DATA>){ my @matches = $_ =~ m/$re/g; next unless @matches; print "Got @matches\n" } __DATA__ Only a foobar foolish fool skips the sort by length Without this foo you look foolish and may match short elements!

    Hash lookup tables may still have a place to convert a matched value into something else. Here is a trivial example. Beware /i as you need to lc($1) and use all lowercase keys in %terms or you won't get a match/lookup.

    my %terms = ( foo => 'FOO', bar => 'BAR' ); my $re = join '|', map{quotemeta} sort {length($b)<=>length($a)}keys % +terms; $re = qr/($re)/i; while(<DATA>){ s/$re/$terms{lc($1)}/g; print; } __DATA__ foo is a word Bar ba blacksheep

    cheers

    tachyon

      Tachyon,
      I have a similar problem, over six hundred complex regexen to match against a busy logfile and related messages to issue depending on which regexps were matched. much the same problem as the OP

      If I understand the regexp engine caching the compiled version of a regex if it is not going to change then I think this should be a reasonably efficient approach. Am I on the right tracks ? and is the /o unrequired as I have already interpolated the variable when the regex is first called ?

      #!/usr/local/bin/perl -w use strict; my ($i, $compile_me, @names)=(1, "{my \@matches;", "no match"); while (<DATA>) { next if /^\s*$/; last if /END CONFIG/; chomp; my ($name, $reg)=split; push @names, $name; $compile_me.="push \@matches, $i if /$reg/o;"; $i++; } $compile_me.="\@matches}"; while (<DATA>) { chomp; print "\nmatches found for $_\n"; my @matches=eval $compile_me; foreach (@matches) { print $names[$_] , "\n" } } __DATA__ Fred_and_Friends fr.d Paul_and_co paul some_numbers \d{2} freud_likes_fred fr END CONFIG freud fred NaNa pauline 12312sdfsdf 2
      Update with speed test

      I have now run a comparative test over 300^H^H^H, sorry 416 lines of log, with my 672 pattern matches. First using the eval of a string containing all the regexen and returning match index numbers as above. Second is my old naive code holding an array with the regexen and doing a foreach through it against each line. I did not use the /o for the reasons given above it works fine without it

      >time ./Monitor.fast real 0m1.49s user 0m0.68s sys 0m0.58s >time ./Monitor real 0m19.47s user 0m14.69s sys 0m0.50s >

      I think the numbers speak for themselves

      Cheers,
      R.

        Let me try decompile/understand what's going on here,

        One obvious improvement is using /o, right?

        The other is unwinding foreach loop into linear list of matches, what is the gain in that? We avoid walking the array, but I never thought that operation is that costly?

        Is this what's going on or am I missing something?

      This is very nice, but this works well with terms, not that well with regular expressions (ie, not 'foo' but something like 'f[o|O]{1,3}' instead).

      Also, this solves little because this only tells me that my line matches one of my patterns.. that is great, but the problem here is finding WHICH re it matches.

      I look at your answer, and I think I'm not clever enough to extend this kind of idea to those requirements.

      And another thing, how flexible and scalable are perl's res, what happens when there's 1000s of ~40 chars long REs ganged together, will regexp engine/optimizer sort through that?

        Remove the map quotemeta and you can use REs to your hears delight. The only issue is then extra | in those REs. My example captures the physical match. It is a simple task to map that back to the pattern (or patterns) that match that if required. The benchmark at Re^2: Matching against list of patterns shows a full order of magnitude speed improvement so the optimiser does OK.

        cheers

        tachyon

Re: Matching against list of patterns
by davorg (Chancellor) on Sep 16, 2004 at 12:29 UTC

    Can you explain a little more about your data structures. It looks like you have hash references as the keys in your %$users hash - and that strikes me as a little scary (and it almost certainly doesn't do what you think it does).

    The solution to your problem is probably to create a large regex from all off the individual regexes and use that instead.

    --
    <http://www.dave.org.uk>

    "The first rule of Perl club is you do not talk about Perl club."
    -- Chip Salzenberg

      What is so scary about hashrefs?

      And I don't know how to create and use such beast, how can I figure out which re of  join "|" , @res matched?

      And wouldn't that strain our re engine too much (we're talking about scalability here, will hardware improvements keep such code running with number of patterns and lines to parse linearly growing? )

        What is so scary about hashrefs?

        Absolutely nothing.

        But anything that you use as a hash key gets converted into a string. And that means that when you pull the keys back out you can no longer use them as hash references - which is what you seem to be doing.

        Can you show us the code that builds your %$users hash?

        --
        <http://www.dave.org.uk>

        "The first rule of Perl club is you do not talk about Perl club."
        -- Chip Salzenberg

Re: Matching against list of patterns
by grinder (Bishop) on Sep 16, 2004 at 13:43 UTC
    What do people do with such problems? [...] groups based on common prefix

    Yes, this is exactly what I am doing. If you don't care which pattern matched, just that one pattern among many matched, then you can do just that.

    I am in the middle of writing a module that will let you assemble an arbitrary number of regular expressions, and combine them into trie structure, and from there produce an expression that will produce a minimal amount of backtracking behaviour. (Things like /a.*b/ notwithstanding).

    The basic stuff works and tests okay, I'm adding code to shorten the length of the generated regular expression.

    The main problem with using a brute force solution of just |ing the different patterns together is that on a target string that does not match, the engine will have to do a significant amount of work before working out that this is in fact the case.

    In the 5.10 TODO list, Nicholas Clark talks about the fact that it would be nice to have the RE compiler to do this for you, but having already battled doing it in Perl, I think I'd rather gouge my eyes out with a blunt stick than to recast the algorithm in C.

    ... and as far as I know (and have asked) no, no-one appears to have done this yet, which sort of surprised me.

    update: urg! I missed the part about the fact that you did want to know which one matched. I'm using this in relation to spam, which is why I don't care what matched, just that something matched and therefore triggers a different course of events.

    Thinking about how to extend the approach, I think that all I need to do is to add code into the end of each RE with the (?{ code }) construct that sets a variable to record which rule matched. Hmm, I'll put that in the TODO list, but first off I'd like to get the damned thing out the door.

    - another intruder with the mooring of the heat of the Perl

      I care about which pattern matched, that's the whole point

      Where can I find more about what you're working on?

Re: Matching against list of patterns
by BrowserUk (Patriarch) on Sep 16, 2004 at 15:40 UTC

    Here's one way--not sure how it would scale though.

    Updated to remove a couple of artifacts.

    Output:

    P:\test>391416 users [barney wilma] selected: '*Fred Flintstone. He is a roly-poly ro +tund fellow. I always found it ' users [barney wilma] selected: 'amazing, that he wore the same outfit +day in and day out. Fred works ' users [barney wilma] selected: 'boulders. His boss is Mr. Slate. Fred +is the typical prehistoric man. ' users [barney wilma] selected: 'is, I will never know. haha. Fred is a +lso a member of a lodge and he ' users [barney wilma fred] selected: '*Wilma Flintstone. Wilma is marri +ed to Fred. She is a slim, willowy ' users [barney wilma fred] selected: 'up with fat Fred, I will never kn +ow. Wilma keeps a very clean ' users [barney wilma fred] selected: 'Wilma and Fred Flintstone have ne +ighbors, who happen to be their ' users [barney betty wilma fred] selected: '*Barney Rubble. He is short +er than Fred. He is Fred's best friend and ' users [barney betty wilma fred] selected: 'always ends up involved in +Fred's shenanigans. Barney rides to work ' users [barney betty wilma fred] selected: 'with Fred. This is not a go +od arrangement because Barney has had to ' users [barney betty wilma fred] selected: 'run after Fred on more than + one occasion. Barney is also married to a ' users [barney] selected: 'pretty woman. Her name is Betty.' users [barney betty fred] selected: '*Betty Rubble. She is taller than + Barney and weighs about 50 pounds ' users [fred] selected: 'day after day. Her best friend is Wilma, of co +urse. Their houses are ' users [barney fred] selected: 'side by side, so Betty is at Wilma's ho +use a lot and vice-versa. ' users [barney fred] selected: 'Neither Betty nor Wilma work for a livi +ng.' users [barney fred] selected: 'Wilma and Betty have one child each. Th +ey are:' users [barney wilma fred] selected: '*Pebbles. She is the daughter of +Wilma and Fred. She is a very cute ' users [wilma] selected: 'Pebbles is one of the happiest and smartest b +abies ever on television.' users [barney betty fred] selected: '*Bam-Bam. Barney and Betty did no +t give birth to Bam-Bam. Someone left' users [barney betty fred] selected: 'so happens, the night before, Bar +ney and Betty had wished for a child ' users [betty fred] selected: 'because they wanted to be parents. Bam-B +am is adopted by Barney and ' users [barney betty wilma] selected: 'Betty. He is a very strong littl +e baby. Of course, Bam-Bam and Pebbles ' users [barney wilma] selected: 'Fred Flintstone also has to endure liv +ing with animals also. They have ' users [barney wilma] selected: 'Fred and loves to greet him when he co +mes homes. Fred, on the other ' users [wilma] selected: 'just so happens to be a good baby sitter/watc +hdog for Pebbles.' users [barney betty fred] selected: 'Barney and Betty have a kangaroo +type animal named Hoppy. Hoppy and ' users [betty] selected: 'Dino are best friends. Hoppy also lets Bam-Ba +m get in his pouch.' users [barney betty wilma fred] selected: 'named Kazoo. He only can be + seen by Fred and Barney. He has helped ' users [barney wilma] selected: 'Fred get out of more than one fix. In +other episodes, when Fred needed ' users [barney wilma] selected: 'Fred drives his car with foot power. I + was always amazed at the '

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
Re: Matching against list of patterns
by hv (Prior) on Sep 16, 2004 at 14:07 UTC

    There are a couple of ways to discover what pattern matched. The easiest is probably to use backrefs:

    my $pattern = join '|', map "($_)", @patterns; ... if ($line =~ /$pattern/i) { my $matchref = $#-; print "matched $pattern[$matchref - 1]\n"; }

    As warned by other posters, you may need to play with the ordering to ensure that the preferred pattern matches if more than one can, in which case you will also need a way to get back from the index of the matched backreference to the actual pattern.

    Hugo

      To get the matched pattern matched text you can simply do:
      my $pattern = join '|', map "($_)", @patterns; ... if ($line =~ /($pattern)/i) { print "matched $1\n"; }
      But be carefull if you want to get another matched text in this line. Somewhat it eats up all $1,$2,$3. Use $+ to get the last matched text of the line.

      Schuk

      Edit: Sorry HV I should learn to read more carefully.

        That shows you what text matched, not which pattern matched it.

        There's nothing wrong with it as a way of finding out what text matched, but in the node you replied to I was specifically trying to address this aspect of the root node: the whole point of this excercise is to figure out WHICH regexp matched.

        Hugo

Re: Matching against list of patterns
by diotalevi (Canon) on Sep 16, 2004 at 15:54 UTC

    For this matching, is there only one user per line? If a user is found on a line, is it then not ever found again? I've written samples for you which answer those questions differently. There may be ways to optimize each of these for your specific problem. It just depends on which problem you are solving.

    There is a bug in your original code - you said keys %$users and then dereferenced the key directly like $user->{'Pattern'}. $user is a plain string so that is a symbolic reference. Using strict would have caught that bug for you. You meant to write $users->{ $user }{ 'Pattern' } which properly looks up the value named $user in the hash reference $users.

    There is a potential bug depending on your data. The string "aa" matches "a" and "aa". If you ask for only the first match, then the more complete, perhaps more correct match will not be attempted if you stop. You may need to adjust your logic to account for the length of the match to see which pattern matched "better". None of my examples correct for this.

    Each line may match multiple users and once found, are not looked for anymore. This may be be the fastest because it can reduce the search space by multiples with each iteration.

    # Precompile all the patterns and store them into the key # CompiledPattern $_->{'CompiledPattern'} = qr/$_->{'Pattern'}/i for values %$users; my %unmatched_users; @unmatched_users{ keys %$users } = (); while ( my $line = <> ) { ... my @users = grep $line =~ $users->{$_}{'CompiledPattern'}, keys %unmatched_users; if ( @users ) { warn "Great, we found " . join( ', ', map $_->{'Pattern'}, @{$users}{ @users } ) . " user(s)!\n"; delete @unmatched_users{ @users }; } else { warn "$line didn't match any users.\n"; } }

    Each line may match one user. Once a user is found, it is not looked for anymore. This may be be the fastest because it reduces the search space with each successful match and if any match is found, stops looking for any more.

    use List::Util 'first'; # Precompile all the patterns and store them into the key # CompiledPattern $_->{'CompiledPattern'} = qr/$_->{'Pattern'}/i for values %$users; my %unmatched_users; @unmatched_users{ keys %$users } = (); while ( my $line = <> ) { ... my $user = first { $line =~ $users->{$_}{'CompiledPattern'} } keys %unmatched_users; if ( defined $user ) { warn "Great, we found pattern $user->{'Pattern'}!\n"; delete $unmatched_users{ $user }; } else { warn "$line didn't match any users.\n"; } }

    Each line may match *one* user but users may be found on multiple lines. The search space remains constant.

    # Precompile all the patterns and store them into the key CompiledPatt +ern $_->{'CompiledPattern'} = qr/$_->{'Pattern'}/i for values %$users; while ( my $line = <> ) { ... my $user = first { $line =~ $users->{$_}{'CompiledPattern'} } keys %$users; if ( $user ) { warn "Great, we found pattern $user->{'Pattern'}!\n"; } else { warn "$line didn't match any users.\n"; } }

    Each line may match multiple users and users may be found on multiple lines. This is the worst case sample you already had.

    # Precompile all the patterns and store them into the key # CompiledPattern $_->{'CompiledPattern'} = qr/$_->{'Pattern'}/i for values %$users; while ( my $line = <> ) { ... my @users = grep $line =~ $users->{$_}{'CompiledPattern'}, keys %$users; if ( @users ) { warn "Great, we found " . join( ', ', map $_->{'Pattern'}, @{$users}{ @users } ) . " user(s)!\n"; } else { warn "$line didn't match any users.\n"; } }

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://391416]
Approved by muntfish
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others cooling their heels in the Monastery: (6)
As of 2024-04-16 07:04 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found