Matching against list of patterns

Eyck has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Matching against list of patterns by tachyon (Chancellor) on Sep 16, 2004 at 12:25 UTC
The typical method is to dynamically build an alternation RE and let the RE engine do all the optimisation and hard work in lovingly hand optimised C. We use qr// so the RE is compiled once and avoid vast numbers of hash lookups, functionally useless variable assignments*, and RE compiles that you are doing above. As a general rule loops are great territory for optimisation due to the repeated execution of the loop code. `my @terms = qw( foo bar baz fool foolish ); my $re = join '\|', map{quotemeta} sort {length($b)<=>length($a)}@terms +; $re = qr/($re)/; while(<DATA>){ my @matches = $_ =~ m/$re/g; next unless @matches; print "Got @matches\n" } __DATA__ Only a foobar foolish fool skips the sort by length Without this foo you look foolish and may match short elements!` [download] Hash lookup tables may still have a place to convert a matched value into something else. Here is a trivial example. Beware /i as you need to lc($1) and use all lowercase keys in %terms or you won't get a match/lookup. `my %terms = ( foo => 'FOO', bar => 'BAR' ); my $re = join '\|', map{quotemeta} sort {length($b)<=>length($a)}keys % +terms; $re = qr/($re)/i; while(<DATA>){ s/$re/$terms{lc($1)}/g; print; } __DATA__ foo is a word Bar ba blacksheep` [download] cheers tachyon	[reply] [d/l] [select]
Re^2: Matching against list of patterns by Random_Walk (Prior) on Sep 16, 2004 at 14:26 UTC
Tachyon, I have a similar problem, over six hundred complex regexen to match against a busy logfile and related messages to issue depending on which regexps were matched. much the same problem as the OP If I understand the regexp engine caching the compiled version of a regex if it is not going to change then I think this should be a reasonably efficient approach. Am I on the right tracks ? and is the /o unrequired as I have already interpolated the variable when the regex is first called ? #!/usr/local/bin/perl -w use strict; my ($i, $compile_me, @names)=(1, "{my \@matches;", "no match"); while (<DATA>) { next if /^\s$/; last if /END CONFIG/; chomp; my ($name, $reg)=split; push @names, $name; $compile_me.="push \@matches, $i if /$reg/o;"; $i++; } $compile_me.="\@matches}"; while (<DATA>) { chomp; print "\nmatches found for $_\n"; my @matches=eval $compile_me; foreach (@matches) { print $names[$_] , "\n" } } __DATA__ Fred_and_Friends fr.d Paul_and_co paul some_numbers \d{2} freud_likes_fred fr END CONFIG freud fred NaNa pauline 12312sdfsdf 2 [download] Update with speed test* I have now run a comparative test over 300^H^H^H, sorry 416 lines of log, with my 672 pattern matches. First using the eval of a string containing all the regexen and returning match index numbers as above. Second is my old naive code holding an array with the regexen and doing a foreach through it against each line. I did not use the /o for the reasons given above it works fine without it `>time ./Monitor.fast real 0m1.49s user 0m0.68s sys 0m0.58s >time ./Monitor real 0m19.47s user 0m14.69s sys 0m0.50s >` [download] I think the numbers speak for themselves Cheers, R.	[reply] [d/l] [select]
Re^3: Matching against list of patterns by Eyck (Priest) on Sep 17, 2004 at 07:09 UTC
Let me try decompile/understand what's going on here, One obvious improvement is using `/o`, right? The other is unwinding foreach loop into linear list of matches, what is the gain in that? We avoid walking the array, but I never thought that operation is that costly? Is this what's going on or am I missing something?	[reply] [d/l]
Re^4: Matching against list of patterns by Random_Walk (Prior) on Sep 17, 2004 at 08:40 UTC
Re^5: Matching against list of patterns by Eyck (Priest) on Sep 17, 2004 at 09:40 UTC
Some notes below your chosen depth have not been shown here
Re^2: Matching against list of patterns by Eyck (Priest) on Sep 16, 2004 at 13:42 UTC
This is very nice, but this works well with terms, not that well with regular expressions (ie, not 'foo' but something like `'f[o\|O]{1,3}'` instead). Also, this solves little because this only tells me that my line matches one of my patterns.. that is great, but the problem here is finding WHICH re it matches. I look at your answer, and I think I'm not clever enough to extend this kind of idea to those requirements. And another thing, how flexible and scalable are perl's res, what happens when there's 1000s of ~40 chars long REs ganged together, will regexp engine/optimizer sort through that?	[reply] [d/l]
Re^3: Matching against list of patterns by tachyon (Chancellor) on Sep 16, 2004 at 23:00 UTC
Remove the map quotemeta and you can use REs to your hears delight. The only issue is then extra \| in those REs. My example captures the physical match. It is a simple task to map that back to the pattern (or patterns) that match that if required. The benchmark at Re^2: Matching against list of patterns shows a full order of magnitude speed improvement so the optimiser does OK. cheers tachyon	[reply]
Re^4: Matching against list of patterns by Eyck (Priest) on Sep 17, 2004 at 09:50 UTC
Re: Matching against list of patterns by davorg (Chancellor) on Sep 16, 2004 at 12:29 UTC
Can you explain a little more about your data structures. It looks like you have hash references as the keys in your %$users hash - and that strikes me as a little scary (and it almost certainly doesn't do what you think it does). The solution to your problem is probably to create a large regex from all off the individual regexes and use that instead. -- <http://www.dave.org.uk> "The first rule of Perl club is you do not talk about Perl club." -- Chip Salzenberg	[reply]
Re^2: Matching against list of patterns by Eyck (Priest) on Sep 16, 2004 at 13:56 UTC
What is so scary about hashrefs? And I don't know how to create and use such beast, how can I figure out which re of `join "\|" , @res` matched? And wouldn't that strain our re engine too much (we're talking about scalability here, will hardware improvements keep such code running with number of patterns and lines to parse linearly growing? )	[reply] [d/l]
Re^3: Matching against list of patterns by davorg (Chancellor) on Sep 16, 2004 at 15:01 UTC
What is so scary about hashrefs? Absolutely nothing. But anything that you use as a hash key gets converted into a string. And that means that when you pull the keys back out you can no longer use them as hash references - which is what you seem to be doing. Can you show us the code that builds your %$users hash? -- <http://www.dave.org.uk> "The first rule of Perl club is you do not talk about Perl club." -- Chip Salzenberg	[reply]
Re: Matching against list of patterns by grinder (Bishop) on Sep 16, 2004 at 13:43 UTC
What do people do with such problems? [...] groups based on common prefix Yes, this is exactly what I am doing. If you don't care which pattern matched, just that one pattern among many matched, then you can do just that. I am in the middle of writing a module that will let you assemble an arbitrary number of regular expressions, and combine them into trie structure, and from there produce an expression that will produce a minimal amount of backtracking behaviour. (Things like `/a.b/` notwithstanding). The basic stuff works and tests okay, I'm adding code to shorten the length of the generated regular expression. The main problem with using a brute force solution of just `\|`ing the different patterns together is that on a target string that does not match, the engine will have to do a significant amount of work before working out that this is in fact the case. In the 5.10 TODO list, Nicholas Clark talks about the fact that it would be nice to have the RE compiler to do this for you, but having already battled doing it in Perl, I think I'd rather gouge my eyes out with a blunt stick than to recast the algorithm in C. ... and as far as I know (and have asked) no, no-one appears to have done this yet, which sort of surprised me. update:* urg! I missed the part about the fact that you did want to know which one matched. I'm using this in relation to spam, which is why I don't care what matched, just that something matched and therefore triggers a different course of events. Thinking about how to extend the approach, I think that all I need to do is to add code into the end of each RE with the `(?{ code })` construct that sets a variable to record which rule matched. Hmm, I'll put that in the TODO list, but first off I'd like to get the damned thing out the door. - another intruder with the mooring of the heat of the Perl	[reply] [d/l]
Re^2: Matching against list of patterns by Eyck (Priest) on Sep 16, 2004 at 14:01 UTC
I care about which pattern matched, that's the whole point Where can I find more about what you're working on?	[reply]
Re: Matching against list of patterns by BrowserUk (Patriarch) on Sep 16, 2004 at 15:40 UTC
Here's one way--not sure how it would scale though. Updated to remove a couple of artifacts. Read more... (5 kB) Output: P:\test>391416 users [barney wilma] selected: 'Fred Flintstone. He is a roly-poly ro +tund fellow. I always found it ' users [barney wilma] selected: 'amazing, that he wore the same outfit +day in and day out. Fred works ' users [barney wilma] selected: 'boulders. His boss is Mr. Slate. Fred +is the typical prehistoric man. ' users [barney wilma] selected: 'is, I will never know. haha. Fred is a +lso a member of a lodge and he ' users [barney wilma fred] selected: 'Wilma Flintstone. Wilma is marri +ed to Fred. She is a slim, willowy ' users [barney wilma fred] selected: 'up with fat Fred, I will never kn +ow. Wilma keeps a very clean ' users [barney wilma fred] selected: 'Wilma and Fred Flintstone have ne +ighbors, who happen to be their ' users [barney betty wilma fred] selected: 'Barney Rubble. He is short +er than Fred. He is Fred's best friend and ' users [barney betty wilma fred] selected: 'always ends up involved in +Fred's shenanigans. Barney rides to work ' users [barney betty wilma fred] selected: 'with Fred. This is not a go +od arrangement because Barney has had to ' users [barney betty wilma fred] selected: 'run after Fred on more than + one occasion. Barney is also married to a ' users [barney] selected: 'pretty woman. Her name is Betty.' users [barney betty fred] selected: 'Betty Rubble. She is taller than + Barney and weighs about 50 pounds ' users [fred] selected: 'day after day. Her best friend is Wilma, of co +urse. Their houses are ' users [barney fred] selected: 'side by side, so Betty is at Wilma's ho +use a lot and vice-versa. ' users [barney fred] selected: 'Neither Betty nor Wilma work for a livi +ng.' users [barney fred] selected: 'Wilma and Betty have one child each. Th +ey are:' users [barney wilma fred] selected: 'Pebbles. She is the daughter of +Wilma and Fred. She is a very cute ' users [wilma] selected: 'Pebbles is one of the happiest and smartest b +abies ever on television.' users [barney betty fred] selected: 'Bam-Bam. Barney and Betty did no +t give birth to Bam-Bam. Someone left' users [barney betty fred] selected: 'so happens, the night before, Bar +ney and Betty had wished for a child ' users [betty fred] selected: 'because they wanted to be parents. Bam-B +am is adopted by Barney and ' users [barney betty wilma] selected: 'Betty. He is a very strong littl +e baby. Of course, Bam-Bam and Pebbles ' users [barney wilma] selected: 'Fred Flintstone also has to endure liv +ing with animals also. They have ' users [barney wilma] selected: 'Fred and loves to greet him when he co +mes homes. Fred, on the other ' users [wilma] selected: 'just so happens to be a good baby sitter/watc +hdog for Pebbles.' users [barney betty fred] selected: 'Barney and Betty have a kangaroo +type animal named Hoppy. Hoppy and ' users [betty] selected: 'Dino are best friends. Hoppy also lets Bam-Ba +m get in his pouch.' users [barney betty wilma fred] selected: 'named Kazoo. He only can be + seen by Fred and Barney. He has helped ' users [barney wilma] selected: 'Fred get out of more than one fix. In +other episodes, when Fred needed ' users [barney wilma] selected: 'Fred drives his car with foot power. I + was always amazed at the ' [download] Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "Think for yourself!" - Abigail "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon	[reply] [d/l] [select]
Re: Matching against list of patterns by hv (Prior) on Sep 16, 2004 at 14:07 UTC
There are a couple of ways to discover what pattern matched. The easiest is probably to use backrefs: `my $pattern = join '\|', map "($_)", @patterns; ... if ($line =~ /$pattern/i) { my $matchref = $#-; print "matched $pattern[$matchref - 1]\n"; }` [download] As warned by other posters, you may need to play with the ordering to ensure that the preferred pattern matches if more than one can, in which case you will also need a way to get back from the index of the matched backreference to the actual pattern. Hugo	[reply] [d/l]
Re^2: Matching against list of patterns by Schuk (Pilgrim) on Nov 17, 2004 at 12:42 UTC
To get the ~~matched pattern~~ matched text you can simply do: `my $pattern = join '\|', map "($_)", @patterns; ... if ($line =~ /($pattern)/i) { print "matched $1\n"; }` [download] But be carefull if you want to get another matched text in this line. Somewhat it eats up all $1,$2,$3. Use $+ to get the last matched text of the line. Schuk Edit: Sorry HV I should learn to read more carefully.	[reply] [d/l]
Re^3: Matching against list of patterns by hv (Prior) on Nov 18, 2004 at 13:07 UTC
That shows you what text matched, not which pattern matched it. There's nothing wrong with it as a way of finding out what text matched, but in the node you replied to I was specifically trying to address this aspect of the root node: the whole point of this excercise is to figure out WHICH regexp matched. Hugo	[reply]
Re: Matching against list of patterns by diotalevi (Canon) on Sep 16, 2004 at 15:54 UTC
For this matching, is there only one user per line? If a user is found on a line, is it then not ever found again? I've written samples for you which answer those questions differently. There may be ways to optimize each of these for your specific problem. It just depends on which problem you are solving. There is a bug in your original code - you said `keys %$users` and then dereferenced the key directly like `$user->{'Pattern'}`. `$user` is a plain string so that is a symbolic reference. Using strict would have caught that bug for you. You meant to write `$users->{ $user }{ 'Pattern' }` which properly looks up the value named `$user` in the hash reference `$users`. There is a potential bug depending on your data. The string `"aa"` matches `"a"` and `"aa"`. If you ask for only the first match, then the more complete, perhaps more correct match will not be attempted if you stop. You may need to adjust your logic to account for the length of the match to see which pattern matched "better". None of my examples correct for this. Each line may match multiple users and once found, are not looked for anymore. This may be be the fastest because it can reduce the search space by multiples with each iteration. # Precompile all the patterns and store them into the key # CompiledPattern $_->{'CompiledPattern'} = qr/$_->{'Pattern'}/i for values %$users; my %unmatched_users; @unmatched_users{ keys %$users } = (); while ( my $line = <> ) { ... my @users = grep $line =~ $users->{$_}{'CompiledPattern'}, keys %unmatched_users; if ( @users ) { warn "Great, we found " . join( ', ', map $_->{'Pattern'}, @{$users}{ @users } ) . " user(s)!\n"; delete @unmatched_users{ @users }; } else { warn "$line didn't match any users.\n"; } } [download] Each line may match one user. Once a user is found, it is not looked for anymore. This may be be the fastest because it reduces the search space with each successful match and if any match is found, stops looking for any more. use List::Util 'first'; # Precompile all the patterns and store them into the key # CompiledPattern $_->{'CompiledPattern'} = qr/$_->{'Pattern'}/i for values %$users; my %unmatched_users; @unmatched_users{ keys %$users } = (); while ( my $line = <> ) { ... my $user = first { $line =~ $users->{$_}{'CompiledPattern'} } keys %unmatched_users; if ( defined $user ) { warn "Great, we found pattern $user->{'Pattern'}!\n"; delete $unmatched_users{ $user }; } else { warn "$line didn't match any users.\n"; } } [download] Each line may match one user but users may be found on multiple lines. The search space remains constant. `# Precompile all the patterns and store them into the key CompiledPatt +ern $_->{'CompiledPattern'} = qr/$_->{'Pattern'}/i for values %$users; while ( my $line = <> ) { ... my $user = first { $line =~ $users->{$_}{'CompiledPattern'} } keys %$users; if ( $user ) { warn "Great, we found pattern $user->{'Pattern'}!\n"; } else { warn "$line didn't match any users.\n"; } }` [download] Each line may match multiple users and users may be found on multiple lines. This is the worst case sample you already had. `# Precompile all the patterns and store them into the key # CompiledPattern $_->{'CompiledPattern'} = qr/$_->{'Pattern'}/i for values %$users; while ( my $line = <> ) { ... my @users = grep $line =~ $users->{$_}{'CompiledPattern'}, keys %$users; if ( @users ) { warn "Great, we found " . join( ', ', map $_->{'Pattern'}, @{$users}{ @users } ) . " user(s)!\n"; } else { warn "$line didn't match any users.\n"; } }` [download]	[reply] [d/l] [select]


The stupid question is the question not asked
	PerlMonks