http://qs321.pair.com?node_id=101878

Boudicca has asked for the wisdom of the Perl Monks concerning the following question:

Greetings, noble monks -- I have a quandary:

Say you have a string and a few patterns to try matching, something like this (it's dinky, please forgive):

$string = "clintonesque";
$m1 = "Clinton";
$m2 = "Bush";
$m3 = "Reagan";
if ($string =~ m/($m1|$m2|$m3)/i) {
    print "Got a match.\n";
} else {
    print "No match.\n";
}

What I want to know is _which_ of those three things ($m1, $m2, $m3) it matched first. Can it be done, or am I delusional?

I should note that I'd like to put all of the match patterns into one monster regexp, instead of if/else-ing my way through a series of them. (Honestly, I don't know why; this was a suggestion from a more experienced engineer, in the interest of conserving computer juice with some pre-compiling.)

The FAQs/resources don't offer any aid, and other Perl wizards say it can't be done. I would be intensely grateful for any ideas...

Thanks!

Re: Returning _which_ pattern matched...?
by bikeNomad (Priest) on Aug 03, 2001 at 08:06 UTC
    No, it's quite simple. After the match, the variable $1 will be set to whatever matched your pattern -- which will be one of the three. Likewise for $2, etc. for later () sub-patterns. You'll still have to do something with that information, of course.
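
    A minimal sketch of the "$1, $2, etc." idea, assuming each alternative gets its own pair of parentheses so that the number of the defined capture group tells you which pattern matched. The helper name which_matched is hypothetical and not from the original post; quotemeta is used so the words are treated literally.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the 1-based index of the alternative
# that matched, or undef when nothing matched.
sub which_matched {
    my ($string, @words) = @_;

    # One capture group per alternative: (Clinton)|(Bush)|(Reagan)
    my $re = join '|', map { '(' . quotemeta($_) . ')' } @words;

    # In list context a successful match returns ($1, $2, $3, ...);
    # only the group belonging to the matching alternative is defined.
    my @groups = $string =~ /$re/i;
    return undef unless @groups;

    for my $i (0 .. $#groups) {
        return $i + 1 if defined $groups[$i];
    }
    return undef;
}

my $n = which_matched('clintonesque', 'Clinton', 'Bush', 'Reagan');
print defined $n ? "Matched pattern $n\n" : "No match.\n";
```

    With the sample string from the question this reports pattern 1. Note that $1 in the original single-group regex holds the matched text ("clinton"), not the pattern variable, so a lookup like this is one way to get back to $m1, $m2 or $m3.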
Re: Returning _which_ pattern matched...?
by Hofmator (Curate) on Aug 03, 2001 at 13:47 UTC

    bikeNomad already answered your question; let me just elaborate a little on some of your statements.

    • (Honestly, I don't know why; this was a suggestion from a more experienced engineer, ...)
      You should always try to see the logic behind such statements. In your case of matching different patterns against a string, this solution might be more efficient (to shorten my code, consider $string to be in $_):
          if (/($m1)/ || /($m2)/ || /($m3)/)
      This is quicker than your big regex if you know that $m1 will be more frequent than $m2 (which in turn is more frequent than $m3).
      The explanation is simple: in your case the regex engine has to check all three patterns at every position in the string. With the || construct, checking the $m2 and $m3 patterns might not be necessary. So you see, it pays to know your data. (Remark: in general, the size of a regex says nothing about its execution time; only the compile time increases. Execution time depends on the content!)
    • If speed is an issue, consider the /o modifier and qr//. Furthermore you might try to lowercase the whole string and the patterns instead of matching /i. And studying the string might also help.
    • For all this advice, the 'correct' way can be found out by using Benchmark on some real data. By doing that you can find out yourself what works when and don't have to rely on some 'experts'.
    • The FAQs/resources don't offer any aid, and other Perl wizards say it can't be done.
      Just to show you TMTOWTDI (not that I recommend this):
          if (/($m1(?{$match=1})|$m2(?{$match=2})|$m3(?{$match=3}))/) { ...
      You can execute code from within the regex. This is a really bad example of this powerful feature, but it is a way to do it :)
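
    A rough sketch of the qr// suggestion, assuming the three names from the question. The helpers first_in_list and leftmost_in_string are hypothetical names used only for illustration. It also shows a difference between the two styles that is easy to miss: the || chain (or an ordered loop) reports the first pattern in the list that matches anywhere, while the single alternation reports whichever pattern matches earliest in the string.

```perl
use strict;
use warnings;

my @names = qw(Clinton Bush Reagan);

# Compile each pattern once with qr//; \Q...\E keeps the names literal.
my @compiled = map { qr/\Q$_\E/i } @names;

# Hypothetical helper: try the patterns in list order and
# return the first name whose pattern matches anywhere.
sub first_in_list {
    my ($string) = @_;
    for my $i (0 .. $#compiled) {
        return $names[$i] if $string =~ $compiled[$i];
    }
    return undef;
}

# Hypothetical helper: one big alternation; returns the text of
# whichever alternative starts earliest in the string.
sub leftmost_in_string {
    my ($string) = @_;
    my $alt = join '|', @compiled;
    return $string =~ /($alt)/ ? $1 : undef;
}

my $text = 'Reagan preceded Bush and Clinton';
print first_in_list($text), "\n";        # Clinton
print leftmost_in_string($text), "\n";   # Reagan
```

    Which of the two is faster on real data is exactly the kind of question Benchmark is meant to settle; the sketch only shows that they can give different answers.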

    -- Hofmator

Re: Returning _which_ pattern matched...?
by lestrrat (Deacon) on Aug 03, 2001 at 18:53 UTC

    If you happen to have a lot of patterns to match, I like the "build a sub on the fly" approach:

    my @patterns = qw/ foo bar baz /;

    ## Each pattern is wrapped in parentheses so that $1 holds the matched text.
    ## If you need to return the pattern itself, return '$_' instead.
    my $match_n_return = join( "\n",
        map { "return \$1 if \$_[0] =~ /($_)/o;" } @patterns );

    my $sub = eval qq|sub {
        $match_n_return
        return undef
    }|;
    if( $@ ) {
        ## some syntax error in the eval string
        die $@
    }

    my $matched = $sub->( "string you want to match" );

    The only problem here is that you don't get fine-grained control over each match. But I like the fact that this can be used over and over, and that since all the regexes are pre-compiled, you save some time on repeated use.

Re: Returning _which_ pattern matched...?
by scain (Curate) on Aug 03, 2001 at 19:02 UTC
    You could also consider a more difficult example where you are trying to match lines of data that fall into different "types", so you have different regexes to catch them, even though you are trying to collect the same data regardless of the form (clear as mud so far, right?).

    Let's just say that you end up with three different regexes. You decide to write three different ones because it would have made the single combined regex a god-awful mess. Good. Now, if you use Hofmator's suggestion, you'll get something like this:

    if ($string =~ /re1/ || $string =~ /re2/ || $string =~ /re3/)
    where $1 will be populated with what you are looking for, but you still may not know which of re1, 2 or 3 matched. In this case I would use a bigger if. If you want to know which regex matched, it is presumably because you want to do different things with the results. So put them all in if..else's:
    if ( $string =~ /re1/ ) {
        # do thing 1
    } elsif ( $string =~ /re2/ ) {
        # do thing 2
    } elsif ( $string =~ /re3/ ) {
        # do thing 3
    }
    It is bulkier than other suggestions, but I suspect it is a more generally useful answer to your question.
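
    When the if..elsif chain grows, one possible variation is to keep the regexes in an ordered list and loop over it. This is only a sketch: the three line formats, the labels and the helper name classify are invented for illustration and are not from the thread.

```perl
use strict;
use warnings;

# Hypothetical line formats: each rule pairs a label with a compiled
# regex that captures a key and a value.
my @rules = (
    [ equals => qr/^(\w+)\s*=\s*(\S+)$/ ],
    [ colon  => qr/^(\w+):\s*(\S+)$/    ],
    [ tabbed => qr/^(\w+)\t(\S+)$/      ],
);

# Returns (label, key, value) for the first rule that matches,
# or an empty list when no rule matches.
sub classify {
    my ($line) = @_;
    for my $rule (@rules) {
        my ($label, $re) = @$rule;
        if ( my ($key, $value) = $line =~ $re ) {
            return ($label, $key, $value);
        }
    }
    return;
}

my ($label, $key, $value) = classify('colour: red');
print "$label: $key => $value\n";   # colon: colour => red
```

    The label plays the role of "which regex matched", and the captured data comes back in the same shape whichever rule fired.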

    Scott

(dkubb) Re: (2) Returning _which_ pattern matched...?
by dkubb (Deacon) on Aug 03, 2001 at 20:33 UTC

    Remember to use quotemeta or \Q and \E on any variables you are going to use inside a regex. If you forget to, and any special regex meta-characters exist in your variables, your program can do anything from matching unwanted strings to dying outright.

    I see this problem in a lot of code where the program "builds" a regex. Bugs caused by forgetting to do this can go undetected for a long time until a "+", "|" or a ")" occurs somewhere in the incoming data (assuming the data being searched through is not hard-coded into the script, and thus its content is not controlled by the programmer, as is often the case in the real world).

    Here's some sample code to illustrate the use of quotemeta in solving your problem:

    #!/usr/bin/perl -w
    use strict;

    use constant STRING   => 'clintonesque';
    use constant TO_MATCH => qw( Clinton Bush Reagan );

    my $regex = join '|', map { quotemeta } TO_MATCH;

    my ($first_match) = STRING =~ /($regex)/i;

    print $first_match;
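
    To make the two failure modes concrete, here is a small sketch comparing an unquoted and a quoted interpolation. The helper names matches_raw and matches_quoted and the sample strings are made up for the illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: interpolate $word as-is. The eval traps the
# compile error that an unbalanced meta-character causes at run time.
sub matches_raw {
    my ($string, $word) = @_;
    my $hit = eval { $string =~ /$word/ ? 1 : 0 };
    return defined $hit ? $hit : 'error';
}

# Hypothetical helper: \Q...\E (quotemeta) makes $word match literally.
sub matches_quoted {
    my ($string, $word) = @_;
    return $string =~ /\Q$word\E/ ? 1 : 0;
}

print matches_raw('abc', 'a.c'), "\n";       # 1     (unwanted: '.' matched 'b')
print matches_quoted('abc', 'a.c'), "\n";    # 0
print matches_raw('f(x)', 'f(x'), "\n";      # error (unmatched parenthesis)
print matches_quoted('f(x)', 'f(x'), "\n";   # 1
```

    Without the eval, the third call would kill the program, which is the "dying" case described above.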