how to find hash keys in a string ?

by Anonymous Monk
on May 03, 2003 at 09:44 UTC

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Greetings monks,
let's say I have a hash:
 my %temphash = ( "tom" => "good", "dick" => "bad", "hary" => "ugly");
and I have a string $temp = "This is a string containing tom and dick"
How can I check this string to see if it contains any keys from %temphash?
thanks in advance..
/van

Replies are listed 'Best First'.
Re: how to find hash keys in a string ?
by JaWi (Hermit) on May 03, 2003 at 10:20 UTC
    I've used the following trick several times (update: see also bart's reply below for some necessary additional pointers):
    use strict;

    my %temphash = ( "tom" => "good", "dick" => "bad", "hary" => "ugly" );
    my $temp = "This is a string containing tom and dick";

    my $regexp = join('|', keys %temphash);
    print "Found: [$1]\n" while ( $temp =~ /($regexp)/cg );
    Which results in:
    Found: [tom]
    Found: [dick]

    -- JaWi

    "A chicken is an egg's way of producing more eggs."

      That's basically how I would have done it myself, so no bad word from me there. :] Well, some remarks:
      1. I think this likely should only try to match whole words, so it won't try to match "tom" inside "stomach". For that, add '\b' anchors:
        /\b($regexp)\b/
        with whatever modifiers you like. If this regexp won't ever change, I'd add the /o modifier.
      2. If the "words" can contain special characters, the words should be quotemeta'ed before being added to the regexp:
        my $regexp = join('|', map quotemeta, keys %temphash);
      3. If the number of words to match can get big, it might be worthwhile to find a way to construct a cleverer regexp from this wordlist. There's a useful module on CPAN that does just that: Regex::PreSuf. To quote the description from the docs:
        The presuf() subroutine builds regular expressions out of 'word lists', lists of strings. The regular expression matches the same words as the word list. These regular expressions normally run faster than a simple-minded '|'-concatenation of the words.
        The larger the wordlist, the higher the gain likely will be.
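The first two remarks can be combined into one small sketch. This is only an illustration under assumptions: the helper name find_keys and the sample sentence are made up here, and the keys are assumed to start and end with word characters (otherwise '\b' will not anchor where you expect).

```perl
use strict;
use warnings;

my %temphash = ( "tom" => "good", "dick" => "bad", "hary" => "ugly" );

# Escape each key, then wrap the alternation in \b anchors so that
# only whole words match.
my $alternation = join '|', map { quotemeta($_) } keys %temphash;
my $regexp      = qr/\b($alternation)\b/;

# find_keys is a hypothetical helper: it returns every key found in
# the text, in the order the matches occur.
sub find_keys {
    my ($text) = @_;
    my @found;
    push @found, $1 while $text =~ /$regexp/g;
    return @found;
}

print "Found: [$_]\n" for find_keys("tom has a stomach ache, says dick");
```

With the anchors in place, "tom" is reported once for the standalone word and not a second time for "stomach", because there is no word boundary between "s" and "t".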
        bart,
        I have found cases where, no matter how good you made the RE, if there were a lot of things to match and the data to match against was also large, it was never faster than matching each piece individually. I offer this alternative, which uses your suggestions.
        #!/usr/bin/perl -w
        use strict;

        my %temphash = ( 't[0-3]om' => "good", "dick" => "bad", "hary" => "ugly" );
        my $temp = 'This is a string containing t[0-3]om and dick';

        my @results = map { my $var = quotemeta $_; $temp =~ /\b($var)\b/ } keys %temphash;
        print "$_ matched\n" foreach (@results);
        In this very small example, I do not know if this method would be faster - but hey - it is nice to have in the tool box.

        Cheers - L~R

        Update: See this for an example of how to combine this method with a sub to make the process even faster if you have to reuse it.

        Thanks a lot guys.
        Just one question though: is '\b' supposed to match 'tom' in 'stomach'? (I'm not sure about this.) Isn't '\b' supposed to match things ending in word boundaries?
Re: how to find hash keys in a string ?
by Dr. Mu (Hermit) on May 03, 2003 at 19:42 UTC
    Another method not involving regular expressions -- well, okay, ignoring the one in split -- is the following:
    use strict;

    my %temphash = ("tom" => "good", "dick" => "bad", "hary" => "ugly");
    my $string = 'This string contains tom and dick';

    my @occurrences = grep { exists $temphash{$_} } split(/\s/, $string);
    print scalar @occurrences;
    If you want to know where the matches occur, you can use map instead of grep.

    Is this faster than the regexp method? I'm not sure. The hash lookups are certainly quick enough...
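A rough sketch of the map variant mentioned above, assuming "where" means the zero-based position of the word in the string. The names @words and @positions are made up for this illustration, and split ' ' is used so that stray whitespace does not produce empty fields.

```perl
use strict;
use warnings;

my %temphash = ( "tom" => "good", "dick" => "bad", "hary" => "ugly" );
my $string   = 'This string contains tom and dick';

my @words = split ' ', $string;

# grep keeps the matching words themselves.
my @occurrences = grep { exists $temphash{$_} } @words;

# map over the indices instead to record where each match occurs
# (zero-based word positions); non-matching words contribute nothing.
my @positions = map { exists $temphash{ $words[$_] } ? $_ : () } 0 .. $#words;

print scalar(@occurrences), " keys found at word positions @positions\n";
```

For the sample string this reports two keys, at word positions 3 and 5.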

      Excellent++. With 100 keys looked up in a string that contains them all in random order, the big regex beats the map option, but your invert-the-logic hash lookup wins easily (2.5x quicker than the regex).

      By the time you get to 1,000 keys, the map overtakes the regex but yours is an even clearer winner (nearly 30x faster).

      By the time you get to 10,000 keys, the margin is over 250x as fast.

      Results

            s/iter   regex     map    hash
      regex    119      --    -71%   -100%
      map     34.7    243%      --    -99%
      hash   0.443  26705%   7715%      --
      10000:10000:10000
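The benchmark code behind these figures is not shown here. A comparison along these lines could be set up with the core Benchmark module; the sketch below is an assumption about how one might do it, not the code that produced the table. The key names, the data size (200 keys) and the sub names by_regex, by_map and by_hash are all made up, and the timings will differ from machine to machine.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use List::Util qw(shuffle);

# Made-up test data: 200 keys, all present in the string in random order.
my @keys     = map { "key$_" } 1 .. 200;
my %temphash = map { $_ => 1 } @keys;
my $string   = join ' ', shuffle @keys;

# One big alternation, escaped and anchored on word boundaries.
my $alternation = join '|', map { quotemeta($_) } @keys;
my $big_regex   = qr/\b($alternation)\b/;

# Each sub returns the number of keys it found, so the three
# approaches can be checked against each other.
sub by_regex {
    my @found = $string =~ /$big_regex/g;
    return scalar @found;
}

sub by_map {
    my @found = map { my $k = quotemeta($_); $string =~ /\b($k)\b/ } keys %temphash;
    return scalar @found;
}

sub by_hash {
    my @found = grep { exists $temphash{$_} } split ' ', $string;
    return scalar @found;
}

# Run each candidate for at least one CPU second and print a comparison table.
cmpthese( -1, { regex => \&by_regex, map => \&by_map, hash => \&by_hash } );
```

Raising the number of keys and the length of the string is the obvious way to see how the three approaches scale against each other.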

      I think it's better to use the magical split ' ', rather than /\s/, as the latter will generate empty fields for consecutive whitespace, and even /\s+/ will generate one empty field if there is any leading whitespace.

      print join '|', split /\s/, '         the  quick    brown    fox     jumps ';
      |||||||||the||quick||||brown||||fox|||||jumps

      print join '|', split /\s+/, '         the  quick    brown    fox     jumps ';
      |the|quick|brown|fox|jumps

      print join '|', split ' ', '         the  quick    brown    fox     jumps ';
      the|quick|brown|fox|jumps


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
      Dr. Mu,
      Ok - I admit defeat. This method is certainly faster than mine. There are a couple of things I would like to point out though.

      You have chosen to work from the words in the string rather than the keys of the hash. If someone were really doing this type of analysis on a large data set, it might be worthwhile to have multiple methods and choose the best one based on the data set. For instance, imagine that there are only 3 hash keys, but the string is 1,500 words long. It certainly doesn't make sense to use this method.

      You have also assumed that the hash keys will not contain any spaces. My solution allows for this, but still has the requirement of having a word border on both ends. For instance:

      $temphash{'good boy'} = "blah";
      my $string = "He was a good boy when we went to the store";
      my $otherstring = "He is worthless, not a good boyfriend at all";
      Your method will not work in either case. My method will correctly match in $string but will correctly fail in $otherstring.
      Of course, knowing your data is what counts so that you can code for it.
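To make the difference concrete, here is a small sketch that runs both approaches over the two strings. The sub names regex_hits and split_hits are made up for this illustration; the data is the example from this post.

```perl
use strict;
use warnings;

my %temphash    = ( 'good boy' => "blah" );
my $string      = "He was a good boy when we went to the store";
my $otherstring = "He is worthless, not a good boyfriend at all";

# One regex per key, anchored on word boundaries: a space inside
# the key is matched literally.
sub regex_hits {
    my ($text) = @_;
    return map { my $k = quotemeta($_); $text =~ /\b($k)\b/ } keys %temphash;
}

# Word-by-word hash lookup: split never yields the two-word key,
# so it cannot be found this way.
sub split_hits {
    my ($text) = @_;
    return grep { exists $temphash{$_} } split ' ', $text;
}

for my $text ( $string, $otherstring ) {
    my @by_regex = regex_hits($text);
    my @by_split = split_hits($text);
    printf "regex: %d, split: %d\n", scalar @by_regex, scalar @by_split;
}
```

The regex approach finds the key once in $string and not at all in $otherstring, while the split-based lookup finds nothing in either.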

      My hat is off to you for such an innovative solution.

      Cheers - L~R

        All good points! -- except the admitting defeat part. ;-) The ever-unfolding wonder of PM is the diversity of approaches to any given problem and the discussions that they stimulate. We all win as a consequence. Frankly, I didn't really consider my method to be that innovative, hoping instead to employ (and learn about) hash slices in lieu of the grep -- but to no avail. Nonetheless, what resulted was different enough, and it worked -- subject to the very real limitations you point out. Thanks and ++ to BrowserUK for troubling to do the efficiency analysis that I was too lazy to perform!
