PerlMonks  

Re: how to find hash keys in a string ?

by JaWi (Hermit)
on May 03, 2003 at 10:20 UTC


in reply to how to find hash keys in a string ?

I've used the following trick several times (update: see also bart's reply below for some necessary additional pointers):
use strict;

my %temphash = ( "tom" => "good", "dick" => "bad", "hary" => "ugly" );
my $temp = "This is a string containing tom and dick";

# Build one alternation out of all the hash keys
my $regexp = join('|', keys %temphash);

# Walk through the string, printing every key that occurs in it
print "Found: [$1]\n" while ( $temp =~ /($regexp)/cg );
Which results in:
Found: [tom]
Found: [dick]

-- JaWi

"A chicken is an egg's way of producing more eggs."

Replies are listed 'Best First'.
Re: Re: how to find hash keys in a string ?
by bart (Canon) on May 03, 2003 at 10:42 UTC
    That's basically how I would have done it myself, so no bad word from me there. :] Well, some remarks:
    1. I think this likely should only try to match whole words, so it won't try to match "tom" inside "stomach". For that, add '\b' anchors:
      /\b($regexp)\b/
      with whatever modifiers you like. If this regexp won't ever change, I'd add the /o modifier.
    2. If the "words" can contain special characters, the words should be quotemeta'ed before being added to the regexp:
      my $regexp = join('|', map quotemeta, keys %temphash);
    3. If the number of words to match can get big, it might be worthwhile to find a way to construct a cleverer regexp from this word list. There's a useful module on CPAN that does just that: Regex::PreSuf. To quote the description from the docs:
      The presuf() subroutine builds regular expressions out of 'word lists', lists of strings. The regular expression matches the same words as the word list. These regular expressions normally run faster than a simple-minded '|'-concatenation of the words.
      The larger the word list, the greater the gain is likely to be.
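Remarks 1 and 2 above can be combined in one short sketch. The sub name find_keys and the sample sentence are made up for illustration; the hash is the one from the parent post.

```perl
use strict;
use warnings;

# Hypothetical helper: return every hash key that occurs in $text as a whole word.
sub find_keys {
    my ($text, $hash) = @_;

    # quotemeta escapes regex metacharacters, so each key is matched literally
    my $alternation = join '|', map { quotemeta } keys %$hash;
    return () unless length $alternation;   # an empty pattern would match everywhere

    # \b on both sides keeps "tom" from matching inside "stomach"
    my $re = qr/\b($alternation)\b/;

    my @found;
    push @found, $1 while $text =~ /$re/g;
    return @found;
}

my %temphash = ( "tom" => "good", "dick" => "bad", "hary" => "ugly" );
print "Found: [$_]\n"
    for find_keys("tom has a stomach ache, says dick", \%temphash);
```

This should report tom and dick, and skip the "tom" inside "stomach". Note that \b only acts as a word boundary around a key that itself starts and ends with a word character; a key such as "c++" would need a different kind of anchor.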
      bart,
      I have found cases where, no matter how good you made the RE, if there were a lot of things to match and the data to match against was also large, the combined RE was never faster than matching each piece individually. I offer this alternative, which uses your suggestions.
      #!/usr/bin/perl -w
      use strict;

      my %temphash = ( 't[0-3]om' => "good", "dick" => "bad", "hary" => "ugly" );
      my $temp = 'This is a string containing t[0-3]om and dick';

      # Test each key separately; a successful match returns the captured text
      my @results = map { my $var = quotemeta $_; $temp =~ /\b($var)\b/ } keys %temphash;
      print "$_ matched\n" foreach (@results);
      In this very small example, I do not know if this method would be faster - but hey - it is nice to have in the tool box.

      Cheers - L~R

      Update: See this for an example of how to combine this method with a sub to make the process even faster if you have to reuse it.
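One way the per-key approach might be wrapped in a reusable sub is sketched here. This is not the linked example, which is not reproduced in this thread; the name make_matcher and the closure design are assumptions. The idea is to compile each key's pattern once with qr// and reuse it for every string that has to be checked.

```perl
use strict;
use warnings;

# Hypothetical helper: compile one whole-word pattern per key, once,
# and return a closure that reports which keys occur in a given string.
sub make_matcher {
    my @keys = @_;

    # \Q...\E quotes metacharacters the same way quotemeta does
    my %re_for = map { $_ => qr/\b\Q$_\E\b/ } @keys;

    return sub {
        my ($text) = @_;
        # Test each key individually, keeping the keys in their original order
        return grep { $text =~ $re_for{$_} } @keys;
    };
}

my $matcher = make_matcher('t[0-3]om', 'dick', 'hary');
print "$_ matched\n"
    for $matcher->('This is a string containing t[0-3]om and dick');
```

Whether this beats a single combined alternation depends on the number of keys and the size of the data, so it is worth timing both with the Benchmark module on realistic input.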

      Thanks a lot, guys.
      Just one question though: is '\b' supposed to match 'tom' in 'stomach'? (I'm not sure about this.) Isn't '\b' supposed to match things ending in word boundaries?
        Yes, ending and beginning with word boundaries, i.e. a transition from a word character (letter, digit or underscore) to either a non-word character or the beginning or end of the whole string. So /\btom\b/ will not match in "stomach"; it will match in "foo-tom-bar" but not in "tom33".
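The boundary rule is easy to check with a few test strings. A minimal sketch, where the sub name has_tom is invented for the demonstration:

```perl
use strict;
use warnings;

# Hypothetical helper: true if "tom" occurs in $text as a whole word.
sub has_tom {
    my ($text) = @_;
    return $text =~ /\btom\b/ ? 1 : 0;
}

for my $text ('foo-tom-bar', 'tom33', 'stomach', 'tom') {
    printf "%-12s %s\n", $text, has_tom($text) ? 'matches' : 'no match';
}
```

Only "foo-tom-bar" and "tom" should report a match: "-" is a non-word character and the ends of the string count as boundaries, while the "3" in "tom33" and the "s" in "stomach" are word characters.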
