Beefy Boxes and Bandwidth Generously Provided by pair Networks
Your skill will accomplish
what the force of many cannot
 
PerlMonks  

Closest matches from string array

by Baz (Friar)
on Oct 27, 2003 at 12:08 UTC ( [id://302372]=perlquestion: print w/replies, xml ) Need Help??

Baz has asked for the wisdom of the Perl Monks concerning the following question:

Hi,
I have a string array of arbitary length of lets say 14 -
McGee
MacGee
Magee
MacGeady
Mackintosh
McIntosh
Griffin
Griffith
Griffis
Griffey
Grifferty
McGrifferty
O'Griffey
O'Griffin
I want to make it possible for the client to enter a name and for perl to return a match and also similar names. So lets say I enter Griffin, I would like -
Griffin
Griffith
Griffis
Griffey
Grifferty
McGrifferty
O'Griffey
O'Griffin
returned. If I enter MacGee, I would have -

McGee
MacGee
Magee

returned. If I entered McGinley I might get -

McGee
MacGee
Magee
MacGeady
Griffin
O'Griffin

returned. Perhaps this search would work by matching say 50% of the letters in the search string, in order maybe. Any suggestions as to how I might do this? I'm extracting the array as a columb from a mysql database.
Thanks,
Barry.

Replies are listed 'Best First'.
Re: Closest matches from string array
by tachyon (Chancellor) on Oct 27, 2003 at 12:21 UTC

    Soundex is a good keyword to search for and there is Text::Soundex to make life easy. You would want to add a soundex field to your DB, index it then soundex the search term. Fuzzy matching on names is something it does pretty well (for English). You may also find Re: Duplicate detection (SQL) and the associated threads like Module for comparing text and Closest match Display of interest for a number of other ways to do approximate matching.

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

      I'll back up what tachyon said about Text::Soundex, with one caveat: it depends on your data. I looked into this for something at work, but as we are a medical company, some of the terminology is ... how do I put this ... not phonetically straightforward. There are a lot of Greek and Latin chunks in our terms, sometimes Greek and Latin and English components are intermingled in the same word. And English is still more or less a Germanic language. Sorta.

      If you are dealing with last names, as your example implies, those can vary even more widely. I would suggest that you familiarize yourself with the data, if possible, before you make a decision, but 1) we all know to do that whatever the project, and 2) it isn't always possible.

      --
      tbone1
      Ain't enough 'O's in 'stoopid' to describe that guy.
      - Dave "the King" Wilson

        Coming from a medical background I know just what you mean. Not only that mish mash of Greek and Latin but nowadays the drug company marketing guys just make stuff up cause it sounds good. Loperamide, Ondansetron, Vasocardil, Isocover, Vancomycin, Tofranil, Xanax, Celebrex and who could have missed Viagra (good for spam at least it seems :-)

        cheers

        tachyon

        s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: Closest matches from string array
by tachyon (Chancellor) on Oct 27, 2003 at 12:55 UTC

    Cause it's fun (aka I like fuzzy logic).....

    use Text::Soundex; my @names = qw( McGee MacGee Magee MacGeady Mackintosh McIntosh Griffin Griffith Griffis Griffey Grifferty McGrifferty O'Griffey O'Griffin ); my %hash; $hash{$_} = soundex($_) for @names; printf "%-15s => %s\n", $_, $hash{$_} for sort keys %hash; my @tests = qw( Griffin McGee McGinley Smith ); for my $name( @tests ) { my $soundex = soundex($name); # you can make the search fuzzy in different ways..... my $bit_fuzzy = substr $soundex, 0, 2; my $mid_fuzzy = substr $soundex, 1, 2; print "\nTesting $name ($soundex) ($bit_fuzzy) ($mid_fuzzy)\n\n"; for my $test( keys %hash ) { print "\t$test\n" if $hash{$test} eq $soundex; } print $/; for my $test( keys %hash ) { print "\t$test (bit fuzzy)\n" if $hash{$test} =~ m/$bit_fuzzy. +./; } print $/; for my $test( keys %hash ) { print "\t$test (mid fuzzy)\n" if $hash{$test} =~ m/.$mid_fuzzy +./; } print $/; } __DATA__ Grifferty => G616 Griffey => G610 Griffin => G615 Griffis => G612 Griffith => G613 MacGeady => M230 MacGee => M200 Mackintosh => M253 Magee => M200 McGee => M200 McGrifferty => M261 McIntosh => M253 O'Griffey => O261 O'Griffin => O261 Testing Griffin (G615) (G6) (61) Griffin Griffin (bit fuzzy) Griffis (bit fuzzy) Grifferty (bit fuzzy) Griffith (bit fuzzy) Griffey (bit fuzzy) Griffin (mid fuzzy) Griffis (mid fuzzy) Grifferty (mid fuzzy) Griffith (mid fuzzy) Griffey (mid fuzzy) Testing McGee (M200) (M2) (20) McGee MacGee Magee McGrifferty (bit fuzzy) McGee (bit fuzzy) MacGee (bit fuzzy) Magee (bit fuzzy) Mackintosh (bit fuzzy) McIntosh (bit fuzzy) MacGeady (bit fuzzy) McGee (mid fuzzy) MacGee (mid fuzzy) Magee (mid fuzzy) Testing McGinley (M254) (M2) (25) McGrifferty (bit fuzzy) McGee (bit fuzzy) MacGee (bit fuzzy) Magee (bit fuzzy) Mackintosh (bit fuzzy) McIntosh (bit fuzzy) MacGeady (bit fuzzy) Mackintosh (mid fuzzy) McIntosh (mid fuzzy) Testing Smith (S530) (S5) (53)

    As you will note from the results the answers are close to what you want. They do highlight some logical issues in your thinking. In one case you effectively demand ignoring the first letter, in the next you assign it importance. No matter what you use for approximation things will get weird if the first letter is WRONG as a few of the mid_fuzzy results show. Also note that if you go for a 2 digit fuzzy match as shown then you will (on average) pull 1/10*10 ie 1% of your DB every time for mid_fuzzy and 1/26*10 ie 0.4% or your DB with the 2 char bit fuzzy. If you only have 1000 records this is probably not a problem. It is a problem if you have 100,000 - 1,000,000 odd records as a list of 1000-10,000 possibilities is a bit overwhelming. If you want a fuzzy search you would typically have a fuzzometer to let the client make the search progressively more fuzzy the more desperate they get to find whatever!

    Combining two concurrent fuzzy searches is a potent technique. Even if each search pulls 1% of the DB the union will be much smaller so you will typically reduce the result set by 1,2,3 orders of magnitude. Fuzzy last name and initial would reduce the result set by a factor of roughly 26 (more like 20 but you get the idea).

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: Closest matches from string array
by Taulmarill (Deacon) on Oct 27, 2003 at 12:20 UTC
Re: Closest matches from string array
by Abigail-II (Bishop) on Oct 27, 2003 at 12:18 UTC
    This is a very open question, and it's not very well defined (you say '50% of the letters', but "McGinley" and "O'Griffin" both have 8 letters, but share only 3). A discussion of what is a good match and what isn't is not Perl related, so I won't go into that. However, you might want to look at the 'Text::Soundex' module, and see whether that will be of any help to you.

    Why this modules comes with the standard distribution is a mystery to me, but it makes it easy to check out.

    Abigail

      This one has worked for me, and allows control over amount of approximate matching.

      String::Approx
Re: Closest matches from string array
by BrowserUk (Patriarch) on Oct 27, 2003 at 14:47 UTC

    Soundex is probably your kiddy for this, but you might get better results if you stripped common prefixes like Mc, Mac, O' and maybe a few others. That would probably allow you to tighten the match and still not miss likely hits.

    You said this was coming from a DB. You ought to consider storing the Soundex values within the table as an alternate key. That would save regenerating the Soundex each time, and allow you to use SQL to only pull the likely candidates, rather than pulling them all and doing the selection yourself.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    Hooray!

      Aye laddie, the heathen Scots ave a lot te answer fur. 'Mace' =~ s/ma?c/i :-) Oi who is 'e'

      cheers

      tachyon

      s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

        I was thinking of something along the lines of

        $soundex{ $name } = soundex($1) if $name =~ m[^(?:O'|Mac|Mc)([A-Z].*$)];

        Which would only work if the list is nicely capitalised, which going by the few BazB posted, it seems that his data might be. The idea was that a search for Connor or Keefe, would also find O'Connor and O'Keefe, which using unassisted soundex wouldn't find.

        #! perl -slw use strict; use Text::Soundex; printf "%20s : %s : %s\n", $_, m[^(?:O'|Mac|Mc)([A-Z].*$)] ? soundex($1) : soundex($_), so +undex($_) for qw[ Connor O'Connor Keefe O'Keefe MacDonald McDonald Donald Donaldson O'Donnell Donagal O'Dona +gal ]; __END__ P:\test>test Connor : C560 : C560 O'Connor : C560 : O256 Keefe : K100 : K100 O'Keefe : K100 : O210 MacDonald : D543 : M235 McDonald : D543 : M235 Donald : D543 : D543 Donaldson : D543 : D543 O'Donnell : D540 : O354 Donagal : D524 : D524 O'Donagal : D524 : O352

        The first column of soundex codes are the assisted ones, the second unassisted. You can see what a difference it makes.

        That said. It would screw up sound alikes Magee and MacGee, so it may not be such a good idea. Another possibility would be to store two codes for some names, but then you move into needing to normalise the soundex codes into another table. Which wouldn't be a bad thing, but does add complication.


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail
        Hooray!

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://302372]
Approved by gjb
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others musing on the Monastery: (11)
As of 2024-04-19 16:31 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found