Closest matches from string array

Baz has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Closest matches from string array by tachyon (Chancellor) on Oct 27, 2003 at 12:21 UTC
Soundex is a good keyword to search for and there is Text::Soundex to make life easy. You would want to add a soundex field to your DB, index it then soundex the search term. Fuzzy matching on names is something it does pretty well (for English). You may also find Re: Duplicate detection (SQL) and the associated threads like Module for comparing text and Closest match Display of interest for a number of other ways to do approximate matching. cheers tachyon s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print	[reply]
Re: Re: Closest matches from string array by tbone1 (Monsignor) on Oct 27, 2003 at 12:41 UTC
I'll back up what tachyon said about Text::Soundex, with one caveat: it depends on your data. I looked into this for something at work, but as we are a medical company, some of the terminology is ... how do I put this ... not phonetically straightforward. There are a lot of Greek and Latin chunks in our terms, sometimes Greek and Latin and English components are intermingled in the same word. And English is still more or less a Germanic language. Sorta. If you are dealing with last names, as your example implies, those can vary even more widely. I would suggest that you familiarize yourself with the data, if possible, before you make a decision, but 1) we all know to do that whatever the project, and 2) it isn't always possible. -- tbone1 Ain't enough 'O's in 'stoopid' to describe that guy. - Dave "the King" Wilson	[reply]
Re: Re: Re: Closest matches from string array by tachyon (Chancellor) on Oct 27, 2003 at 13:05 UTC
Coming from a medical background I know just what you mean. Not only that mish mash of Greek and Latin but nowadays the drug company marketing guys just make stuff up cause it sounds good. Loperamide, Ondansetron, Vasocardil, Isocover, Vancomycin, Tofranil, Xanax, Celebrex and who could have missed Viagra (good for spam at least it seems :-) cheers tachyon s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print	[reply]
Re: Closest matches from string array by tachyon (Chancellor) on Oct 27, 2003 at 12:55 UTC
Cause it's fun (aka I like fuzzy logic)..... use Text::Soundex; my @names = qw( McGee MacGee Magee MacGeady Mackintosh McIntosh Griffin Griffith Griffis Griffey Grifferty McGrifferty O'Griffey O'Griffin ); my %hash; $hash{$_} = soundex($_) for @names; printf "%-15s => %s\n", $_, $hash{$_} for sort keys %hash; my @tests = qw( Griffin McGee McGinley Smith ); for my $name( @tests ) { my $soundex = soundex($name); # you can make the search fuzzy in different ways..... my $bit_fuzzy = substr $soundex, 0, 2; my $mid_fuzzy = substr $soundex, 1, 2; print "\nTesting $name ($soundex) ($bit_fuzzy) ($mid_fuzzy)\n\n"; for my $test( keys %hash ) { print "\t$test\n" if $hash{$test} eq $soundex; } print $/; for my $test( keys %hash ) { print "\t$test (bit fuzzy)\n" if $hash{$test} =~ m/$bit_fuzzy. +./; } print $/; for my $test( keys %hash ) { print "\t$test (mid fuzzy)\n" if $hash{$test} =~ m/.$mid_fuzzy +./; } print $/; } __DATA__ Grifferty => G616 Griffey => G610 Griffin => G615 Griffis => G612 Griffith => G613 MacGeady => M230 MacGee => M200 Mackintosh => M253 Magee => M200 McGee => M200 McGrifferty => M261 McIntosh => M253 O'Griffey => O261 O'Griffin => O261 Testing Griffin (G615) (G6) (61) Griffin Griffin (bit fuzzy) Griffis (bit fuzzy) Grifferty (bit fuzzy) Griffith (bit fuzzy) Griffey (bit fuzzy) Griffin (mid fuzzy) Griffis (mid fuzzy) Grifferty (mid fuzzy) Griffith (mid fuzzy) Griffey (mid fuzzy) Testing McGee (M200) (M2) (20) McGee MacGee Magee McGrifferty (bit fuzzy) McGee (bit fuzzy) MacGee (bit fuzzy) Magee (bit fuzzy) Mackintosh (bit fuzzy) McIntosh (bit fuzzy) MacGeady (bit fuzzy) McGee (mid fuzzy) MacGee (mid fuzzy) Magee (mid fuzzy) Testing McGinley (M254) (M2) (25) McGrifferty (bit fuzzy) McGee (bit fuzzy) MacGee (bit fuzzy) Magee (bit fuzzy) Mackintosh (bit fuzzy) McIntosh (bit fuzzy) MacGeady (bit fuzzy) Mackintosh (mid fuzzy) McIntosh (mid fuzzy) Testing Smith (S530) (S5) (53) [download] As you will note from the results the answers are close to what you want. They do highlight some logical issues in your thinking. In one case you effectively demand ignoring the first letter, in the next you assign it importance. No matter what you use for approximation things will get weird if the first letter is WRONG as a few of the mid_fuzzy results show. Also note that if you go for a 2 digit fuzzy match as shown then you will (on average) pull 1/1010 ie 1% of your DB every time for mid_fuzzy and 1/2610 ie 0.4% or your DB with the 2 char bit fuzzy. If you only have 1000 records this is probably not a problem. It is a problem if you have 100,000 - 1,000,000 odd records as a list of 1000-10,000 possibilities is a bit overwhelming. If you want a fuzzy search you would typically have a fuzzometer to let the client make the search progressively more fuzzy the more desperate they get to find whatever! Combining two concurrent fuzzy searches is a potent technique. Even if each search pulls 1% of the DB the union will be much smaller so you will typically reduce the result set by 1,2,3 orders of magnitude. Fuzzy last name and initial would reduce the result set by a factor of roughly 26 (more like 20 but you get the idea). cheers tachyon s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print	[reply] [d/l]
Re: Closest matches from string array by Taulmarill (Deacon) on Oct 27, 2003 at 12:20 UTC
you could calculate the "similarity" using String::Similarity.	[reply]
Re: Closest matches from string array by Abigail-II (Bishop) on Oct 27, 2003 at 12:18 UTC
This is a very open question, and it's not very well defined (you say '50% of the letters', but "McGinley" and "O'Griffin" both have 8 letters, but share only 3). A discussion of what is a good match and what isn't is not Perl related, so I won't go into that. However, you might want to look at the 'Text::Soundex' module, and see whether that will be of any help to you. Why this modules comes with the standard distribution is a mystery to me, but it makes it easy to check out. Abigail	[reply]
Re: Re: Closest matches from string array by Arbogast (Monk) on Oct 27, 2003 at 13:49 UTC
This one has worked for me, and allows control over amount of approximate matching. String::Approx	[reply]
Re: Closest matches from string array by BrowserUk (Patriarch) on Oct 27, 2003 at 14:47 UTC
Soundex is probably your kiddy for this, but you might get better results if you stripped common prefixes like Mc, Mac, O' and maybe a few others. That would probably allow you to tighten the match and still not miss likely hits. You said this was coming from a DB. You ought to consider storing the Soundex values within the table as an alternate key. That would save regenerating the Soundex each time, and allow you to use SQL to only pull the likely candidates, rather than pulling them all and doing the selection yourself. Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "Think for yourself!" - Abigail Hooray!	[reply]
Re: Re: Closest matches from string array by tachyon (Chancellor) on Oct 28, 2003 at 11:36 UTC
Aye laddie, the heathen Scots ave a lot te answer fur. 'Mace' =~ s/ma?c/i :-) Oi who is 'e' cheers tachyon s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print	[reply]
Re: Re: Re: Closest matches from string array by BrowserUk (Patriarch) on Oct 28, 2003 at 12:32 UTC
I was thinking of something along the lines of `$soundex{ $name } = soundex($1) if $name =~ m[^(?:O'\|Mac\|Mc)([A-Z].$)];` [download] Which would only work if the list is nicely capitalised, which going by the few BazB posted, it seems* that his data might be. The idea was that a search for Connor or Keefe, would also find O'Connor and O'Keefe, which using unassisted soundex wouldn't find. #! perl -slw use strict; use Text::Soundex; printf "%20s : %s : %s\n", $_, m[^(?:O'\|Mac\|Mc)([A-Z].*$)] ? soundex($1) : soundex($_), so +undex($_) for qw[ Connor O'Connor Keefe O'Keefe MacDonald McDonald Donald Donaldson O'Donnell Donagal O'Dona +gal ]; __END__ P:\test>test Connor : C560 : C560 O'Connor : C560 : O256 Keefe : K100 : K100 O'Keefe : K100 : O210 MacDonald : D543 : M235 McDonald : D543 : M235 Donald : D543 : D543 Donaldson : D543 : D543 O'Donnell : D540 : O354 Donagal : D524 : D524 O'Donagal : D524 : O352 [download] The first column of soundex codes are the assisted ones, the second unassisted. You can see what a difference it makes. That said. It would screw up sound alikes Magee and MacGee, so it may not be such a good idea. Another possibility would be to store two codes for some names, but then you move into needing to normalise the soundex codes into another table. Which wouldn't be a bad thing, but does add complication. Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "Think for yourself!" - Abigail Hooray!	[reply] [d/l] [select]
Re: Re: Re: Re: Closest matches from string array by tachyon (Chancellor) on Oct 28, 2003 at 15:27 UTC


Your skill will accomplish what the force of many cannot
	PerlMonks