Fuzzy matching of text strings

srdst13 has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Fuzzy matching of text strings by planetscape (Chancellor) on Dec 14, 2005 at 14:12 UTC
Super Search is your friend. Here are a few nodes that may be of interest: Finding a _Similar_ Substring? (Fuzzy Searching?) Fuzzy Searching: Optimizing Algorithm Selection Fuzzy searching Fuzzy Strings Regexp::Approx - Use fuzzy regular expressions similar string matching Similarity of strings text string approxiamtions (concept for review) HTH, planetscape	[reply]
Re: Fuzzy matching of text strings by samizdat (Vicar) on Dec 14, 2005 at 13:46 UTC
String::Approx on CPAN, or the ported (UN!X) app agrep. :D Don Wilde "There's more than one level to any answer."	[reply]
Re: Fuzzy matching of text strings by ww (Archbishop) on Dec 14, 2005 at 14:35 UTC
Above represent good advice, but it may be profitable (efficient) to normalize the capitalization before dealing with the more complex fuzzy matching needs found in item 3. I would be seriously inclined to see if lc'ing everything, and then uc'ing first letter of each word minimizes the work. However, this scheme is suggested on the basis of one snippet of your data; if you have to distinguish between Mr. MacHinery and (something) Machinery OR if capitalization on the output need be not only consistent but also "correct" -- for unknown values of correct -- you will need something far better than this simple-minded scheme.	[reply]
Re: Fuzzy matching of text strings by TedPride (Priest) on Dec 14, 2005 at 19:38 UTC
Remove all leading and trailing whitespace; change all instances of more than one whitespace character to a space; lowercase; remove punctuation. Then create a 3-character frequency count and match it against the 3-character frequency count for existing fields. If the percentage match is over a certain level, add possible matches to an array. Sort by percentage match. Now either choose a record (or field) to merge with, or mark the record as unique and move on. Hopefully you won't have to manually deal with too many records.	[reply]
Re: Fuzzy matching of text strings by qq (Hermit) on Dec 14, 2005 at 16:24 UTC
The String::Approx way may not work well for addresses. You could attempt to normalize them before doing a comparison: eg with Geo::StreetAddress::US or the like. I've never actually tried this, and I can imagine its not always straightforward. But you might end up with better data. update:typo and removed redundant sentance.	[reply]
Re: Fuzzy matching of text strings by ruoso (Curate) on Dec 14, 2005 at 17:19 UTC
I had a similar problem, which made me write String::Compare... For some reason (that obviously I don't remember) I didn't use String::Approx... daniel	[reply]
Re^2: Fuzzy matching of text strings by srdst13 (Pilgrim) on Dec 14, 2005 at 18:11 UTC
Thanks all for the answers. Finding the similarity between two strings is one (probably the largest) component of my problem. However, there is another component--finding groups of "matches". I guess that I could do all possible pairs and look for similarity between them, forming a graph-like structure connecting "matches" to each other and then look for disconnected components or some such thing. Any thoughts on this second part of the problem? There are any number of possible ways to do it in practice (Graph.pm or even SQL could probably handle it), but it would be great to hear thoughts on the issue. Thanks again, Sean	[reply]
Re^3: Fuzzy matching of text strings by ruoso (Curate) on Dec 16, 2005 at 17:15 UTC
In fact, the process of developing each of the test subroutines was based on the results of the comparision using a subset of the data. What I did, in that case, was continuosly creating new tests and outputting to a csv file A, B and the comparision score. I stopped when I got a good result of both a limit score and having few false positives and false negatives. I think you could do it in the same way, no need for anything much sofisticated, just a subset of the database and many runs improving the type of tests you make. daniel	[reply]
Re: Fuzzy matching of text strings by Anonymous Monk on Dec 14, 2005 at 18:26 UTC
I'm not sure about other databases, but Microsft SQL Server has a function called SOUNDEX, which returns a "similarity" string that you can use for comparisons. e.g. `SELECT description, SOUNDEX(description) AS SOUNDEX FROM mytable description SOUNDEX -------------------------------------------- ------- Promessa High Spirits P652 Promessa high Spirits P652 Promessa Hgh Spirits P652 Promessa high spirits P652 And now for something completely different A530 And now, for something completely different A530 And now! Something completely different A530 Something completely different now S535` [download] Description of SOUNDEX from manual: SOUNDEX converts an alpha string to a four-character code to find similar-sounding words or names. The first character of the code is the first character of character_expression and the second through fourth characters of the code are numbers. Vowels in character_expression are ignored unless they are the first letter of the string. String functions can be nested.	[reply] [d/l]
Re^2: Fuzzy matching of text strings by buttroast (Scribe) on Dec 14, 2005 at 19:22 UTC
Soundex is a great tool, but in this case it is not doing anything. The reason the first four descriptions in your sample return the same soundex code is because they only processed the "Promess" portion of each record. Basically: 1. Grab the first letter: String: Promessa H... Soundex: P 2. Remove all vowels in remaining string: String: rmssH Soundex: P 3. Condense duplicate letters: String: rmsH Soundex: P 4. Assign 3 digits from l-r based on following key: 1. b,p,f,v 2. c,s,k,g,i,q,x,z 3. d,t 4. l 5. m,n 6. r String: rmsH Soundex: P6 (6 is for r) String: msH Soundex: P65 (5 is for m) String: sH Soundex: P652 (2 is for s) DONE AT 3 DIGITS!!! GO NO FURTHER. If there are consecutive characters from the same group, such as in the name "Duck", (c and k are both in group 2), the resulting soundex would be D200 (zeros are added to pad right if we run out of letters to change to numbers). In summary, soundex is not appropriate for longer strings comparison. If you use it, the following would all be grouped as P652: Promessa National Bank Promessing Fertilizer Company Promessa High Spirits Promessing With Me Hope this clears up Soundex for everyone. Thanks buttroast	[reply]
Re^2: Fuzzy matching of text strings by GrandFather (Saint) on Dec 15, 2005 at 03:08 UTC
Soundex was designed for matching names and is not very usefull as a general tool. Abbreviate english words may help generating a "canonical" form of a phrase. DWIM is Perl's answer to Gödel	[reply]
Re: Fuzzy matching of text strings by nothingmuch (Priest) on Dec 15, 2005 at 16:03 UTC
50 thousand is better than several million... If I were you my strategy would be: Normalize capitalization: `lc` Normalize white space: `s/^\s+\|\s+$//g; s/\s+/ /g` Save all strings which are normalized into the same thing in a single place (tied MLDBM or in memory hash, `push @{ $db{$normalized} }, $original`) use Digest::Nilsimsa to hash all the keys, index using e.g. this method, as discussed in this thread (the original author may have more insight). When indexing this way, chunk the nilsimsa hash into something which will allow you to bunch items together. iterate the nilsimsa keys, and display any group of original texts whose edit distance (String::Approx) is too great. The user can then decide whether these are duplicate or not. Entries with only one item are, ofcourse, omitted. Employ a system where human categorized items are remembered, undos are easy to increase the efficiency of the human assisted process A technique like Apple's Aperture's stack feature (see the shiny publicity videos) can be used - readkey on two hotkeys, and use them to increase or decrease the tolerance of the nilsima hash difference or edit distance, in order to partition into groups easily. For each chunk of computer or human identified duplicates find the canonical version you would like to use, and insert it into the database. -nuffin zz zZ Z Z #!perl	[reply] [d/l] [select]
Re: Fuzzy matching of text strings by Moron (Curate) on Dec 15, 2005 at 15:20 UTC
(untested) (updated to add more transformations) #!/usr/bin/perl # usage: $0 [-d "delimiter"] file use strict; use warnings; my $delimiter = '\t'; if ( $ARGV[0] =~ /^\-(.)$/ ) { $_ = $ARGV[1]; if ( /^\"(.)\"$/ ) { $delimiter = $1; shift @ARGV; shift @ARGV; } } my $file = join( '', @ARGV ); my %seen = (); open my $fh, "<$file" or die "$!: $file\n"; while( <$fh> ) { chomp; my @cols = split( /$delimiter/ ); for ( my $col=0; $col <= $#cols; $col++ ) { Check ( \%seen, $col[$col], $., $col+1 ); } } close $fh; sub Check { my ( $sref, $col, $lno, $colno ) = @_; my $lc = Transform( $col ); my $xref; if ( $xref = $sref -> { $lc } ) { ( $sref -> { $lc }{ $col } ) or print "Column $colno, Line $lno: ($col) " . "varies from " . ( values %$xref ); } else { $sref -> { $lc }{ $col } = "Column $colno, Line $lno: ($col)\n +"; } } sub Transform{ my $lc = lc( shift()); # example of an additional transformation... $lc =~ s/\s//g; return $lc; } [download] -M Free your mind	[reply] [d/l]


more useful options
	PerlMonks