Beefy Boxes and Bandwidth Generously Provided by pair Networks
more useful options
 
PerlMonks  

Fuzzy matching of text strings

by srdst13 (Pilgrim)
on Dec 14, 2005 at 13:35 UTC ( #516632=perlquestion: print w/replies, xml ) Need Help??

srdst13 has asked for the wisdom of the Perl Monks concerning the following question:

I have a legacy database (actually a single fully de-normalized table with MANY columns) that I need to move over to SQL and a more orderly data structure. However, much of the data was entered as free-text, so I have columns of things such as names, cities, etc. that vary slightly depending on who did the entering. In the simplest cases, extra white space or different capitalization are used. In more complex cases, nouns are spelled differently or even a word (like a middle name) is left out. So, I need to somehow consolidate these textual variations describing the same thing into a single entity. Concrete example:
Promessa High Spirits
Promessa high Spirits
Promessa Hgh Spirits
Promessa high spirits
These all represent the same animal. I would like to "discover" that (or at least be alerted to the situation) without having to go through the 50000 records like this that I have.

Thanks,
Sean

Replies are listed 'Best First'.
Re: Fuzzy matching of text strings
by planetscape (Chancellor) on Dec 14, 2005 at 14:12 UTC
Re: Fuzzy matching of text strings
by samizdat (Vicar) on Dec 14, 2005 at 13:46 UTC
    String::Approx on CPAN, or the ported (UN!X) app agrep. :D

    Don Wilde
    "There's more than one level to any answer."
Re: Fuzzy matching of text strings
by ww (Archbishop) on Dec 14, 2005 at 14:35 UTC
    Above represent good advice, but it may be profitable (efficient) to normalize the capitalization before dealing with the more complex fuzzy matching needs found in item 3.

    I would be seriously inclined to see if lc'ing everything, and then uc'ing first letter of each word minimizes the work.

    However, this scheme is suggested on the basis of one snippet of your data; if you have to distinguish between Mr. MacHinery and (something) Machinery *OR* if capitalization on the output need be not only consistent but also "correct" -- for unknown values of correct -- you will need something far better than this simple-minded scheme.

Re: Fuzzy matching of text strings
by TedPride (Priest) on Dec 14, 2005 at 19:38 UTC
    Remove all leading and trailing whitespace; change all instances of more than one whitespace character to a space; lowercase; remove punctuation. Then create a 3-character frequency count and match it against the 3-character frequency count for existing fields. If the percentage match is over a certain level, add possible matches to an array. Sort by percentage match.

    Now either choose a record (or field) to merge with, or mark the record as unique and move on. Hopefully you won't have to manually deal with too many records.

Re: Fuzzy matching of text strings
by qq (Hermit) on Dec 14, 2005 at 16:24 UTC

    The String::Approx way may not work well for addresses. You could attempt to normalize them before doing a comparison: eg with Geo::StreetAddress::US or the like.

    I've never actually tried this, and I can imagine its not always straightforward. But you might end up with better data.

    update:typo and removed redundant sentance.

Re: Fuzzy matching of text strings
by ruoso (Curate) on Dec 14, 2005 at 17:19 UTC
    I had a similar problem, which made me write String::Compare... For some reason (that obviously I don't remember) I didn't use String::Approx...
    daniel
      Thanks all for the answers. Finding the similarity between two strings is one (probably the largest) component of my problem. However, there is another component--finding groups of "matches". I guess that I could do all possible pairs and look for similarity between them, forming a graph-like structure connecting "matches" to each other and then look for disconnected components or some such thing. Any thoughts on this second part of the problem? There are any number of possible ways to do it in practice (Graph.pm or even SQL could probably handle it), but it would be great to hear thoughts on the issue.

      Thanks again,
      Sean
        In fact, the process of developing each of the test subroutines was based on the results of the comparision using a subset of the data. What I did, in that case, was continuosly creating new tests and outputting to a csv file A, B and the comparision score. I stopped when I got a good result of both a limit score and having few false positives and false negatives. I think you could do it in the same way, no need for anything much sofisticated, just a subset of the database and many runs improving the type of tests you make.
        daniel
Re: Fuzzy matching of text strings
by Anonymous Monk on Dec 14, 2005 at 18:26 UTC
    I'm not sure about other databases, but Microsft SQL Server has a function called SOUNDEX, which returns a "similarity" string that you can use for comparisons.

    e.g.
    SELECT description, SOUNDEX(description) AS SOUNDEX FROM mytable description SOUNDEX -------------------------------------------- ------- Promessa High Spirits P652 Promessa high Spirits P652 Promessa Hgh Spirits P652 Promessa high spirits P652 And now for something completely different A530 And now, for something completely different A530 And now! Something completely different A530 Something completely different now S535
    Description of SOUNDEX from manual:
    SOUNDEX converts an alpha string to a four-character code to find similar-sounding words or names. The first character of the code is the first character of character_expression and the second through fourth characters of the code are numbers. Vowels in character_expression are ignored unless they are the first letter of the string. String functions can be nested.
      Soundex is a great tool, but in this case it is not doing anything. The reason the first four descriptions in your sample return the same soundex code is because they only processed the "Promess" portion of each record.

      Basically:

      1. Grab the first letter:

      String: Promessa H...
      Soundex: P

      2. Remove all vowels in remaining string:

      String: rmssH
      Soundex: P

      3. Condense duplicate letters:


      String: rmsH
      Soundex: P

      4. Assign 3 digits from l-r based on following key:

      1. b,p,f,v
      2. c,s,k,g,i,q,x,z
      3. d,t
      4. l
      5. m,n
      6. r

      String: rmsH
      Soundex: P6 (6 is for r)

      String: msH
      Soundex: P65 (5 is for m)

      String: sH
      Soundex: P652 (2 is for s)

      DONE AT 3 DIGITS!!! GO NO FURTHER.

      If there are consecutive characters from the same group, such as in the name "Duck", (c and k are both in group 2), the resulting soundex would be D200 (zeros are added to pad right if we run out of letters to change to numbers).

      In summary, soundex is not appropriate for longer strings comparison. If you use it, the following would all be grouped as P652:

      Promessa National Bank
      Promessing Fertilizer Company
      Promessa High Spirits
      Promessing With Me

      Hope this clears up Soundex for everyone.

      Thanks buttroast

      Soundex was designed for matching names and is not very usefull as a general tool. Abbreviate english words may help generating a "canonical" form of a phrase.


      DWIM is Perl's answer to Gödel
Re: Fuzzy matching of text strings
by nothingmuch (Priest) on Dec 15, 2005 at 16:03 UTC

    50 thousand is better than several million... If I were you my strategy would be:

    1. Normalize capitalization: lc
    2. Normalize white space: s/^\s+|\s+$//g; s/\s+/ /g
    3. Save all strings which are normalized into the same thing in a single place (tied MLDBM or in memory hash, push @{ $db{$normalized} }, $original)
    4. use Digest::Nilsimsa to hash all the keys, index using e.g. this method, as discussed in this thread (the original author may have more insight). When indexing this way, chunk the nilsimsa hash into something which will allow you to bunch items together.
    5. iterate the nilsimsa keys, and display any group of original texts whose edit distance (String::Approx) is too great. The user can then decide whether these are duplicate or not. Entries with only one item are, ofcourse, omitted.
    6. Employ a system where human categorized items are remembered, undos are easy to increase the efficiency of the human assisted process
    7. A technique like Apple's Aperture's stack feature (see the shiny publicity videos) can be used - readkey on two hotkeys, and use them to increase or decrease the tolerance of the nilsima hash difference or edit distance, in order to partition into groups easily.
    8. For each chunk of computer or human identified duplicates find the canonical version you would like to use, and insert it into the database.
    -nuffin
    zz zZ Z Z #!perl
Re: Fuzzy matching of text strings
by Moron (Curate) on Dec 15, 2005 at 15:20 UTC
    (untested) (updated to add more transformations)
    #!/usr/bin/perl # usage: $0 [-d "delimiter"] file use strict; use warnings; my $delimiter = '\t'; if ( $ARGV[0] =~ /^\-(.)$/ ) { $_ = $ARGV[1]; if ( /^\"(.*)\"$/ ) { $delimiter = $1; shift @ARGV; shift @ARGV; } } my $file = join( '', @ARGV ); my %seen = (); open my $fh, "<$file" or die "$!: $file\n"; while( <$fh> ) { chomp; my @cols = split( /$delimiter/ ); for ( my $col=0; $col <= $#cols; $col++ ) { Check ( \%seen, $col[$col], $., $col+1 ); } } close $fh; sub Check { my ( $sref, $col, $lno, $colno ) = @_; my $lc = Transform( $col ); my $xref; if ( $xref = $sref -> { $lc } ) { ( $sref -> { $lc }{ $col } ) or print "Column $colno, Line $lno: ($col) " . "varies from " . ( values %$xref ); } else { $sref -> { $lc }{ $col } = "Column $colno, Line $lno: ($col)\n +"; } } sub Transform{ my $lc = lc( shift()); # example of an additional transformation... $lc =~ s/\s*//g; return $lc; }

    -M

    Free your mind

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://516632]
Approved by holli
Front-paged by holli
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others scrutinizing the Monastery: (3)
As of 2022-05-23 01:52 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    Do you prefer to work remotely?



    Results (81 votes). Check out past polls.

    Notices?