PerlMonks
Duplicate (similarity) detection (SQL)

by zachlipton (Beadle)
on Oct 20, 2003 at 04:35 UTC [id://300470]

zachlipton has asked for the wisdom of the Perl Monks concerning the following question:

I'm working on a Perl-based CGI frontend to a MySQL database and am looking for a good way to do duplicate detection. While it's easy to do a simple match on one field, what I'm looking to do is use a more complex algorithm to compare several different fields and produce a sorted list of database records that are most likely to be similar. False positives are fine; this is more about presenting the user with a list of 5 or 10 possibilities they can scan to see whether the record they are adding is similar to one already in the database.

While I've found some academic papers on the subject that look promising, I'm looking for something that, while not necessarily ready-made, at least demonstrates the algorithm in Perl.

Modules like String::Similarity look good. Does anyone have any recommendations on other tools for this type of work?

Replies are listed 'Best First'.
Re: Duplicate detection (SQL)
by gmax (Abbot) on Oct 20, 2003 at 08:50 UTC

    As tachyon suggests, you should deal with exact matches using a UNIQUE index.

    For "similar" records, the most appropriate strategy depends on the nature of your data. There are cases where part of a record may legitimately be duplicated: people share a first name, or a surname, or both, but two people with the same name, surname, middle name, and date of birth are suspicious. Products may have the same name, but if they also have the same manufacturer and the same price, one of them could be an input mistake.

    However, checking for identical fields is not enough. You must also account for spelling and common inversion mistakes.

    Without going into the details of how I deal with the problem in a specific area, here is the general strategy. In this sequence of checking steps, you go from one step to the next unless you get a hit. If all the steps fail, you should not have a duplicate.

    1. Search for an exact match, or let a UNIQUE index catch the mistakes.
      # User input
      my ($name, $middle, $surname) = ('George', 'Washington', 'Clark');
      my $query = "SELECT whatever FROM mytable
                   WHERE name = ? AND middle = ? AND surname = ?";
      my $sth = $dbh->prepare($query);
      $sth->execute($name, $middle, $surname);
      # check if you got any results
    2. Merge your insertion fields together and search the database for a concatenation of the same fields.
      my $combinedquery = "SELECT whatever FROM mytable
                           WHERE CONCAT(name,middle,surname) = ?";
      my $sth = $dbh->prepare($combinedquery);
      $sth->execute("$name$middle$surname");
      # check if you got any results

      What's the reason for the concatenation? Because this way you can catch the error of somebody entering "George Washington", "", "Clark" or "George", "", "Washington Clark" for name, middle, and surname. This technique may not be applicable to a 'products' table, but keep it in mind, just in case.

    3. Repeat the above pass with a different combination that you think is likely to happen (concat(surname,name,middle) , concat(middle,name,surname) and so on)
    4. Repeat steps 1-3 using SOUNDEX instead of bare values.
      my $query = "SELECT whatever FROM mytable
                   WHERE SOUNDEX(name)    = SOUNDEX(?)
                     AND SOUNDEX(middle)  = SOUNDEX(?)
                     AND SOUNDEX(surname) = SOUNDEX(?)";
      my $sth = $dbh->prepare($query);
      $sth->execute($name, $middle, $surname);
      # check if you got any results
      # ....
      my $combinedquery = "SELECT whatever FROM mytable
                           WHERE SOUNDEX(CONCAT(name,middle,surname)) = SOUNDEX(?)";
      $sth = $dbh->prepare($combinedquery);
      $sth->execute("$name$middle$surname");

    This way, most of the checking is done by the database engine. Your code should only prepare the needed queries with the appropriate combinations.

    Notice that there is no guarantee that you get all the possible duplicates. It depends on what you want to catch and how much free rein you want to give to your users.
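    The whole cascade of checks can be driven by a single loop. Here is a minimal sketch, reusing the placeholder table and column names from the examples above and assuming `$dbh` is an open DBI handle:

```perl
# Hypothetical driver for the checking cascade: try each query variant
# in order and stop at the first one that returns candidate rows.
my @checks = (
    # 1. exact match on the separate fields
    ["WHERE name = ? AND middle = ? AND surname = ?",
     sub { @_ }],
    # 2. concatenated fields, to catch values split at the wrong boundary
    ["WHERE CONCAT(name,middle,surname) = ?",
     sub { join '', @_ }],
    # 4. SOUNDEX versions of the same two checks
    ["WHERE SOUNDEX(name) = SOUNDEX(?) AND SOUNDEX(middle) = SOUNDEX(?)
        AND SOUNDEX(surname) = SOUNDEX(?)",
     sub { @_ }],
    ["WHERE SOUNDEX(CONCAT(name,middle,surname)) = SOUNDEX(?)",
     sub { join '', @_ }],
);

for my $check (@checks) {
    my ($where, $binder) = @$check;
    my $sth = $dbh->prepare("SELECT whatever FROM mytable $where");
    $sth->execute($binder->($name, $middle, $surname));
    my $rows = $sth->fetchall_arrayref;
    if (@$rows) {
        # possible duplicates found: show them to the user and stop
        last;
    }
}
```

    Step 3 (the alternative field orderings) would add a few more entries to `@checks` in the same style.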

    YMMV.

     _  _ _  _  
    (_|| | |(_|><
     _|   
    
Re: Duplicate detection (SQL)
by tachyon (Chancellor) on Oct 20, 2003 at 06:48 UTC

    You will find the currently unpublished module Algorithm::HowSimilar at Re: Module for comparing text. It leverages Algorithm::Diff which is just awesome for all sorts of stuff like this. Also perhaps Re: Closest match Display may be of interest. There is plenty of discussion of the options in the associated threads.

    To stop straight dups use primary keys or unique indexes and let the RDBMS stop it. All you have to do then is catch the exception and do whatever.
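    For example, with `RaiseError` enabled, a violated UNIQUE index can be caught with an eval block. A sketch, assuming MySQL and an invented table with a UNIQUE index on the three name columns:

```perl
# Assumes: UNIQUE INDEX on (name, middle, surname) in mytable.
$dbh->{RaiseError} = 1;
my $ok = eval {
    $dbh->do("INSERT INTO mytable (name, middle, surname) VALUES (?,?,?)",
             undef, $name, $middle, $surname);
    1;
};
unless ($ok) {
    if ($@ =~ /Duplicate entry/) {   # MySQL's duplicate-key message
        # exact duplicate: tell the user instead of dying
    } else {
        die $@;                      # some other database error
    }
}
```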

    While the problem as you describe it may have a solution of sorts using some of these tools, the real issue is how you avoid iterating over ALL the existing data to make sure the new data 'is similar/not too similar', whatever that means.

    Not the same: exact matches are easy. The problem with similarity is that while you can check for it using the approaches above, you need to do it every time against the whole set of data in the DB. In short, it won't scale unless you can hone down what you want more precisely, and preferably work out a hashing algorithm that effectively distills a structure into a form you can INDEX.
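    One way that distillation might look, as a sketch: compute a crude normalized key for each record at insert time, store it in an indexed column (the column name `dup_key` is invented), and only run the expensive similarity comparison against rows that share the key.

```perl
# Hypothetical record "fingerprint": lowercase, strip non-letters, and
# sort the name parts so "Clark, George" and "George Clark" collide.
sub dup_key {
    my @parts = map { lc } grep { defined and length } @_;
    s/[^a-z]//g for @parts;
    return join '|', sort @parts;
}

# At insert time, store the key in an indexed column...
# $dbh->do("INSERT INTO mytable (name, surname, dup_key) VALUES (?,?,?)",
#          undef, $name, $surname, dup_key($name, $surname));
# ...and at check time, fetch only the rows with the same key:
#   SELECT * FROM mytable WHERE dup_key = ?
```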

    This is really a problem in the search space. It may be that you need to index your data and use the new data as the 'search terms' in some way. If you get a close match (in search terms) then....

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: Duplicate detection (SQL)
by serich (Sexton) on Oct 20, 2003 at 18:03 UTC
    Some people have suggested normalizing the DB, but I believe the question is not about finding exact duplicates in a database, just similarities. E.g., sales: a customer calls and uses Bob as his first name instead of Robert, or gives a different office number at the same address. The salesperson should be able to key in his current information and see that there are a few possible matches, so they can update his old information rather than entering new information and ending up with Bob and Robert as separate entries in the DB. So after all that, I don't have a solution, but I'd be interested in what everyone else comes up with.

    ~Erich
      I agree. Calling this node "duplicate detection" was actually a bit of a misnomer ("duplicate" is the term we use in our system for records that are similar enough that they should be marked as duplicates), so I renamed it to "similarities detection."

      I'm not sure whether this will be possible to do at all, because running a String::Similarity comparison on every entry in the db (there could be thousands in a large database) against the target query would likely take far too long. I may try it in a quick spike solution and see whether it's reasonable, but does anyone else have any better ideas?
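      A spike along those lines might look like this. This is a sketch only: it assumes the String::Similarity module from CPAN, that candidate rows have already been fetched into `@candidate_rows`, and the 0.7 threshold is an arbitrary placeholder.

```perl
use String::Similarity;   # CPAN; similarity() returns 0..1

my $target = "$name $surname";
my @scored;
for my $row (@candidate_rows) {    # e.g. rows from a broad SELECT
    my $score = similarity($target, "$row->{name} $row->{surname}");
    push @scored, [$score, $row] if $score > 0.7;
}
# sort best-first and keep the 10 most likely matches to show the user
my @top = (sort { $b->[0] <=> $a->[0] } @scored)[0 .. 9];
```

      The optional third argument to `similarity()` is a lower limit that lets it bail out early on hopeless pairs, which helps when scanning many rows.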

Re: Duplicate detection (SQL)
by zakzebrowski (Curate) on Oct 20, 2003 at 12:15 UTC
    Also see the various Digest:: modules on CPAN. (Not a pure SQL solution.) Given a string (or binary data, or undef), a digest function returns a string that is one-way unique* to that input ("one way" meaning that you cannot recover the content from the digest). So just check whether the digests are the same for various records and you're done...
    * assuming the digest method works. Some are better than others (MD5 versus MD4), and some algorithms are open versus proprietary (MD5 versus SHA)...
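    As a sketch with the core Digest::MD5 module, hashing the concatenated fields gives a short, indexable value that matches only on exact equality:

```perl
use Digest::MD5 qw(md5_hex);

# Same input, same digest; any change in the fields changes it completely.
my $digest = md5_hex(lc join '|', $name, $middle, $surname);
# Store $digest in an indexed column and check new records against it:
#   SELECT id FROM mytable WHERE digest = ?
```

    Note that this catches exact duplicates only: a one-character difference produces a completely different digest, so it does not help with the "similar" case.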


    ----
    Zak
    undef$/;$mmm="J\nutsu\nutss\nuts\nutst\nuts A\nutsn\nutso\nutst\nutsh\ +nutse\nutsr\nuts P\nutse\nutsr\nutsl\nuts H\nutsa\nutsc\nutsk\nutse\n +utsr\nuts";open($DOH,"<",\$mmm);$_=$forbbiden=<$DOH>;s/\nuts//g;print +;
      I assume a check at the database level is the best solution, but it does not work for everybody, e.g. if the database is already populated. In that case you have to normalize the data (e.g. the given address "24 thompsonrd., 10-03" is normalized to "thompsonroad 24, level 10, unit number 3" ...) and eventually validate it. Doing that properly can easily take a few hours :) Then it becomes quite easy to check for duplicates:
      $data{"normalized key"} += 1;
      Cheers Hartwig
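      The counting idea can be sketched like this; the normalization rules below are invented placeholders, since the real rules depend on your data:

```perl
# Hypothetical normalizer: lowercase, expand a common abbreviation,
# strip punctuation, so near-identical addresses map to the same key.
sub normalize {
    my $addr = lc shift;
    $addr =~ s/\brd\b\.?/road/g;    # "rd." -> "road"
    $addr =~ s/[[:punct:]]//g;      # drop punctuation
    $addr =~ s/\s+/ /g;             # collapse whitespace
    return $addr;
}

my %seen;
for my $addr (@all_addresses) {     # every address in the table
    $seen{ normalize($addr) }++;
}
# any key with a count > 1 is a probable duplicate
my @dups = grep { $seen{$_} > 1 } keys %seen;
```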

Node Type: perlquestion [id://300470]
Approved by blokhead
Front-paged by cchampion