http://qs321.pair.com?node_id=300509


in reply to Duplicate (similarity) detection (SQL)

As tachyon suggests, you should deal with exact matches using a UNIQUE index.

For "similar" records, the most appropriate strategy depends on the nature of your data. There are cases where a part of the record can be duplicated. People have the same first name, or the same surname, or both, but two people with the same name, surname, middle name, and date of birth are suspicious. Products may have the same name, but if they have also the same manufacturer and the same price, they could be an input mistake.

However, checking for identical fields is not enough. You must also account for spelling and common inversion mistakes.

Without entering into details of how I deal with the problem in a specific area, I give you the general strategy. In this sequence of checking steps, you go from one step to the next unless you get a hit. If all the steps fail, then you should not have a duplicate.

  1. Search for an exact match, or let a UNIQUE index catch the mistakes.
    # User input my ($name,$middle,$surname) = ('George', 'Washington', 'Clark'); my $query = " SELECT whatever FROM mytable WHERE name = ? AND middle = ? AND surname = ?"; my $sth = $dbh->prepare($query); $sth->execute>($name, $middle, $surname); # check if you got any results
  2. Merge your insertion fields together and search the database for a concatenation of the same fields.
    my $combinedquery = " SELECT whatever FROM mytable WHERE CONCAT(name,middle,surname) = ?"; my $sth = $dbh->prepare($combinedquery); $sth->execute>("$name$middle$surname"); # check if you got any results

    What's the reason for concatenation? Because this way you can catch an error od somebody entering "George Washington", "", "Clark" or "George", "", "Washington Clark" for name, middle and surname. This technique may not be applicable to a 'products' table, but keep it in mind, just in case.

  3. Repeat the above pass with a different combination that you think is likely to happen (concat(surname,name,middle) , concat(middle,name,surname) and so on)
  4. Repeat steps 1-3 using SOUNDEX instead of bare values.
    my $query = " SELECT whatever FROM mytable WHERE SOUNDEX(name) = SOUNDEX(?) AND SOUNDEX(middle) = SOUNDEX(?) AND SOUNDEX(surname) = SOUNDEX(?)"; my $sth = $dbh->prepare($query); $sth->execute>($name, $middle, $surname); # check if you got any results # .... my $combinedquery = " SELECT whatever FROM mytable WHERE SOUNDEX(CONCAT(name,middle,surname)) = SOUNDEX(?)"; $sth = $dbh->prepare($combinedquery); $sth->execute>("$name$middle$surname");

This way, most of the checking is done by the database engine. Your code should only prepare the needed queries with the appropriate combinations.

Notice that there is no guarantee that you get all the possible duplicates. It depends on what you want to catch and how much free rein you want to give to your users.

YMMV.

 _  _ _  _  
(_|| | |(_|><
 _|