PerlMonks
Duplicate (similarity) detection (SQL)

by zachlipton (Beadle)
on Oct 20, 2003 at 04:35 UTC [id://300470]

zachlipton has asked for the wisdom of the Perl Monks concerning the following question:

I'm working on a Perl-based CGI frontend to a MySQL database and am looking for a good way to do duplicate detection. While it's easy to do a simple match on one field, what I'm looking to do is use a more complex algorithm to compare several different fields and produce a sorted list of database records that are most likely to be similar. False positives are fine; this is more about presenting the user with a list of 5 or 10 possibilities they can scan to see whether the record they are adding is similar to one already in the database.

While I've found some academic papers on the subject that look promising, I'm looking for something that, while not necessarily ready-made, at least demonstrates the algorithm in Perl.

Modules like String::Similarity look good. Does anyone have any recommendations on other tools for this type of work?

Replies are listed 'Best First'.
Re: Duplicate detection (SQL)
by gmax (Abbot) on Oct 20, 2003 at 08:50 UTC

    As tachyon suggests, you should deal with exact matches using a UNIQUE index.

    For "similar" records, the most appropriate strategy depends on the nature of your data. There are cases where part of a record may legitimately be duplicated: people share a first name, or a surname, or both, but two people with the same name, surname, middle name, and date of birth are suspicious. Products may have the same name, but if they also have the same manufacturer and the same price, one of them could be an input mistake.

    However, checking for identical fields is not enough. You must also account for spelling and common inversion mistakes.

    Without going into the details of how I deal with the problem in a specific area, here is the general strategy. In this sequence of checking steps, you go from one step to the next unless you get a hit. If all the steps fail, you should not have a duplicate.

    1. Search for an exact match, or let a UNIQUE index catch the mistakes.
      # User input
      my ($name, $middle, $surname) = ('George', 'Washington', 'Clark');
      my $query = "SELECT whatever FROM mytable
                   WHERE name = ? AND middle = ? AND surname = ?";
      my $sth = $dbh->prepare($query);
      $sth->execute($name, $middle, $surname);
      # check if you got any results
    2. Merge your insertion fields together and search the database for a concatenation of the same fields.
      my $combinedquery = "SELECT whatever FROM mytable
                           WHERE CONCAT(name,middle,surname) = ?";
      my $sth = $dbh->prepare($combinedquery);
      $sth->execute("$name$middle$surname");
      # check if you got any results

      What's the reason for the concatenation? Because this way you can catch the error of somebody entering "George Washington", "", "Clark" or "George", "", "Washington Clark" for name, middle, and surname. This technique may not be applicable to a 'products' table, but keep it in mind, just in case.

    3. Repeat the above pass with a different combination that you think is likely to happen (concat(surname,name,middle) , concat(middle,name,surname) and so on)
    4. Repeat steps 1-3 using SOUNDEX instead of bare values.
      my $query = "SELECT whatever FROM mytable
                   WHERE SOUNDEX(name)    = SOUNDEX(?)
                     AND SOUNDEX(middle)  = SOUNDEX(?)
                     AND SOUNDEX(surname) = SOUNDEX(?)";
      my $sth = $dbh->prepare($query);
      $sth->execute($name, $middle, $surname);
      # check if you got any results
      # ....
      my $combinedquery = "SELECT whatever FROM mytable
                           WHERE SOUNDEX(CONCAT(name,middle,surname)) = SOUNDEX(?)";
      $sth = $dbh->prepare($combinedquery);
      $sth->execute("$name$middle$surname");

    This way, most of the checking is done by the database engine. Your code should only prepare the needed queries with the appropriate combinations.

    Notice that there is no guarantee that you get all the possible duplicates. It depends on what you want to catch and how much free rein you want to give to your users.
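    The whole cascade of checks can be driven by a single loop. Here is a minimal sketch, reusing the placeholder table and column names from the examples above and assuming `$dbh` is an open DBI handle:

```perl
# Hypothetical driver for the checking cascade: try each query variant
# in order and stop at the first one that returns candidate rows.
my @checks = (
    # 1. exact match on the separate fields
    ["WHERE name = ? AND middle = ? AND surname = ?",
     sub { @_ }],
    # 2. concatenated fields, to catch values split at the wrong boundary
    ["WHERE CONCAT(name,middle,surname) = ?",
     sub { join '', @_ }],
    # 4. SOUNDEX versions of the same two checks
    ["WHERE SOUNDEX(name) = SOUNDEX(?) AND SOUNDEX(middle) = SOUNDEX(?)
        AND SOUNDEX(surname) = SOUNDEX(?)",
     sub { @_ }],
    ["WHERE SOUNDEX(CONCAT(name,middle,surname)) = SOUNDEX(?)",
     sub { join '', @_ }],
);

for my $check (@checks) {
    my ($where, $binder) = @$check;
    my $sth = $dbh->prepare("SELECT whatever FROM mytable $where");
    $sth->execute($binder->($name, $middle, $surname));
    my $rows = $sth->fetchall_arrayref;
    if (@$rows) {
        # possible duplicates found: show them to the user and stop
        last;
    }
}
```

    Step 3 (the alternative field orderings) would add a few more entries to `@checks` in the same style.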

    YMMV.

     _  _ _  _  
    (_|| | |(_|><
     _|   
    
Re: Duplicate detection (SQL)
by tachyon (Chancellor) on Oct 20, 2003 at 06:48 UTC

    You will find the currently unpublished module Algorithm::HowSimilar at Re: Module for comparing text. It leverages Algorithm::Diff which is just awesome for all sorts of stuff like this. Also perhaps Re: Closest match Display may be of interest. There is plenty of discussion of the options in the associated threads.

    To stop straight dups use primary keys or unique indexes and let the RDBMS stop it. All you have to do then is catch the exception and do whatever.
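    For example, with `RaiseError` enabled, a violated UNIQUE index can be caught with an eval block. A sketch, assuming MySQL and an invented table with a UNIQUE index on the three name columns:

```perl
# Assumes: UNIQUE INDEX on (name, middle, surname) in mytable.
$dbh->{RaiseError} = 1;
my $ok = eval {
    $dbh->do("INSERT INTO mytable (name, middle, surname) VALUES (?,?,?)",
             undef, $name, $middle, $surname);
    1;
};
unless ($ok) {
    if ($@ =~ /Duplicate entry/) {   # MySQL's duplicate-key message
        # exact duplicate: tell the user instead of dying
    } else {
        die $@;                      # some other database error
    }
}
```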

    While the problem as you describe it may have a solution of sorts using some of these tools, the real issue is how you avoid iterating over ALL the existing data to make sure the new data 'is similar/not too similar', whatever that means.

    Not the same: exact matches are easy. The problem with similarity is that while you can check for it using the approaches above, you need to do it every time against the whole set of data in the DB. In short, it won't scale unless you can hone down what you want more precisely, and preferably work out a hashing algorithm that effectively distills a structure into a form you can INDEX.
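    One way that distillation might look, as a sketch: compute a crude normalized key for each record at insert time, store it in an indexed column (the column name `dup_key` is invented), and only run the expensive similarity comparison against rows that share the key.

```perl
# Hypothetical record "fingerprint": lowercase, strip non-letters, and
# sort the name parts so "Clark, George" and "George Clark" collide.
sub dup_key {
    my @parts = map { lc } grep { defined and length } @_;
    s/[^a-z]//g for @parts;
    return join '|', sort @parts;
}

# At insert time, store the key in an indexed column...
# $dbh->do("INSERT INTO mytable (name, surname, dup_key) VALUES (?,?,?)",
#          undef, $name, $surname, dup_key($name, $surname));
# ...and at check time, fetch only the rows with the same key:
#   SELECT * FROM mytable WHERE dup_key = ?
```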

    This is really a problem in the search space. It may be that you need to index your data and use the new data as the 'search terms' in some way. If you get a close match (in search terms) then....

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: Duplicate detection (SQL)
by serich (Sexton) on Oct 20, 2003 at 18:03 UTC
    Some people have suggested normalizing the DB, but I believe the question is not about finding exact duplicates in a database, just similarities. E.g., sales: a customer calls and uses Bob as his first name instead of Robert, or gives a different office number at the same address. The salesperson should be able to key in his current information and see that there are a few possible matches, so they can update his old information rather than entering new information and ending up with Bob and Robert as separate entries in the DB. So after all that, I don't have a solution, but I'd be interested in what everyone else comes up with.

    ~Erich
      I agree. Calling this node "duplicate detection" was actually a bit of a misnomer ("duplicate" is the term we use in our system for records that are similar enough that they should be marked as duplicates), so I renamed it to "similarities detection."

      I'm not sure whether this will be possible to do at all, because running a String::Similarity comparison on every entry in the db (there could be thousands in a large database) against the target query would likely take far too long. I may try it in a quick spike solution and see whether it's reasonable, but does anyone else have any better ideas?
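      A spike along those lines might look like this. This is a sketch only: it assumes the String::Similarity module from CPAN, that candidate rows have already been fetched into `@candidate_rows`, and the 0.7 threshold is an arbitrary placeholder.

```perl
use String::Similarity;   # CPAN; similarity() returns 0..1

my $target = "$name $surname";
my @scored;
for my $row (@candidate_rows) {    # e.g. rows from a broad SELECT
    my $score = similarity($target, "$row->{name} $row->{surname}");
    push @scored, [$score, $row] if $score > 0.7;
}
# sort best-first and keep the 10 most likely matches to show the user
my @top = (sort { $b->[0] <=> $a->[0] } @scored)[0 .. 9];
```

      The optional third argument to `similarity()` is a lower limit that lets it bail out early on hopeless pairs, which helps when scanning many rows.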

Re: Duplicate detection (SQL)
by zakzebrowski (Curate) on Oct 20, 2003 at 12:15 UTC
    Also see the various Digest:: modules on CPAN. (Not a pure SQL solution.) Given a string (or binary data, or undef), a digest function returns a string that is one-way unique* to that input ("one way" meaning that you cannot recover the content from the digest). So just check whether the digests are the same for various records and you're done...
    * assuming the digest method works. Some are better than others (MD5 versus MD4), and some algorithms are open versus proprietary (MD5 versus SHA)...
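    As a sketch with the core Digest::MD5 module, hashing the concatenated fields gives a short, indexable value that matches only on exact equality:

```perl
use Digest::MD5 qw(md5_hex);

# Same input, same digest; any change in the fields changes it completely.
my $digest = md5_hex(lc join '|', $name, $middle, $surname);
# Store $digest in an indexed column and check new records against it:
#   SELECT id FROM mytable WHERE digest = ?
```

    Note that this catches exact duplicates only: a one-character difference produces a completely different digest, so it does not help with the "similar" case.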


    ----
    Zak
    undef$/;$mmm="J\nutsu\nutss\nuts\nutst\nuts A\nutsn\nutso\nutst\nutsh\ +nutse\nutsr\nuts P\nutse\nutsr\nutsl\nuts H\nutsa\nutsc\nutsk\nutse\n +utsr\nuts";open($DOH,"<",\$mmm);$_=$forbbiden=<$DOH>;s/\nuts//g;print +;
      I assume a check at the database level is the best solution, but it does not work for everybody, e.g. if the database is already populated. In that case you have to normalize the data (e.g. the given address "24 thompsonrd., 10-03" is normalized to "thompsonroad 24, level 10, unit number 3" ...) and eventually validate it. Doing that properly can easily take a few hours :) Then it becomes quite easy to check for duplicates:
      $data{"normalized key"} += 1;
      Cheers Hartwig
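      The counting idea can be sketched like this; the normalization rules below are invented placeholders, since the real rules depend on your data:

```perl
# Hypothetical normalizer: lowercase, expand a common abbreviation,
# strip punctuation, so near-identical addresses map to the same key.
sub normalize {
    my $addr = lc shift;
    $addr =~ s/\brd\b\.?/road/g;    # "rd." -> "road"
    $addr =~ s/[[:punct:]]//g;      # drop punctuation
    $addr =~ s/\s+/ /g;             # collapse whitespace
    return $addr;
}

my %seen;
for my $addr (@all_addresses) {     # every address in the table
    $seen{ normalize($addr) }++;
}
# any key with a count > 1 is a probable duplicate
my @dups = grep { $seen{$_} > 1 } keys %seen;
```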

Node Type: perlquestion [id://300470]
Approved by blokhead
Front-paged by cchampion