Re: Duplicate detection (SQL)

by zakzebrowski (Curate)
on Oct 20, 2003 at 12:15 UTC ( #300541=note )

in reply to Duplicate (similarity) detection (SQL)

Also see the various Digest:: modules on CPAN. (Not a pure SQL solution.) Given a (string | binary data | undef), a digest function returns a fixed-length string that is effectively unique* to that input. It is one-way, meaning that you cannot determine what the content was from the digest. So just compute a digest for each message and check whether any of the digests are the same, and you're done...
* Assuming the digest algorithm holds up. Some are stronger than others (MD5 versus MD4), and they come from different sources (MD5 is Rivest's design, published as RFC 1321; SHA is a NIST standard). Both are publicly specified rather than proprietary.
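
As a rough illustration of the digest comparison described above, here is a minimal sketch. It assumes the messages are already loaded into a Perl array; `@messages`, `%seen`, and `@duplicates` are made-up names for the example, and SHA-256 from the core Digest::SHA module is just one possible choice of digest.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha256_hex);

# Hypothetical sample data; in practice these would come from the database.
my @messages = (
    "Hello world",
    "Another message",
    "Hello world",
);

my %seen;          # digest => index of the first message with that digest
my @duplicates;    # pairs of [duplicate index, index of the original]

for my $i (0 .. $#messages) {
    my $digest = sha256_hex($messages[$i]);
    if (exists $seen{$digest}) {
        # Same digest as an earlier message: treat it as a duplicate.
        push @duplicates, [$i, $seen{$digest}];
    }
    else {
        $seen{$digest} = $i;
    }
}

printf "message %d duplicates message %d\n", @$_ for @duplicates;
```

Note that matching digests only catch byte-for-byte identical input. Two messages that differ by a single space get unrelated digests, so any normalization has to happen before hashing.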

undef$/;$mmm="J\nutsu\nutss\nuts\nutst\nuts A\nutsn\nutso\nutst\nutsh\nutse\nutsr\nuts P\nutse\nutsr\nutsl\nuts H\nutsa\nutsc\nutsk\nutse\nutsr\nuts";open($DOH,"<",\$mmm);$_=$forbbiden=<$DOH>;s/\nuts//g;print;

Re: Re: Duplicate detection (SQL)
by hartwig (Sexton) on Oct 20, 2003 at 13:10 UTC
    I assume checking at the database level is the best solution, but it does not work for everybody, e.g. if the database is already populated. In that case you have to normalize the data (e.g. the given address "24 thompsonrd., 10-03" is normalized to "thompsonroad 24, level 24, unitnumber 3" ...) and eventually validate the data. To do that properly you can easily spend a few hours :) Then it becomes quite easy to check for duplicates:
    $data{"normalized key"} += 1
    Cheers, Hartwig
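
The normalize-then-count idea from this reply might look roughly like the sketch below. `normalize_key` is a hypothetical helper with only a handful of toy rules, and the sample rows are invented for the example. Real address normalization needs far more rules than this, which is where the hours go.

```perl
use strict;
use warnings;

# Hypothetical normalizer: lowercases, expands one abbreviation,
# strips punctuation, and collapses whitespace.
sub normalize_key {
    my ($raw) = @_;
    my $key = lc $raw;
    $key =~ s/\brd\b\.?/road/g;     # standalone "rd" or "rd." becomes "road"
    $key =~ s/[^a-z0-9\s]//g;       # drop remaining punctuation
    $key =~ s/\s+/ /g;              # collapse runs of whitespace
    $key =~ s/^\s+|\s+$//g;         # trim leading and trailing whitespace
    return $key;
}

# Invented sample rows standing in for data already in the database.
my @rows = (
    "24 Thompson Rd.",
    "24  thompson road",
    "7 Elm Street",
);

# Count how often each normalized key occurs.
my %count;
$count{ normalize_key($_) } += 1 for @rows;

# Any key seen more than once marks a group of likely duplicates.
my @dupes = sort grep { $count{$_} > 1 } keys %count;
print "$_ appears $count{$_} times\n" for @dupes;
```

With these sample rows the first two collapse to the same key, so one duplicate group is reported. An abbreviation glued to the street name, as in "thompsonrd.", would not be caught by the word-boundary rule here and would need its own rule.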
