PerlMonks
Re: Question: practical way to find dupes in a large dataset
by derby (Abbot) on Dec 22, 2010 at 12:10 UTC ( [id://878491] )
Hmmm ... well, probably not faster, but at least more accurate: if they're US addresses (and the zip code makes me think so), you could use the USPS web service for this. There are limits (5 requests per transaction) and it's going to be slow -- but at least the addresses will be correct (especially if your goal is to *use* the address data to send mail!).

Once the addresses are standardized, I would create a new table where contact_name is not part of the unique constraint and see what happens when you load the data. If it appears the names are misspelled, truncated, or typo-ed, well, then your biggest problem is which one to choose. If there are multiple distinct names per address and you wish to keep them, I would add them back in *after* the initial load (and after altering the table to put contact_name back into the unique constraint).
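A minimal sketch of the grouping step, assuming the real cleanup comes from the USPS service. The `normalize_address` routine here is only a crude stand-in for genuine USPS standardization, and the record layout (`name`/`address` hash keys) is made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Crude stand-in for USPS standardization: uppercase, strip
# punctuation, collapse whitespace. Real standardization should
# come back from the USPS web service.
sub normalize_address {
    my ($addr) = @_;
    $addr = uc $addr;
    $addr =~ s/[[:punct:]]//g;   # drop commas, periods, etc.
    $addr =~ s/\s+/ /g;          # collapse runs of whitespace
    $addr =~ s/^\s+|\s+$//g;     # trim leading/trailing space
    return $addr;
}

# Group records by normalized address. Addresses that collect more
# than one contact name are the candidate dupes you'd resolve by
# hand (or re-add after the initial load).
sub find_dupes {
    my (@records) = @_;   # each record: { name => ..., address => ... }
    my %by_addr;
    for my $r (@records) {
        my $key = normalize_address( $r->{address} );
        push @{ $by_addr{$key} }, $r->{name};
    }
    return { map  { $_ => $by_addr{$_} }
             grep { @{ $by_addr{$_} } > 1 } keys %by_addr };
}
```

With two spellings of the same street address, both names land under one normalized key, so the conflict is visible before you decide which name wins.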
-derby
In Section: Seekers of Perl Wisdom