PerlMonks  

Re: Question: practical way to find dupes in a large dataset

by derby (Abbot)
on Dec 22, 2010 at 12:10 UTC ( [id://878491] )


in reply to Question: practical way to find dupes in a large dataset

Hmmm ... well probably not faster but at least more accurate ...

If they're US addresses (and the zip code makes me think so), you could use the USPS web service for this. There are limits (5 per transaction) and it's going to be slow -- but at least they'll be correct (especially if your goal is to *use* the address data to send mail!).
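A minimal sketch of working within that five-per-transaction limit: chunk the address list and submit one batch at a time. `standardize_batch()` is a hypothetical wrapper around the actual USPS call (the real request/response format is omitted here); only the batching logic is shown.

```perl
use strict;
use warnings;

# The USPS service takes a limited number of addresses per transaction,
# so split the full list into chunks and submit one chunk at a time.
# standardize_batch() is a hypothetical stand-in for the real API call;
# it takes up to $batch_size address hashrefs and returns cleaned copies.
sub standardize_all {
    my ($addresses, $batch_size) = @_;
    $batch_size ||= 5;
    my @clean;
    # Note: splice consumes the caller's array; pass in a copy if you
    # need the original list afterwards.
    while (my @batch = splice @$addresses, 0, $batch_size) {
        push @clean, standardize_batch(@batch);
    }
    return \@clean;
}
```

With thousands of rows this is where the "slow" part comes in, so it's worth caching results keyed on the raw address string so repeated rows only cost one request.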

Once the addresses are standardized -- I would then create a new table where contact_name is not part of the unique constraint and see what happens when you load the data. If it appears the names are misspelled, truncated, or typoed, well, then your biggest problem is which one to choose. If there are multiple distinct names per address and you wish to keep them, I would add them back in *after* the initial load (and after altering the table to put contact_name back into the unique constraint).
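You can preview what that load would reject without touching the database: key a hash on every column *except* contact_name, and any key with more than one row is exactly a collision the new unique constraint would catch. The field names here are assumptions -- adjust to your schema.

```perl
use strict;
use warnings;

# Group rows by the address fields alone, mirroring a unique constraint
# that excludes contact_name. Groups with more than one row are the
# duplicates whose names you'd then have to reconcile by hand.
# Field names (street, city, state, zip) are illustrative.
sub find_dupes {
    my ($rows) = @_;
    my %by_addr;
    for my $r (@$rows) {
        # Lowercase and join with a separator unlikely to appear in data.
        my $key = join "\x1F",
                  map { lc($r->{$_} // '') } qw(street city state zip);
        push @{ $by_addr{$key} }, $r;
    }
    return [ grep { @$_ > 1 } values %by_addr ];
}
```

This only catches rows that standardization made byte-identical (modulo case); that's the point of running the USPS cleanup first.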

-derby
