comment on

Occasionaly I have to de-dup mailing lists we receive from the government. Since my lists often include duplicates of the same address and duplicate names (different addresses) I use a variety of methods.

First I get the list with identical names and cities (uppercaseing both to avoid case issues). I suppose that probably gets false positives but we would prefer that in our case.

Then I take all the addresses and pull ones that the first 7 characters match on. It is pretty hard to have the same first 7 and note be a duplicate. Agian that is with full uppercasing of all fields. This also produces false positives but combining it with a name match makes it prety efficient. Depending on the list and the source I vary the number I choose. Normaly choosing a couple numbers and hand sampling until I reach what i feel is a happy medium.

These methods have helped me narrow 15k addresses down into a more accurate 10k. I think any time you try matching like this you are going to get false positives, but if you are looking for a way to flag some for human intervention then this can be a pretty good test. Best of Luck.

Sorry I meant to mention that the benifit of these is no regex, just plain old DB functions that are easy and quickly handled. For better results you could create a new colum and do some normalization like suggested by some above.

___________
Eric Hodges

In reply to Re: De Duping Street Addresses Fuzzily by eric256
in thread De Duping Street Addresses Fuzzily by patrickrock

Are you posting in the right place? Check out Where do I post X? to know for sure.
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
Want more info? How to link or How to display code and escape characters are good places to start.


Don't ask to ask, just ask
	PerlMonks