patrickrock has asked for the wisdom of the Perl Monks concerning the following question:
555 Some Street
555 Some St.
555 Some Street, ste 666
555 Some St., Suite 666
etc...
You get the idea.
The list will eventually be vetted by human eyes with customer experience, but I would like to at least flag the problem children and make their lives easier.
Any hints on how to attack this from a regex point of view?
Thanks in advance, Pat Rock
Replies are listed 'Best First'.
Re: De Duping Street Addresses Fuzzily
by legato (Monk) on Feb 01, 2005 at 04:06 UTC
Well, the first thing I would do is use the US Postal Service Web API. You need to request permission, but they almost always grant it (there's even a checkbox on the application that says you'll be using it to cleanse address databases). This lets you send the addresses as they exist in your DB and get the USPS-normalized "official" address back.
Variations on an address should result in the same canonical address. You can then check for duplicate canonical addresses with a simple string-equality compare. You may need to do a little data cleaning first, like collapsing multiple spaces ($address =~ s{\s+}{\x20}g;, for example), but this should catch the vast majority of your duplicates. Anima Legato
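The compare step above can be sketched in a few lines. This is a minimal illustration with made-up sample data (the real input would be the normalized addresses coming back from the USPS API); the only cleanup shown is whitespace collapsing and uppercasing:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Clean up an already-normalized address so trivially different
# strings compare equal: uppercase, collapse whitespace, trim ends.
sub clean {
    my ($address) = @_;
    $address = uc $address;
    $address =~ s{\s+}{\x20}g;      # collapse runs of whitespace
    $address =~ s{^\s+|\s+$}{}g;    # trim leading/trailing space
    return $address;
}

# Hypothetical sample data standing in for API output.
my @addresses = (
    '555 SOME ST  STE 666',
    ' 555 Some St Ste 666',
    '123 Main St',
);

# Bucket records by their cleaned form; any bucket with more than
# one entry holds candidate duplicates.
my %seen;
push @{ $seen{ clean($_) } }, $_ for @addresses;

for my $canon (sort keys %seen) {
    next unless @{ $seen{$canon} } > 1;
    print "Possible duplicates for '$canon':\n";
    print "  $_\n" for @{ $seen{$canon} };
}
```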
Re: De Duping Street Addresses Fuzzily
by Limbic~Region (Chancellor) on Feb 01, 2005 at 01:01 UTC
I am surprised no one has mentioned Lingua::EN::AddressParse yet. It will not be a 100% solution. As indicated elsewhere in this thread, commercial products such as Group 1 are really good at this. Using the module, though, you can probably reduce the amount of work that needs to be done by hand to about 10%. You could use something like Geo::Coder::US to help determine whether an address you have is actually valid. A bit more research on CPAN might turn up even more goodies. Cheers - L~R

Update: On further review of this module, it appears the Parse::RecDescent grammar for US addresses could use some TLC. The author, Kim Ryan, appears to be from down under, and complex US addresses don't seem to get parsed correctly. I bet someone here can improve it though ;-)
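A rough sketch of how the module might be used for this. The constructor options and method names below follow Lingua::EN::AddressParse's documented interface as I remember it; check the current CPAN docs before relying on them, and treat the sample addresses as made up:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Lingua::EN::AddressParse;   # CPAN module, not core

# Option names per the module docs (verify against your version).
my $parser = Lingua::EN::AddressParse->new(
    country    => 'US',
    auto_clean => 1,
    force_case => 1,
);

for my $line ('555 Some Street Anytown CA 90210',
              '555 Some St. Anytown CA 90210') {
    my $error = $parser->parse($line);
    if ($error) {
        # Unparseable records are exactly the "problem children"
        # worth flagging for human review.
        warn "Could not parse, flag for review: $line\n";
        next;
    }
    # Join the parsed components into a canonical key; addresses that
    # differ only in abbreviation should collapse together.
    my %comp = $parser->components;
    my $key  = join '|', map { uc $comp{$_} }
               grep { defined $comp{$_} && length $comp{$_} }
               sort keys %comp;
    print "$key <= $line\n";
}
```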
by Anonymous Monk on Feb 01, 2005 at 01:46 UTC
Re: De Duping Street Addresses Fuzzily
by brian_d_foy (Abbot) on Jan 31, 2005 at 22:46 UTC
When I was looking at the postal regulations for my periodicals license for The Perl Review, I discovered that there is a whole industry that does just this: give them a list and they give it back to you with duplicates removed and addresses cleaned up. This comes as a true service (someone else does the work) or as off-the-shelf software. Depending on how much work you have to do and how much it is worth to the company, you might want to skip doing this yourself.

However, if you program it yourself, part of the solution (along with what people have already mentioned) is getting the canonical address. People will often give you their own version of their address (for instance, I can't remember if my street is a Road or an Avenue, so I use them interchangeably). The US Postal Service has all sorts of data and tools to help you figure out what it should be based on the zip code. Other post offices have similar things (the Royal Mail address lookup kicks ass). From there you get closer to finding the duplicates than by just looking at the address the customer gave you or someone keyed in. Good luck, and let us know how it turns out.
-- brian d foy <bdfoy@cpan.org>
Re: De Duping Street Addresses Fuzzily
by Fletch (Bishop) on Jan 31, 2005 at 22:32 UTC
First of all you need to define what "fuzzily matching" means. From your sample data I'd say you want "same city/state/zip with the same street name and number", or something close to that. Once you've decided on that, you need to tokenize the address and map what you've got in the DB into a canonical form: split on spaces and write a little parser that rewrites abbreviations ("St.", "Ste") to canonical words. Once you've parsed everything into this canonical form, push the real data onto a hash-of-arrays keyed by the canonical form (for large enough data sets you may want to use DB_File or the like). The last step is to print the canonical form and the raw data for any key with more than one entry for the given canonical form.
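The tokenize-then-bucket approach above might look something like this. The abbreviation table is a tiny illustrative stub, not an exhaustive one, and the sample addresses are the OP's:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Abbreviation table: extend as needed for your data.
my %canon_word = (
    'ST'  => 'STREET', 'ST.'  => 'STREET',
    'AVE' => 'AVENUE', 'AVE.' => 'AVENUE',
    'STE' => 'SUITE',  'STE.' => 'SUITE',
);

# Tokenize, strip trailing commas, map abbreviations, rejoin.
sub canonical {
    my ($address) = @_;
    my @tokens = split ' ', uc $address;
    s/,$// for @tokens;
    @tokens = map { $canon_word{$_} // $_ } @tokens;
    return join ' ', @tokens;
}

my @raw = (
    '555 Some Street',
    '555 Some St.',
    '555 Some Street, ste 666',
    '555 Some St., Suite 666',
);

# Hash-of-arrays keyed by canonical form; for very large data sets,
# tie %dups to DB_File or similar instead of keeping it in memory.
my %dups;
push @{ $dups{ canonical($_) } }, $_ for @raw;

# Report any canonical form with more than one raw entry.
for my $key (sort keys %dups) {
    next unless @{ $dups{$key} } > 1;
    print "$key\n";
    print "    $_\n" for @{ $dups{$key} };
}
```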
Re: De Duping Street Addresses Fuzzily
by eyepopslikeamosquito (Archbishop) on Feb 01, 2005 at 01:58 UTC
I wrote one of these in C years ago, before I knew Perl. We got the job because the mailing house didn't fancy paying hundreds of thousands of dollars for a commercial US address deduplicator that didn't work particularly well on Australian addresses. The job took months, was a fixed-price contract, and I think we lost money on that one. I remember that squeezing out high performance when de-duplicating millions of addresses was a challenge.

The obvious general approach is to parse the addresses into a canonical internal form, then use that to compare addresses. This sort of software is necessarily riddled with heuristics and ambiguities and can never be perfect -- for example, does "Mr John and Bill Camel" mean "Mr John" and "Mr Bill Camel" or "Mr John Camel" and "Mr Bill Camel"?

For performance reasons you can't afford to compare every address with every other one, so you need to break them into "buckets" and compare all addresses within each bucket. How do you choose the buckets? Not sure, but I remember bucketing on post code worked out quite well for us.

This thread may be of interest: Fuzzy matching of postal addresses on comp.lang.python of 17-jan-2005

Update: Kim Ryan has years of commercial experience in this field, so I suggest you check out his CPAN modules.

Update: See also: Re^3: Split first and last names (References on Parsing Names and Addresses)
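The bucketing idea above can be sketched as follows. The record layout and sample data are invented for illustration; the point is that each record is compared only against the much smaller set sharing its post code, so the cost is the sum of squared bucket sizes rather than n squared over the whole file:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical records: [id, address, post_code].
my @records = (
    [1, '555 Some Street', '90210'],
    [2, '555 Some St.',    '90210'],
    [3, '12 Other Road',   '10001'],
);

# Bucket by post code.
my %bucket;
push @{ $bucket{ $_->[2] } }, $_ for @records;

# Compare all pairs within each bucket only. The print here stands
# in for whatever fuzzy comparison you use (canonical-form equality,
# edit distance, ...).
for my $zip (sort keys %bucket) {
    my @in_bucket = @{ $bucket{$zip} };
    for my $i (0 .. $#in_bucket - 1) {
        for my $j ($i + 1 .. $#in_bucket) {
            print "compare $in_bucket[$i][0] vs $in_bucket[$j][0] in $zip\n";
        }
    }
}
```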
Re: De Duping Street Addresses Fuzzily
by eric256 (Parson) on Feb 01, 2005 at 00:10 UTC
Occasionally I have to de-dup mailing lists we receive from the government. Since my lists often include duplicates of the same address and duplicate names (at different addresses), I use a variety of methods. First I get the list with identical names and cities (uppercasing both to avoid case issues). That probably produces some false positives, but in our case we prefer that. Then I take all the addresses and pull the ones that match on the first 7 characters. It is pretty hard to have the same first 7 characters and not be a duplicate. Again, that is with full uppercasing of all fields. This also produces false positives, but combining it with a name match makes it pretty efficient. Depending on the list and the source I vary the number of characters, normally trying a couple of values and hand-sampling until I reach what I feel is a happy medium. These methods have helped me narrow 15k addresses down into a more accurate 10k.

I think any time you try matching like this you are going to get false positives, but if you are looking for a way to flag some records for human intervention then this can be a pretty good test. Best of luck.

Sorry, I meant to mention that the benefit of these methods is that there is no regex involved, just plain old DB functions that are easy and quickly handled. For better results you could create a new column and do some normalization as suggested above. ___________ Eric Hodges
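The name-plus-prefix test described above might be sketched like this. The rows are made-up sample data, and the prefix length of 7 is the tunable knob mentioned in the post:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical mailing-list rows.
my @rows = (
    { name => 'Pat Rock', address => '555 Some Street' },
    { name => 'PAT ROCK', address => '555 Som St.'     },
    { name => 'Jane Doe', address => '12 Other Road'   },
);

# Group rows whose uppercased name matches AND whose uppercased
# address agrees on the first N characters (7 here; tune per list).
my $prefix_len = 7;
my %candidates;
for my $row (@rows) {
    my $key = uc($row->{name}) . '|'
            . substr(uc $row->{address}, 0, $prefix_len);
    push @{ $candidates{$key} }, $row;
}

# Anything sharing a key is flagged for human review.
for my $key (sort keys %candidates) {
    next unless @{ $candidates{$key} } > 1;
    print "Flag for review: $key\n";
    print "    $_->{name} / $_->{address}\n" for @{ $candidates{$key} };
}
```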
Re: De Duping Street Addresses Fuzzily
by punkish (Priest) on Jan 31, 2005 at 22:24 UTC
Re: De Duping Street Addresses Fuzzily
by bgreenlee (Friar) on Feb 01, 2005 at 00:58 UTC
One case you might want to watch out for, which won't be caught by sorting, is "Fifth St." vs. "5th St.". You could clean those up with a simple substitution hash ({ '1st' => 'First', '2nd' => 'Second', ...}), or have a look at the modules Lingua::EN::Numbers and Lingua::EN::Numbers::Ordinate and write something to handle the general case. -b
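A minimal version of the substitution-hash idea, covering only the first few ordinals by hand (for the general case, the CPAN modules named above are the better route):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hand-built map for the common low ordinals; extend as needed.
my %ordinal = (
    '1ST' => 'FIRST',  '2ND' => 'SECOND', '3RD' => 'THIRD',
    '4TH' => 'FOURTH', '5TH' => 'FIFTH',
);

# Uppercase, then rewrite numeric ordinals to their spelled-out
# form so "5th St" and "Fifth St" compare equal.
sub normalize_ordinals {
    my ($address) = @_;
    $address = uc $address;
    $address =~ s/\b(\d+(?:ST|ND|RD|TH))\b/$ordinal{$1} || $1/ge;
    return $address;
}

print normalize_ordinals('123 5th St.'),   "\n";   # 123 FIFTH ST.
print normalize_ordinals('123 Fifth St.'), "\n";   # 123 FIFTH ST.
```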
Re: De Duping Street Addresses Fuzzily
by Grundle (Scribe) on Jan 31, 2005 at 22:47 UTC
As for the rest, all you need to do is match the address (number + street name) and build a hash or some other data structure that stores the accepted spellings of common street names and other address conventions (such as "St.", "Blvd.", etc.). You can have the program run through this hash and transform addresses to a common output, and you should get duplicate output with different primary keys. This is a general overview of your problem; there are obviously going to be some edge cases that come up when you tackle it. It is quite a large problem, but I think that if you can make the DB work for you it will simplify things tremendously.
Re: De Duping Street Addresses Fuzzily
by johnnywang (Priest) on Jan 31, 2005 at 23:53 UTC
Re: De Duping Street Addresses Fuzzily
by patrickrock (Beadle) on Feb 01, 2005 at 00:24 UTC
Re: De Duping Street Addresses Fuzzily
by Plankton (Vicar) on Feb 01, 2005 at 04:57 UTC
Re: De Duping Street Addresses Fuzzily
by Animator (Hermit) on Jan 31, 2005 at 22:27 UTC
What about taking the first number and the second word (plus the first two letters of the third word) and looking at whether that string occurs multiple times? It won't find all duplicates, but it should find some, I guess...
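That crude key can be built in a few lines. This is just a sketch of the heuristic as stated, using the OP's sample addresses:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Rough fuzzy key: first number + second word + first two letters
# of the third word (when present).
sub rough_key {
    my ($address) = @_;
    my @w = split ' ', uc $address;
    my ($number) = ($address =~ /(\d+)/);
    return join '|', $number // '',
                     $w[1]   // '',
                     substr($w[2] // '', 0, 2);
}

# Count occurrences of each key; keys seen more than once are
# candidate duplicates.
my %count;
$count{ rough_key($_) }++ for (
    '555 Some Street',
    '555 Some St.',
    '12 Other Road',
);

print "$_: $count{$_}\n" for grep { $count{$_} > 1 } sort keys %count;
```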
Re: De Duping Street Addresses Fuzzily
by Crian (Curate) on Feb 01, 2005 at 12:19 UTC
We do something similar in our company (for addresses in Germany) as one small part of our job: the addresses get checked for plausibility with techniques similar to those described in the posts above. If you want accurate results, don't expect it to be done quickly; it may well be cheaper to buy a complete solution. Data that comes from humans contains every kind of error and rubbish you can and cannot think of. Especially with a large mass of data (say, 50,000,000 addresses), there will be almost everything in your data -- be warned.
Re: De Duping Street Addresses Fuzzily
by nmerriweather (Friar) on Feb 01, 2005 at 17:30 UTC
Re: De Duping Street Addresses Fuzzily
by mamboking (Monk) on Feb 01, 2005 at 19:18 UTC
Re: De Duping Street Addresses Fuzzily
by gwhite (Friar) on Feb 07, 2005 at 16:08 UTC
Merge/purge deduping services go for about $6.00 per thousand names. A lot of the services can also ZIP+4 your zip codes and standardize all the fields. You can additionally do things like a household match versus a name match, or update zip codes if the names in your list are a couple of years old.
g_White
Re: De Duping Street Addresses Fuzzily
by m-rau (Scribe) on Feb 07, 2005 at 16:49 UTC