Beefy Boxes and Bandwidth Generously Provided by pair Networks
Welcome to the Monastery
 
PerlMonks  

De Duping Street Addresses Fuzzily

by patrickrock (Beadle)
on Jan 31, 2005 at 21:41 UTC ( [id://426737]=perlquestion: print w/replies, xml ) Need Help??

patrickrock has asked for the wisdom of the Perl Monks concerning the following question:

Ok, first off, I am working with a legitimate customer database. Real people who have really ordered stuff from my company, Ingersoll-Rand. Just so you know I am not a spammer. I am working with our sales team to de-dupe their customer database. However the duplicates it is filled with is not: 555 Some Street and 555 Some Street, but instead it might be:


555 Some Street
555 Some St.
555 Some Street, ste 666
555 Some St., Suite 666

etc...

You get the idea.

They will have human eyes with experience with the customers to vet list, but I would like to be able to at least flag the problem children and make their lives easier.

Any hints on how to attack this from a regex point of view?

Thanks in advance, Pat Rock

Replies are listed 'Best First'.
Re: De Duping Street Addresses Fuzzily
by legato (Monk) on Feb 01, 2005 at 04:06 UTC

    Well, the first thing I would do is use the US Postal Service Web API. You need to request permission, but they almost always grant it (there's even a checkbox on the application that says you'll be using it to cleanse address databases). This will allow you to send the addresses as they exist in your DB, and get the USPS normalized "official" address back.

    For example:

    4321 Somewhere St. #305 St. Louis, MO 98765
    Might give you:
    4321 SOMEWHERE ST APT 305 SAINT LOUIS, MO 98765-0123

    Variations on that address should result in the same canonical address. You can then compare to see if there are duplicate canonical addresses with a simple string equality.

    You may need to do a little data cleaning, like removing multiple spaces ($address=~s{\s+}{\x20}g;, for example) before running the compare, but this should catch the vast majority of your duplicates.

    Anima Legato
    .oO all things connect through the motion of the mind

Re: De Duping Street Addresses Fuzzily
by Limbic~Region (Chancellor) on Feb 01, 2005 at 01:01 UTC
    patrickrock,
    I am surprised no one has mentioned Lingua::EN::AddressParse yet. It will not be a 100% solution. As indicated elsewhere in this thread, commercial products such as Group 1 are really good at this. Using the module though, you can probably reduce the amount of work that needs to be done by hand to about 10%. You could use something like Geo::Coder::US to help determine if the address you have is actually valid. A bit more research on CPAN might turn up even more goodies.

    Cheers - L~R

    On further review of this module, it appears the Parse::RecDescent grammar for US addresses could use some TLC. The author, Kim Ryan, appears to be from down under and complex US addresses don't seem to get parsed correctly. I bet someone here can improve it though ;-)

      Ditto for the suggestion of using Lingua::EN::AddressParse. Even if the high level methods don't suit your exact problem space, you may find the back-end code and normalisations useful in putting your data in a more consistent format before sorting/processing. It also handles a number of special cases that occur in the real world.
Re: De Duping Street Addresses Fuzzily
by brian_d_foy (Abbot) on Jan 31, 2005 at 22:46 UTC

    When i was looking at the postal regulations for my periodicals license for The Perl Review, I discovered that there is a whole industry that does just this: give them a list and they give it back to you with duplicates removed and addresses cleaned-up.

    This service comes as a true service (someone else does the work) or off-the-shelf software. Depending on how much work you have to do and how much it is worth to the company, you might want to skip doing this yourself.

    However, if you program it yourself, part of the solution (along with what people have already mentioned) is getting the canonical address. People will often give you their version of their address (for instance, I can't remember if my street is a Road or an Avenue, so I use them interchangeably). The US Postal Service has all sorts of data and tools to help you figure out what it should be based on the zip code. Other post offices have similar things (the Royal Mail address lookup kicks ass). From there you get closer to finding the duplicates than just looking at the address the customer gave you or someone keyed in.

    Good luck and let us know how it turns out.

    --
    brian d foy <bdfoy@cpan.org>
Re: De Duping Street Addresses Fuzzily
by Fletch (Bishop) on Jan 31, 2005 at 22:32 UTC

    First of all you need to define what "fuzzily matching" means. From your sample data I'd say you want "same city/state/zip with the same street name and number", or something close to that. Once you've decided on that, you need to tokenize the address and map what you've got in the DB into a canonical form. Split on spaces and write a little parser with logic something like:

    • Look for street number first
    • Next look for a direction or abbreviation (N., North) and canonicalize it (NORTH)
    • read up anything that doesn't look like a road type (i.e. stop when you see a Street, St., Ave., . . .) and canonicalize that to uppercase
    • canonicalize the road type to the full word uppercased (St. to STREET)
    • If there's anything left like Suite \d+ handle it similarly, or just ignore (depending on what your definition of equals winds up being)
    • Create the canonical key by concatenating all these parts together separated by spaces

    Once you've parsed everything into this canonical form push the real data onto a hash-of-arrays keyed by the canonical form (for large enough data sets you may want to use DB_File or the like). Then the last step is to print the canonical form and the raw data for any keys for which there's more than one entry for the given canonical form.

Re: De Duping Street Addresses Fuzzily
by eyepopslikeamosquito (Archbishop) on Feb 01, 2005 at 01:58 UTC

    I wrote one of these in C years ago, before I knew Perl. We got the job because the mailing house didn't fancy paying hundreds of thousands of dollars for a commercial US address deduplicator that didn't work particularly well on Australian addresses. The job took months, was a fixed price contract, and I think we lost money on that one. I remember that squeezing out high performance when de-duplicating millions of addresses was a challenge.

    The obvious general approach is to parse the addresses into a canonical internal form -- then use that to compare addresses. This sort of software is necessarily riddled with heuristics and ambiguities and can never be perfect -- for example, does "Mr John and Bill Camel" mean "Mr John" and "Mr Bill Camel" or "Mr John Camel" and "Mr Bill Camel"? For performance reasons you can't afford to compare every address with every other one, so you need to break them into "buckets" and compare all addresses in each bucket. How do you choose the buckets? Not sure, but I remember bucketing on post code worked out quite well for us.

    This thread may be of interest: Fuzzy matching of postal addresses on comp.lang.python of 17-jan-2005

    Update: Kim Ryan has years of commercial experience in this field, so I suggest you check out his CPAN modules.

    Update: See also: Re^3: Split first and last names (References on Parsing Names and Addresses)

Re: De Duping Street Addresses Fuzzily
by eric256 (Parson) on Feb 01, 2005 at 00:10 UTC

    Occasionaly I have to de-dup mailing lists we receive from the government. Since my lists often include duplicates of the same address and duplicate names (different addresses) I use a variety of methods.

    First I get the list with identical names and cities (uppercaseing both to avoid case issues). I suppose that probably gets false positives but we would prefer that in our case.

    Then I take all the addresses and pull ones that the first 7 characters match on. It is pretty hard to have the same first 7 and note be a duplicate. Agian that is with full uppercasing of all fields. This also produces false positives but combining it with a name match makes it prety efficient. Depending on the list and the source I vary the number I choose. Normaly choosing a couple numbers and hand sampling until I reach what i feel is a happy medium.

    These methods have helped me narrow 15k addresses down into a more accurate 10k. I think any time you try matching like this you are going to get false positives, but if you are looking for a way to flag some for human intervention then this can be a pretty good test. Best of Luck.

    Sorry I meant to mention that the benifit of these is no regex, just plain old DB functions that are easy and quickly handled. For better results you could create a new colum and do some normalization like suggested by some above.


    ___________
    Eric Hodges
Re: De Duping Street Addresses Fuzzily
by punkish (Priest) on Jan 31, 2005 at 22:24 UTC
    it is a tedious problem, but you could start by splitting each record into its components -- streetnum, streetpredir, streetname, streettype, streetpostdir, etc... There are accepted standard values available. Just search on the web. Then match the split-ted values against your reference lookup table. Once you have done that, you could come up with some kind of scoring to flag the most likely to least likely. Depending on how many records you have to scan and how often you have to do this, as well as how important it is for you to not have false positives/negatives in the match, you can decide if this exercise is worth your time. Otoh, if your employer is paying you for this...
Re: De Duping Street Addresses Fuzzily
by bgreenlee (Friar) on Feb 01, 2005 at 00:58 UTC

    One case you might want to watch out for that won't be caught by sorting is "Fifth St." vs. "5th St.". You could clean those up with a simple substitution hash ({ '1st' => 'First', '2nd' => 'Second', ...}), or have a look at the modules Lingua::EN::Numbers and Lingua::EN::Numbers::Ordinate and write something to handle the general case.

    -b

Re: De Duping Street Addresses Fuzzily
by Grundle (Scribe) on Jan 31, 2005 at 22:47 UTC
    Honestly I think that a huge part of this problem could be solved by using the Database as effectively as possible. I am hoping that you are using a standard DB such as Oracle, MySQL, etc. The info, then, that you want to search through can be sorted using an "ORDER BY" statement on the address number and street name. This will give a sorted list that will basically group all the dupes next to each other.

    As for the rest, all you need to do is match address (number + street name) and make a hash or some other data structure to store the accepted spellings of common street names or other address conventions. (such ast st. blvd. etcetera). You can have the program run through this hash and transform them to a common output and you should get duplicate output, with different primary keys.

    This is a general overview of your problem. There are obviously going to be some edge cases come up when you tackle this problem. This is quite a large problem, but I think that if you are able to have the DB work for you it will simplify things tremendously.
Re: De Duping Street Addresses Fuzzily
by johnnywang (Priest) on Jan 31, 2005 at 23:53 UTC
    I'd treat them as a sorting problem, at least that way, it's easier to see candidates are grouped together. If we can assume every record has a zip code, then you can sort first by zip, and then within the zip, sort by street number and street name (leaving out the str., blvd). Once that's done, it will give you a better idea how to proceed, i.e., what kind of ambiguities people use, and probably do some mapping of the street", "blvd", etc.
Re: De Duping Street Addresses Fuzzily
by patrickrock (Beadle) on Feb 01, 2005 at 00:24 UTC
    guys. i appreciate your help. if nothing else you have helped me really break the problem into more logical steps. I was just really overwhelmed first and foremost by the fact that there are 100,000 addresses in total. But you are right, between the DB (oracle 8) and some simple perl I can give them something they should be able to work with.
Re: De Duping Street Addresses Fuzzily
by Plankton (Vicar) on Feb 01, 2005 at 04:57 UTC
    You could always just buy a solution. I used this in a previous job. Worked pretty good actually. http://www.intelligentsearch.com.
Re: De Duping Street Addresses Fuzzily
by Animator (Hermit) on Jan 31, 2005 at 22:27 UTC

    What about taking the first number, the second word (and the first two letters of the thrid word) and look if that string occures multiples times.

    It won't find all duplicates, but it should find some I guess...

Re: De Duping Street Addresses Fuzzily
by Crian (Curate) on Feb 01, 2005 at 12:19 UTC

    We are doing something simular in our company (for addresses is germany) as one small part of our jobs. Therefore the addresses get plausibilised (if I can say this in this way(?)) with some kind of technic simulare to the technics described in the posts above.

    If you want accurate results, don't think it's fast done. Perhaps it could be cheaper to buy a complete solution.

    Data that comes from humans contains all kind of errors and rubish you can and can not think of. Specially if it's a large mass of data (such as 50000000 addresses or else), there will be almost everything in your data - be aware.

Re: De Duping Street Addresses Fuzzily
by nmerriweather (Friar) on Feb 01, 2005 at 17:30 UTC
    We did this in our office for a client once... First, we normalized the database to USPS standardized text, based on the chart of accepted postal abbreviations: http://www.usps.com/ncsc/lookups/usps_abbreviations.html That alone turned probably 90% of the potential dupes into exact duplicates as for the rest -- if there are multiple apartments/suites/floors on the same address, we suggested that they send that anyways -- there isn't any real way to tell that address isn't a separate entity. could be neighbors or co-workers who should each have gotten their own postcard.
Re: De Duping Street Addresses Fuzzily
by mamboking (Monk) on Feb 01, 2005 at 19:18 UTC
    Here's some code I used to parse addresses. Maybe it will help.
    package AddressParser; use DBI qw(:sql_types); use DBUtil; use Utils; use Log::Log4perl qw(get_logger); use strict; use warnings; my $us_zip_re = qr/^\s*(([\w.]+\W)+)\s*([a-zA-Z]{2})\s*((\d{5})[ -]?(\ +d{4})?)(\d{2})?\s*$/; my $can_post_code_re = qr/^\s*((\w+\W)+)\s*([a-zA-Z]{2})\s+([a-zA-Z]\d +[a-zA-Z][ -]?\d[a-zA-Z]\d)\s*$/; my $foreign_canada_re = qr/^\s*((\w+\W)+)\s*([a-zA-Z]+)\s+CANADA\s+([a +-zA-Z]\d[a-zA-Z][ -]?\d[a-zA-Z]\d)\s*$/i; my $foreign_romania_re = qr/^\s*((\w+\W)+)\s*ROMANIA\s+(\d{4,}[ -]?(\d +{4})?)\s*$/i; my $foreign_france_re = qr/^\s*((\w+\W)+)\s*FRANCE\s+(\d{4,}[ -]?(\d{4 +})?)\s*$/i; my $street_re = qr/^\s*([a-zA-Z#]+\s*\d+\s+)?([a-zA-Z]?\d+[a-zA-Z]?)\s ++(\S.*)$/; my $po_box_re = qr/^\s*((\w+\W+)*\s*[bB][oO][xX])\s*(\d+)/; #--------------------------------------------------------------------- # Constructor #--------------------------------------------------------------------- sub new { my $self = { _state_table => undef }; my ($class, %arg) = @_; bless $self, $class; # $self->_init_state_table; return $self; } sub _init_state_table { my ($self) = @_; my $dbh; my $sth; my $query; my @row; my $key; my $name; $query = <<QUERY_END select abbrev, name from state QUERY_END ; $dbh = DBUtil::fetch_connection('shared'); $sth = $dbh->prepare($query); $sth->execute; while (@row = $sth->fetchrow_array) { ($key, $name) = @row; $self->{_state_table}{$key} = $name; } $sth->finish; } sub insert_zip { my ($self, $beta_address) = @_; my $zipcd; my $line; my @lines; ($zipcd, @lines) = @$beta_address; if (!defined($zipcd)) { $zipcd = ''; } $zipcd =~ s/^\s+//; $zipcd =~ s/\s+$//; for (my $i=0; $i < scalar(@lines); $i++) { if (defined($lines[$i])) { $lines[$i] =~ s/ZIPCD/$zipcd/; } } return \@lines; } sub find_address_lines { my ($self, $lines_ref) = @_; my $line; my $line_index; my $street_line_index = -1; my $zip_line_index = -1; $line_index = scalar(@$lines_ref); foreach $line (reverse(@$lines_ref)) { if (defined($line)) { # Find the street line, but only if we've already found the zip +line if (($zip_line_index > -1) && (($line =~ /$street_re/) || ($line =~ /$po_box_re/))) { $street_line_index = $line_index - 1; last; } if (($line =~ /$us_zip_re/) || ($line =~ /$can_post_code_re/) || ($line =~ /$foreign_canada_re/) || ($line =~ /$foreign_romania_re/) || ($line =~ /$foreign_france_re/)) { $zip_line_index = $line_index - 1; } } $line_index--; } if ($zip_line_index == -1) { get_logger()->error("Couldn't find zip code"); } if ($street_line_index == -1) { get_logger()->error("Couldn't find street"); } if (($zip_line_index == -1) || ($street_line_index == -1)) { get_logger()->error("\n\t" . join("\n\t", (map { defined($_) ? $_ +: '' } @$lines_ref))); return (undef, undef); } else { return (@$lines_ref[$street_line_index], @$lines_ref[$zip_line_ind +ex]); } } sub parse { my ($self, $street_line, $city_line) = @_; my $address = Address->new(); if ($street_line =~ /$street_re/) { $address->set('street_number', $2); $address->set('street_name', $3); } elsif ($street_line =~ /$po_box_re/) { $address->set('street_number', $3); $address->set('street_name', $1); } else { get_logger()->error("Couldn't parse street: $street_line"); return undef; } if ($city_line =~ /$us_zip_re/) { $address->set('city', $1); $address->set('state', $3); $address->set('zip', $4); $address->set('country', 'US'); } elsif ($city_line =~ /$can_post_code_re/) { $address->set('city', $1); $address->set('state', $3); $address->set('zip', $4); $address->set('country', 'CA'); } elsif ($city_line =~ /$foreign_canada_re/) { $address->set('city', $1); $address->set('zip', $4); $address->set('country', 'CA'); } elsif ($city_line =~ /$foreign_romania_re/) { $address->set('city', $1); $address->set('zip', $3); $address->set('country', 'RO'); } elsif ($city_line =~ /$foreign_france_re/) { $address->set('city', $1); $address->set('zip', $3); $address->set('country', 'FR'); } else { get_logger()->error("Couldn't parse city: $city_line"); return undef; } return $address; } sub parse_address { my ($self, $address_lines, $address_type) = @_; my $street_line; my $city_line; if (defined($address_type) && ($address_type eq 'BETA')) { $address_lines = $self->insert_zip($address_lines); } ($street_line, $city_line) = $self->find_address_lines($address_line +s); if (!defined($street_line) || !defined($city_line)) { return undef; } return $self->parse($street_line, $city_line); } 1;
Re: De Duping Street Addresses Fuzzily
by gwhite (Friar) on Feb 07, 2005 at 16:08 UTC

    Merge/purge deduping services go for about $6.00/thousand names. Plus a lot of the services can zip+4 your zipcodes and standardize all the fields. You can also do things like a household match versus name match, update zip codes if the names in your list are a couple of years old.

    I would pay the $600.00 a let the pros do it.

    g_White
Re: De Duping Street Addresses Fuzzily
by m-rau (Scribe) on Feb 07, 2005 at 16:49 UTC
    There are severe problems to identify typos. You might want to use Text::Levenshtein to identify typos. But I really do not know if this works. The module implements the Levenshtein edit distance, a measure of the degree of proximity between two strings. The distance is the number of substituations, deletions or insertions (edits) needed to transform one string into the other one (and vice versa). Of course, you can use this after having cleansed the data, only (fifth => 5th, etc.).
A reply falls below the community's threshold of quality. You may see it by logging in.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://426737]
Approved by Corion
Front-paged by diotalevi
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others learning in the Monastery: (5)
As of 2024-04-16 11:11 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found