patrickrock has asked for the wisdom of the Perl Monks concerning the following question:
555 Some Street
555 Some St.
555 Some Street, ste 666
555 Some St., Suite 666
etc...
You get the idea.
The list will eventually be vetted by human eyes with customer experience, but I would like to at least flag the problem children and make their lives easier.
Any hints on how to attack this from a regex point of view?
Thanks in advance, Pat Rock
Replies are listed 'Best First'.
Re: De Duping Street Addresses Fuzzily
by legato (Monk) on Feb 01, 2005 at 04:06 UTC
Well, the first thing I would do is use the US Postal Service Web API. You need to request permission, but they almost always grant it (there's even a checkbox on the application that says you'll be using it to cleanse address databases). This lets you send the addresses as they exist in your DB and get the USPS-normalized "official" address back.
Variations on an address should result in the same canonical address. You can then check for duplicate canonical addresses with a simple string-equality compare. You may need to do a little data cleaning first, like collapsing multiple spaces ($address =~ s{\s+}{\x20}g;, for example), but this should catch the vast majority of your duplicates. Anima Legato
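The compare step above can be sketched in a few lines. This is a minimal illustration with made-up sample data (the real input would be the normalized addresses coming back from the USPS API); the only cleanup shown is whitespace collapsing and uppercasing:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Clean up an already-normalized address so trivially different
# strings compare equal: uppercase, collapse whitespace, trim ends.
sub clean {
    my ($address) = @_;
    $address = uc $address;
    $address =~ s{\s+}{\x20}g;      # collapse runs of whitespace
    $address =~ s{^\s+|\s+$}{}g;    # trim leading/trailing space
    return $address;
}

# Hypothetical sample data standing in for API output.
my @addresses = (
    '555 SOME ST  STE 666',
    ' 555 Some St Ste 666',
    '123 Main St',
);

# Bucket records by their cleaned form; any bucket with more than
# one entry holds candidate duplicates.
my %seen;
push @{ $seen{ clean($_) } }, $_ for @addresses;

for my $canon (sort keys %seen) {
    next unless @{ $seen{$canon} } > 1;
    print "Possible duplicates for '$canon':\n";
    print "  $_\n" for @{ $seen{$canon} };
}
```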
Re: De Duping Street Addresses Fuzzily
by Limbic~Region (Chancellor) on Feb 01, 2005 at 01:01 UTC
I am surprised no one has mentioned Lingua::EN::AddressParse yet. It will not be a 100% solution. As indicated elsewhere in this thread, commercial products such as Group 1 are really good at this. Using the module, though, you can probably reduce the amount of work that needs to be done by hand to about 10%. You could use something like Geo::Coder::US to help determine whether an address you have is actually valid. A bit more research on CPAN might turn up even more goodies. Cheers - L~R

Update: On further review of this module, it appears the Parse::RecDescent grammar for US addresses could use some TLC. The author, Kim Ryan, appears to be from down under, and complex US addresses don't seem to get parsed correctly. I bet someone here can improve it though ;-)
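A rough sketch of how the module might be used for this. The constructor options and method names below follow Lingua::EN::AddressParse's documented interface as I remember it; check the current CPAN docs before relying on them, and treat the sample addresses as made up:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Lingua::EN::AddressParse;   # CPAN module, not core

# Option names per the module docs (verify against your version).
my $parser = Lingua::EN::AddressParse->new(
    country    => 'US',
    auto_clean => 1,
    force_case => 1,
);

for my $line ('555 Some Street Anytown CA 90210',
              '555 Some St. Anytown CA 90210') {
    my $error = $parser->parse($line);
    if ($error) {
        # Unparseable records are exactly the "problem children"
        # worth flagging for human review.
        warn "Could not parse, flag for review: $line\n";
        next;
    }
    # Join the parsed components into a canonical key; addresses that
    # differ only in abbreviation should collapse together.
    my %comp = $parser->components;
    my $key  = join '|', map { uc $comp{$_} }
               grep { defined $comp{$_} && length $comp{$_} }
               sort keys %comp;
    print "$key <= $line\n";
}
```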
by Anonymous Monk on Feb 01, 2005 at 01:46 UTC
Re: De Duping Street Addresses Fuzzily
by brian_d_foy (Abbot) on Jan 31, 2005 at 22:46 UTC
When I was looking at the postal regulations for my periodicals license for The Perl Review, I discovered that there is a whole industry that does just this: give them a list and they give it back to you with duplicates removed and addresses cleaned up. This comes as a true service (someone else does the work) or as off-the-shelf software. Depending on how much work you have to do and how much it is worth to the company, you might want to skip doing this yourself.

However, if you program it yourself, part of the solution (along with what people have already mentioned) is getting the canonical address. People will often give you their own version of their address (for instance, I can't remember if my street is a Road or an Avenue, so I use them interchangeably). The US Postal Service has all sorts of data and tools to help you figure out what it should be based on the zip code. Other post offices have similar things (the Royal Mail address lookup kicks ass). From there you get closer to finding the duplicates than by just looking at the address the customer gave you or someone keyed in. Good luck, and let us know how it turns out.
-- brian d foy <bdfoy@cpan.org>
Re: De Duping Street Addresses Fuzzily
by Fletch (Bishop) on Jan 31, 2005 at 22:32 UTC
First of all you need to define what "fuzzily matching" means. From your sample data I'd say you want "same city/state/zip with the same street name and number", or something close to that. Once you've decided on that, you need to tokenize the address and map what you've got in the DB into a canonical form: split on spaces and write a little parser that rewrites abbreviations ("St.", "Ste") to canonical words. Once you've parsed everything into this canonical form, push the real data onto a hash-of-arrays keyed by the canonical form (for large enough data sets you may want to use DB_File or the like). The last step is to print the canonical form and the raw data for any key with more than one entry for the given canonical form.
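The tokenize-then-bucket approach above might look something like this. The abbreviation table is a tiny illustrative stub, not an exhaustive one, and the sample addresses are the OP's:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Abbreviation table: extend as needed for your data.
my %canon_word = (
    'ST'  => 'STREET', 'ST.'  => 'STREET',
    'AVE' => 'AVENUE', 'AVE.' => 'AVENUE',
    'STE' => 'SUITE',  'STE.' => 'SUITE',
);

# Tokenize, strip trailing commas, map abbreviations, rejoin.
sub canonical {
    my ($address) = @_;
    my @tokens = split ' ', uc $address;
    s/,$// for @tokens;
    @tokens = map { $canon_word{$_} // $_ } @tokens;
    return join ' ', @tokens;
}

my @raw = (
    '555 Some Street',
    '555 Some St.',
    '555 Some Street, ste 666',
    '555 Some St., Suite 666',
);

# Hash-of-arrays keyed by canonical form; for very large data sets,
# tie %dups to DB_File or similar instead of keeping it in memory.
my %dups;
push @{ $dups{ canonical($_) } }, $_ for @raw;

# Report any canonical form with more than one raw entry.
for my $key (sort keys %dups) {
    next unless @{ $dups{$key} } > 1;
    print "$key\n";
    print "    $_\n" for @{ $dups{$key} };
}
```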
Re: De Duping Street Addresses Fuzzily
by eyepopslikeamosquito (Archbishop) on Feb 01, 2005 at 01:58 UTC
I wrote one of these in C years ago, before I knew Perl. We got the job because the mailing house didn't fancy paying hundreds of thousands of dollars for a commercial US address deduplicator that didn't work particularly well on Australian addresses. The job took months, was a fixed-price contract, and I think we lost money on that one. I remember that squeezing out high performance when de-duplicating millions of addresses was a challenge.

The obvious general approach is to parse the addresses into a canonical internal form, then use that to compare addresses. This sort of software is necessarily riddled with heuristics and ambiguities and can never be perfect -- for example, does "Mr John and Bill Camel" mean "Mr John" and "Mr Bill Camel" or "Mr John Camel" and "Mr Bill Camel"?

For performance reasons you can't afford to compare every address with every other one, so you need to break them into "buckets" and compare all addresses within each bucket. How do you choose the buckets? Not sure, but I remember bucketing on post code worked out quite well for us.

This thread may be of interest: Fuzzy matching of postal addresses on comp.lang.python of 17-jan-2005

Update: Kim Ryan has years of commercial experience in this field, so I suggest you check out his CPAN modules.

Update: See also: Re^3: Split first and last names (References on Parsing Names and Addresses)
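The bucketing idea above can be sketched as follows. The record layout and sample data are invented for illustration; the point is that each record is compared only against the much smaller set sharing its post code, so the cost is the sum of squared bucket sizes rather than n squared over the whole file:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical records: [id, address, post_code].
my @records = (
    [1, '555 Some Street', '90210'],
    [2, '555 Some St.',    '90210'],
    [3, '12 Other Road',   '10001'],
);

# Bucket by post code.
my %bucket;
push @{ $bucket{ $_->[2] } }, $_ for @records;

# Compare all pairs within each bucket only. The print here stands
# in for whatever fuzzy comparison you use (canonical-form equality,
# edit distance, ...).
for my $zip (sort keys %bucket) {
    my @in_bucket = @{ $bucket{$zip} };
    for my $i (0 .. $#in_bucket - 1) {
        for my $j ($i + 1 .. $#in_bucket) {
            print "compare $in_bucket[$i][0] vs $in_bucket[$j][0] in $zip\n";
        }
    }
}
```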
Re: De Duping Street Addresses Fuzzily
by eric256 (Parson) on Feb 01, 2005 at 00:10 UTC
Occasionally I have to de-dup mailing lists we receive from the government. Since my lists often include duplicates of the same address and duplicate names (at different addresses), I use a variety of methods. First I get the list with identical names and cities (uppercasing both to avoid case issues). That probably produces some false positives, but in our case we prefer that. Then I take all the addresses and pull the ones that match on the first 7 characters. It is pretty hard to have the same first 7 characters and not be a duplicate. Again, that is with full uppercasing of all fields. This also produces false positives, but combining it with a name match makes it pretty efficient. Depending on the list and the source I vary the number of characters, normally trying a couple of values and hand-sampling until I reach what I feel is a happy medium. These methods have helped me narrow 15k addresses down into a more accurate 10k.

I think any time you try matching like this you are going to get false positives, but if you are looking for a way to flag some records for human intervention then this can be a pretty good test. Best of luck.

Sorry, I meant to mention that the benefit of these methods is that there is no regex involved, just plain old DB functions that are easy and quickly handled. For better results you could create a new column and do some normalization as suggested above. ___________ Eric Hodges
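The name-plus-prefix test described above might be sketched like this. The rows are made-up sample data, and the prefix length of 7 is the tunable knob mentioned in the post:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical mailing-list rows.
my @rows = (
    { name => 'Pat Rock', address => '555 Some Street' },
    { name => 'PAT ROCK', address => '555 Som St.'     },
    { name => 'Jane Doe', address => '12 Other Road'   },
);

# Group rows whose uppercased name matches AND whose uppercased
# address agrees on the first N characters (7 here; tune per list).
my $prefix_len = 7;
my %candidates;
for my $row (@rows) {
    my $key = uc($row->{name}) . '|'
            . substr(uc $row->{address}, 0, $prefix_len);
    push @{ $candidates{$key} }, $row;
}

# Anything sharing a key is flagged for human review.
for my $key (sort keys %candidates) {
    next unless @{ $candidates{$key} } > 1;
    print "Flag for review: $key\n";
    print "    $_->{name} / $_->{address}\n" for @{ $candidates{$key} };
}
```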
Re: De Duping Street Addresses Fuzzily
by punkish (Priest) on Jan 31, 2005 at 22:24 UTC
Re: De Duping Street Addresses Fuzzily
by bgreenlee (Friar) on Feb 01, 2005 at 00:58 UTC
One case you might want to watch out for, which won't be caught by sorting, is "Fifth St." vs. "5th St.". You could clean those up with a simple substitution hash ({ '1st' => 'First', '2nd' => 'Second', ...}), or have a look at the modules Lingua::EN::Numbers and Lingua::EN::Numbers::Ordinate and write something to handle the general case. -b
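A minimal version of the substitution-hash idea, covering only the first few ordinals by hand (for the general case, the CPAN modules named above are the better route):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hand-built map for the common low ordinals; extend as needed.
my %ordinal = (
    '1ST' => 'FIRST',  '2ND' => 'SECOND', '3RD' => 'THIRD',
    '4TH' => 'FOURTH', '5TH' => 'FIFTH',
);

# Uppercase, then rewrite numeric ordinals to their spelled-out
# form so "5th St" and "Fifth St" compare equal.
sub normalize_ordinals {
    my ($address) = @_;
    $address = uc $address;
    $address =~ s/\b(\d+(?:ST|ND|RD|TH))\b/$ordinal{$1} || $1/ge;
    return $address;
}

print normalize_ordinals('123 5th St.'),   "\n";   # 123 FIFTH ST.
print normalize_ordinals('123 Fifth St.'), "\n";   # 123 FIFTH ST.
```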
Re: De Duping Street Addresses Fuzzily
by Grundle (Scribe) on Jan 31, 2005 at 22:47 UTC
As for the rest, all you need to do is match the address (number + street name) and build a hash or some other data structure that stores the accepted spellings of common street names and other address conventions (such as "St.", "Blvd.", etc.). You can have the program run through this hash and transform addresses to a common output, and you should get duplicate output with different primary keys. This is a general overview of your problem; there are obviously going to be some edge cases that come up when you tackle it. It is quite a large problem, but I think that if you can make the DB work for you it will simplify things tremendously.
Re: De Duping Street Addresses Fuzzily
by johnnywang (Priest) on Jan 31, 2005 at 23:53 UTC
Re: De Duping Street Addresses Fuzzily
by patrickrock (Beadle) on Feb 01, 2005 at 00:24 UTC
Re: De Duping Street Addresses Fuzzily
by Plankton (Vicar) on Feb 01, 2005 at 04:57 UTC
Re: De Duping Street Addresses Fuzzily
by Animator (Hermit) on Jan 31, 2005 at 22:27 UTC
What about taking the first number and the second word (plus the first two letters of the third word) and looking at whether that string occurs multiple times? It won't find all duplicates, but it should find some, I guess...
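That crude key can be built in a few lines. This is just a sketch of the heuristic as stated, using the OP's sample addresses:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Rough fuzzy key: first number + second word + first two letters
# of the third word (when present).
sub rough_key {
    my ($address) = @_;
    my @w = split ' ', uc $address;
    my ($number) = ($address =~ /(\d+)/);
    return join '|', $number // '',
                     $w[1]   // '',
                     substr($w[2] // '', 0, 2);
}

# Count occurrences of each key; keys seen more than once are
# candidate duplicates.
my %count;
$count{ rough_key($_) }++ for (
    '555 Some Street',
    '555 Some St.',
    '12 Other Road',
);

print "$_: $count{$_}\n" for grep { $count{$_} > 1 } sort keys %count;
```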
Re: De Duping Street Addresses Fuzzily
by Crian (Curate) on Feb 01, 2005 at 12:19 UTC
We do something similar in our company (for addresses in Germany) as one small part of our job: the addresses get checked for plausibility with techniques similar to those described in the posts above. If you want accurate results, don't expect it to be done quickly; it may well be cheaper to buy a complete solution. Data that comes from humans contains every kind of error and rubbish you can and cannot think of. Especially with a large mass of data (say, 50,000,000 addresses), there will be almost everything in your data -- be warned.
Re: De Duping Street Addresses Fuzzily
by nmerriweather (Friar) on Feb 01, 2005 at 17:30 UTC
Re: De Duping Street Addresses Fuzzily
by mamboking (Monk) on Feb 01, 2005 at 19:18 UTC
Re: De Duping Street Addresses Fuzzily
by gwhite (Friar) on Feb 07, 2005 at 16:08 UTC
Merge/purge deduping services go for about $6.00 per thousand names. A lot of the services can also ZIP+4 your zip codes and standardize all the fields. You can additionally do things like a household match versus a name match, or update zip codes if the names in your list are a couple of years old.
g_White
Re: De Duping Street Addresses Fuzzily
by m-rau (Scribe) on Feb 07, 2005 at 16:49 UTC