PerlMonks  

Efficient Fuzzy Matching Of An Address

by Limbic~Region (Chancellor)
on Aug 19, 2008 at 17:05 UTC ( [id://705271] )

Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:

All,

Assume you have a database full of addresses.
123 Main St.
Somewhere, USA 12345

Assume that the address is stored in a number of forms:

  • Exactly as entered
  • Standardized via Lingua::EN::AddressParse
  • Encoded as soundex via Text::Soundex (exact/standardized)
  • Encoded as metaphone via Text::Metaphone (exact/standardized)
  • Broken out into individual pieces (street, city, zip, etc)
  • Possibly each individual piece encoded as described above (exact/standardized)

The problem I am trying to solve is this: given an address as input, find any "similar" addresses in the DB without comparing the input against every address (via String::Approx or Text::Levenshtein, for instance). Using the encoding routines alone seems to produce a lot of false positives and false negatives.
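
To illustrate the false-positive/false-negative problem with phonetic encodings, here is a minimal pure-Perl sketch of classic American Soundex (a stand-in for Text::Soundex, which may not be installed everywhere): a misspelled street name collides as hoped, but a common abbreviation does not.

```perl
use strict;
use warnings;

# Minimal American Soundex (illustrative stand-in for Text::Soundex).
sub soundex {
    my ($word) = @_;
    my %code = (
        B => 1, F => 1, P => 1, V => 1,
        C => 2, G => 2, J => 2, K => 2, Q => 2, S => 2, X => 2, Z => 2,
        D => 3, T => 3,
        L => 4,
        M => 5, N => 5,
        R => 6,
    );
    my @letters = grep { /[A-Z]/ } split //, uc $word;
    return '' unless @letters;
    my $first = shift @letters;
    my $prev  = $code{$first} || 0;
    my @digits;
    for my $l (@letters) {
        my $c = $code{$l} || 0;
        push @digits, $c if $c && $c != $prev;
        # H and W are transparent; vowels break a run of equal codes
        $prev = $c unless $l =~ /[HW]/;
    }
    return $first . substr(join('', @digits) . '000', 0, 3);
}

# A typo still matches (good) ...
print soundex('Broadway'), ' ', soundex('Brodway'), "\n";   # B630 B630
# ... but an abbreviation does not (false negative)
print soundex('Avenue'), ' ', soundex('Av'), "\n";          # A150 A100
```

So even before comparing full addresses, the encodings need to be applied to standardized tokens, not raw input.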

Does anyone with expertise in this area have some advice? I am sorry, I don't have any real data I can share.

Cheers - L~R

Replies are listed 'Best First'.
Re: Efficient Fuzzy Matching Of An Address
by BrowserUk (Patriarch) on Aug 19, 2008 at 18:10 UTC

    I think the mechanism I describe in the subthread starting at Re^3: Comparing text documents could be adapted to this purpose, depending upon what your actual goal is.

    When you say "Given an address as input, find any "similar" addresses in the DB", you're not trying to (for example) locate next-door neighbours, but rather to locate duplicates with minor typos or transcription errors?


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      BrowserUk,
      When you say "Given an address as input, find any "similar" addresses in the DB", you're not trying to (for example) locate next-door neighbours, but rather to locate duplicates with minor typos or transcription errors?

      The latter (typos and transcription errors), with a caveat. If the input is '123 Main St' and the DB has a record of '356 Main Street', I would hope that the search wouldn't return "no results found". On the other hand, if the input was '12 Elm Ave' (two houses over), then the search should definitely not find a match. Ultimately, I would like to return the best N matches, ordered by degree of similarity.

      Thanks for the pointer to Re^3: Comparing text documents. If I get really motivated, I will try it out.

      Cheers - L~R

Re: Efficient Fuzzy Matching Of An Address
by eyepopslikeamosquito (Archbishop) on Aug 19, 2008 at 18:34 UTC

    I faced a similar problem years ago (the program was written in C) when processing millions of addresses to eliminate duplicates for a mailing house. It wasn't feasible to compare every address with every other address. What I ended up doing was partitioning the addresses into "buckets" by postcode and comparing all addresses with all other addresses in the same bucket (i.e. with the same postcode) only. To ensure reasonable reliability of postcodes, I had another program to compare suburb names with postcodes, looking for errors.
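
    eyepopslikeamosquito's bucketing scheme can be sketched in a few lines of Perl. This is only an illustration, not the original C program: the record layout, the `candidate_pairs` helper, and the edit-distance threshold are made up here, and a pure-Perl Levenshtein is inlined in case Text::Levenshtein isn't available.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Classic dynamic-programming edit distance (pure Perl, in lieu of
# Text::Levenshtein, which may not be installed everywhere).
sub levenshtein {
    my ($s, $t) = @_;
    my @prev = (0 .. length $t);
    for my $i (1 .. length $s) {
        my @cur = ($i);
        for my $j (1 .. length $t) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            push @cur, min($prev[$j] + 1, $cur[$j - 1] + 1, $prev[$j - 1] + $cost);
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Bucket addresses by postcode, then compare only within each bucket.
# Each address is a hashref like { street => '123 Main St', zip => '12345' }.
sub candidate_pairs {
    my ($addresses, $max_dist) = @_;
    my %bucket;
    push @{ $bucket{ $_->{zip} } }, $_ for @$addresses;
    my @pairs;
    for my $zip (keys %bucket) {
        my @in = @{ $bucket{$zip} };
        for my $i (0 .. $#in) {
            for my $j ($i + 1 .. $#in) {
                push @pairs, [ $in[$i], $in[$j] ]
                    if levenshtein($in[$i]{street}, $in[$j]{street}) <= $max_dist;
            }
        }
    }
    return @pairs;
}

my @db = (
    { street => '123 Main St', zip => '12345' },
    { street => '123 Mian St', zip => '12345' },   # transposition typo
    { street => '12 Elm Ave',  zip => '12345' },
    { street => '123 Main St', zip => '54321' },   # lands in a different bucket
);
printf "%s <=> %s\n", $_->[0]{street}, $_->[1]{street}
    for candidate_pairs(\@db, 2);
```

    The quadratic comparison cost now applies only within each postcode bucket, which is what makes the approach feasible for millions of rows.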

      eyepopslikeamosquito,
      What I ended up doing was breaking up the addresses into "buckets" by postcode and comparing all addresses with all other addresses in the same bucket only. To ensure reasonable reliability of postcodes, I had another program to compare suburb names with postcodes, looking for errors.

      Let's say my input address is:

      123 Broadway Av
      Brownville, ME 04415

      Let's say the correct best DB address to match is

      123 Brodway Avenue
      Brownville Junction, ME 04451

      How would your approach work?

      Cheers - L~R

        It was a long time ago, but I'll try to remember how this address de-duplication program operated. The addresses were first parsed into an "internal (canonical) form": for example, "Av" and "Avenue" above would parse to the same internal number. The "internal forms" were compared, not the original text. Addresses that were too weird to be parsed were set aside for inspection by a human. As part of the parsing, suburbs were checked against an Australian postcode database and errors reported, so the postcodes were cleaned (sometimes by a human). Suburb names were then ignored and the postcodes treated as the more reliable field (people in Australia sometimes lie about their suburb name because some suburbs are "posher" than others).
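
        The "internal (canonical) form" idea can be sketched like this in Perl. The abbreviation map below is a made-up toy subset (the real USPS suffix table, Publication 28, has hundreds of entries), and `normalize_street` is a hypothetical helper name:

```perl
use strict;
use warnings;

# Toy canonicalizer: lowercase, strip punctuation, expand common
# suffix abbreviations (illustrative subset only).
my %canon = (
    av  => 'avenue',    ave  => 'avenue',
    st  => 'street',    str  => 'street',
    rd  => 'road',
    apt => 'apartment', flat => 'apartment',
);

sub normalize_street {
    my ($addr) = @_;
    my @words = map { s/[^a-z0-9]//gr } split /\s+/, lc $addr;
    return join ' ', map { $canon{$_} || $_ } grep { length } @words;
}

print normalize_street('123 Broadway Av.'), "\n";      # 123 broadway avenue
print normalize_street('123 BROADWAY Avenue'), "\n";   # 123 broadway avenue
```

        Comparing the canonical forms rather than the raw text makes "Av" versus "Avenue" a non-issue before any fuzzy matching happens.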

        There was also a user-configurable "number of transcription errors" allowed; if you allowed one transcription error, then "Broadway" and "Brodway" above would be considered a match. There was another parameter for transposition errors (e.g. "Gergory" versus "Gregory") and many other heuristics, like "Flat 6" versus "Flat 6A" versus "Apartment 6". Update: Perhaps a better technique, which I didn't know about at the time and as suggested by metaperl, is Jaro-Winkler distance. I remember trying classical Knuth Soundex and found it to be not useful in this domain.

        One of the many imperfect and sometimes amusing heuristics I remember was "Captain Cook Cruises": is this a gentleman with a title of Captain, first name Cook, last name Cruises, or is it a company? Sydneysiders will know the likely answer :-). How about "Mr William and Mrs Melinda Gates": is this "Mr William" and "Mrs Melinda Gates" or "Mr William Gates" and "Mrs Melinda Gates"?

        Update: No matter what (imperfect) heuristics you use for the address parsing and matching, you still face the infeasibility of comparing each address with every other one, as n becomes large. An obvious way to tackle that combinatorial explosion is to partition the addresses into "buckets" and compare only addresses within the same bucket. While various "bucketing" schemes are possible, the best/simplest I found was to bucket on postcode.

Re: Efficient Fuzzy Matching Of An Address
by jimX11 (Friar) on Aug 20, 2008 at 05:32 UTC

    We geo code all addresses then use a radius as the first filter then apply other heuristics to filter after that.
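
    The geocode-then-radius first filter can be sketched with the haversine formula. Everything here is illustrative (the field names, the `within_radius` helper, and the mean Earth radius of 6371 km are assumptions, not jimX11's actual code):

```perl
use strict;
use warnings;

# Great-circle distance in km via the haversine formula (Perl builtins only).
sub haversine_km {
    my ($lat1, $lon1, $lat2, $lon2) = @_;
    my $pi   = 4 * atan2(1, 1);
    my $rad  = $pi / 180;
    my $dlat = ($lat2 - $lat1) * $rad;
    my $dlon = ($lon2 - $lon1) * $rad;
    my $a = sin($dlat / 2)**2
          + cos($lat1 * $rad) * cos($lat2 * $rad) * sin($dlon / 2)**2;
    return 2 * 6371 * atan2(sqrt($a), sqrt(1 - $a));   # mean Earth radius ~6371 km
}

# First filter: keep only DB addresses geocoded within $km of the input.
sub within_radius {
    my ($lat, $lon, $km, @candidates) = @_;
    return grep { haversine_km($lat, $lon, $_->{lat}, $_->{lon}) <= $km } @candidates;
}
```

    The survivors of the radius filter can then be handed to the more expensive string-similarity heuristics.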

    Geo coding raises other issues. For example, some geo coding engines return multiple lat/long pairs (for one address).

    We deal with only about 2,500 addresses from 4 data sources with quarterly updates and about a 10% turnover rate. We use human review. We set a strict "similarity" level then generate a list of match candidates. Humans review this list and evaluate as a match or non-match. We then loosen the similarity requirement and generate another candidate list (but filter any non-matches found by the first filter). That candidate list is evaluated by humans. Loosen, (rinse) and repeat.

    After matching comes more fun: merging. Put another way, once you have a match that is not exact, how do you reconcile the differences? Given two different values for the same field, which is right?

      jimX11,
      I assume that if your geocoding fails, you have a human do the review? In other words, when someone provides an invalid address due to a typo, for instance.

      That is exactly what I am doing. I am just trying to make the human's job easier. I want them to pull out the most likely addresses that the input should have been.

      Cheers - L~R

        Yea, our geocoding process involves human review. Properties that fail to geocode are reviewed and most often the address is dirty and needs human cleaning. Some geocoding engines assign what I view as a geo-coding confidence level. Properties that fall below a certain threshold are reviewed also.

        The candidate lists come from SQL statements generated by Perl code.

        Just for kicks, here's an example of the SQL (one that doesn't use a lat/long filter):

        insert into match_candidates (row1_id, row2_id, strat_name, status)
        select t1.row_id, t2.row_id,
               'name_address_city_zip_county_units_weak_improved_trim_v5',
               case when t1.shim_id = t2.shim_id then 'good' else 'bad' end
        from merge_list t1, merge_list t2
        where t1.row_id < t2.row_id
        -- looking for matches from the AHI-Doug source to some
        -- other source
        and ( (t1.source = 'AHI-Doug' or t2.source = 'AHI-Doug')
              and (t1.source != t2.source) )
        -- no duplicate matches from known phase things
        -- this can be weakened to show no previous
        -- matches from _any_ strategy
        and ( not exists ( select * from match_candidates m
                           where t1.row_id = m.row1_id
                             and t2.row_id = m.row2_id ) )
        -- strategy layers start here (assume at least one)
        AND -- similar name (weak)
        t1.name is not null and t2.name is not null
        and ( regexp_replace(lower(t1.name), '[^a-z ]+', '', 'g') ~
                regexp_replace(lower(t2.name), '[^a-z ]+', '', 'g')
           or regexp_replace(lower(t2.name), '[^a-z ]+', '', 'g') ~
                regexp_replace(lower(t1.name), '[^a-z ]+', '', 'g') )
        AND -- similar address (weak)
        t1.address is not null and t2.address is not null
        and ( regexp_replace(lower(t1.address), '[^a-z ]+', '', 'g') ~
                regexp_replace(lower(t2.address), '[^a-z ]+', '', 'g')
           or regexp_replace(lower(t2.address), '[^a-z ]+', '', 'g') ~
                regexp_replace(lower(t1.address), '[^a-z ]+', '', 'g') )
        AND -- similar city (weak)
        t1.city is not null and t2.city is not null
        and ( regexp_replace(lower(t1.city), '[^a-z ]+', '', 'g') ~
                regexp_replace(lower(t2.city), '[^a-z ]+', '', 'g')
           or regexp_replace(lower(t2.city), '[^a-z ]+', '', 'g') ~
                regexp_replace(lower(t1.city), '[^a-z ]+', '', 'g') )
Re: Efficient Fuzzy Matching Of An Address
by Krambambuli (Curate) on Aug 20, 2008 at 10:32 UTC
    Since you seem to deal with US postal addresses, maybe you could run them through the US Postal Service's CASS/MASS address standardization before doing further processing?

    Krambambuli
    ---
    fighting the sandals-without-socks dictature
Re: Efficient Fuzzy Matching Of An Address
by metaperl (Curate) on Aug 20, 2008 at 13:58 UTC
    Some comments:
    1. The general term for what you are doing overall is called record linkage.
    2. The particular phase you are trying to optimize might be called candidate record pruning.
    3. I learned a lot by reading the docs for FEBRL. They used Markov modelling to prune the search space. There are also some good approximate matching methods there that are not implemented in pure Perl (such as Jaro-Winkler).
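
    For reference, the Jaro-Winkler metric mentioned above is not hard to sketch in pure Perl. This follows the standard textbook definition (matches within a sliding window, a transposition penalty, and a bonus for a shared prefix of up to four characters); treat it as illustrative rather than battle-tested:

```perl
use strict;
use warnings;

# Jaro similarity: fraction of matched characters, penalized by
# transpositions, per the standard definition.
sub jaro {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;
    my ($ls, $lt) = (scalar @s, scalar @t);
    return 0 unless $ls && $lt;
    my $window = int((($ls > $lt ? $ls : $lt) / 2)) - 1;
    $window = 0 if $window < 0;
    my (@smatch, @tmatch);
    my $m = 0;
    for my $i (0 .. $ls - 1) {
        my $lo = $i - $window < 0       ? 0       : $i - $window;
        my $hi = $i + $window > $lt - 1 ? $lt - 1 : $i + $window;
        for my $j ($lo .. $hi) {
            next if $tmatch[$j] or $s[$i] ne $t[$j];
            $smatch[$i] = $tmatch[$j] = 1;
            $m++;
            last;
        }
    }
    return 0 unless $m;
    my ($k, $trans) = (0, 0);
    for my $i (0 .. $ls - 1) {
        next unless $smatch[$i];
        $k++ until $tmatch[$k];
        $trans++ if $s[$i] ne $t[$k];
        $k++;
    }
    $trans /= 2;
    return ($m / $ls + $m / $lt + ($m - $trans) / $m) / 3;
}

# Winkler boost: reward a shared prefix of up to four characters.
sub jaro_winkler {
    my ($s, $t) = @_;
    my $j = jaro($s, $t);
    my $prefix = 0;
    for my $i (0 .. 3) {
        last if $i >= length $s or $i >= length $t;
        last if substr($s, $i, 1) ne substr($t, $i, 1);
        $prefix++;
    }
    return $j + $prefix * 0.1 * (1 - $j);
}

printf "%.3f\n", jaro_winkler('MARTHA', 'MARHTA');      # 0.961
printf "%.3f\n", jaro_winkler('Broadway', 'Brodway');
```

    The prefix bonus is what makes Jaro-Winkler work well on names and street names, where errors tend to occur after the first few characters.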
      Look at www.matchlogics.com
Re: Efficient Fuzzy Matching Of An Address
by Anonymous Monk on Aug 20, 2008 at 04:55 UTC
    Add GPS coordinates for each address and do some math?
      Anonymous Monk,
      Determining the lat/long coordinates for addresses in the DB is possible. The issue is that the submitted address will not contain coordinates (that is not under my control). I could look them up, but that gets to the crux of the problem: if the address isn't valid (typo, transcription problem, etc.), I won't have anything to do math on.

      Cheers - L~R

      Throw in some of that Google Cloud computing stuff for speed

Node Type: perlquestion [id://705271]
Approved by grep
Front-paged by clinton