Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:
All,
Assume you have a database full of addresses.
123 Main St.
Somewhere, USA 12345
Assume that the address is stored in a number of forms:
- Exactly as entered
- Standardized via Lingua::EN::AddressParse
- Encoded as soundex via Text::Soundex (exact/standardized)
- Encoded as metaphone via Text::Metaphone (exact/standardized)
- Broken out into individual pieces (street, city, zip, etc)
- Possibly each individual piece encoded as described above (exact/standardized)
The problem I am trying to solve is this: given an address as input, find any "similar" addresses in the DB without comparing the input against every address (via String::Approx or Text::Levenshtein, for instance). Using the encoding routines alone seems to produce a lot of false positives and false negatives.
Does anyone with expertise in this area have some advice? I am sorry, I don't have any real data I can share.
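To make the false-positive/false-negative problem concrete, here is a pure-Python re-implementation of the classic (Knuth) Soundex algorithm that Text::Soundex computes. It happily collapses the typo "Brodway" into "Broadway" (good), but it also collapses distinct words like "Main" and "Maine" into the same code (a false positive):

```python
def soundex(word):
    """Classic American Soundex: first letter plus three digits."""
    codes = {c: d
             for d, letters in enumerate(
                 ["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1)
             for c in letters}
    word = "".join(c for c in word.lower() if c.isalpha())
    if not word:
        return ""
    out = word[0].upper()
    prev = codes.get(word[0])
    for c in word[1:]:
        d = codes.get(c)
        if d and d != prev:
            out += str(d)
        if c not in "hw":          # h/w are skipped without resetting prev
            prev = d               # vowels reset prev (d is None for them)
    return (out + "000")[:4]

print(soundex("Broadway"), soundex("Brodway"))  # both B630 - typo caught
print(soundex("Main"), soundex("Maine"))        # both M500 - false positive
```

This is why the encodings work better as a coarse bucketing key (to cut down the candidate set) than as the final similarity test.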
Re: Efficient Fuzzy Matching Of An Address
by BrowserUk (Patriarch) on Aug 19, 2008 at 18:10 UTC
|
I think the mechanism I describe in the subthread starting at Re^3: Comparing text documents could be adapted to this purpose depending upon what your actual goal is?
When you say "Given an address as input, find any "similar" addresses in the DB", you're not trying to (for example) locate next-door neighbours, but rather locate duplicates with minor typos or transcription errors?
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] |
|
BrowserUk,
When you say "Given an address as input, find any "similar" addresses in the DB", you're not trying to (for example) locate next-door neighbours, but rather locate duplicates with minor typos or transcription errors?
The latter (typos and transcription errors), with a caveat. If the input is '123 Main St' and the DB has a record of '356 Main Street', I would hope that the search wouldn't return "no results found". On the other hand, if the input was '12 Elm Ave' (two houses over), then the search should definitely not find a match. Ultimately, I would like to return the best N matches ordered by degree of similarity.
Thanks for the pointer to Re^3: Comparing text documents. If I get really motivated, I will try it out.
| [reply] |
Re: Efficient Fuzzy Matching Of An Address
by eyepopslikeamosquito (Bishop) on Aug 19, 2008 at 18:34 UTC
|
I faced a similar problem years ago (the program was written in C) when processing millions of addresses to eliminate duplicates for a mailing house. It wasn't feasible to compare every address with every other address. What I ended up doing was partitioning the addresses into "buckets" by postcode and comparing all addresses with all other addresses in the same bucket (i.e. with the same postcode) only. To ensure reasonable reliability of postcodes, I had another program to compare suburb names with postcodes, looking for errors.
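The bucketing idea can be sketched like this (Python for brevity; `SequenceMatcher` stands in for whatever pairwise comparison you prefer, and the 0.9 threshold is made up). The key point is that the quadratic comparison only happens within a bucket, never across the whole file:

```python
from collections import defaultdict
from difflib import SequenceMatcher

def find_duplicates(addresses, threshold=0.9):
    """Partition (street, postcode) pairs into buckets by postcode,
    then compare addresses only within the same bucket."""
    buckets = defaultdict(list)
    for street, postcode in addresses:
        buckets[postcode].append(street)

    dups = []
    for postcode, streets in buckets.items():
        for i in range(len(streets)):
            for j in range(i + 1, len(streets)):
                sim = SequenceMatcher(None, streets[i].lower(),
                                      streets[j].lower()).ratio()
                if sim >= threshold:
                    dups.append((streets[i], streets[j], postcode))
    return dups

pairs = find_duplicates([("123 Main St",  "12345"),
                         ("123 Main St.", "12345"),
                         ("123 Main St",  "99999")])
print(pairs)  # only the two records sharing postcode 12345 are compared
```

The obvious weakness, as the follow-up below notes, is that a typo in the postcode itself puts the record in the wrong bucket, which is why the postcodes were cleaned first.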
| [reply] |
|
eyepopslikeamosquito,
What I ended up doing was partitioning the addresses into "buckets" by postcode and comparing all addresses with all other addresses in the same bucket only. To ensure reasonable reliability of postcodes, I had another program to compare suburb names with postcodes, looking for errors.
Let's say my input address is:
123 Broadway Av
Brownville, ME 04415
Let's say the correct best DB address to match is
123 Brodway Avenue
Brownville Junction, ME 04451
How would your approach work?
| [reply] |
|
It's a long time ago, but I'll try to remember how this address de-duplication program operated.
The addresses were first parsed into an "internal (canonical) form": for example, "Av" and "Avenue" above would parse to the same internal number. The "internal forms" were compared, not the original text. Addresses that were too weird to parse were rejected for inspection by a human. As part of the parsing, suburbs were checked against an Australian postcode database and errors reported, so the postcodes were cleaned (sometimes by a human). Suburbs were then ignored; postcodes were considered more reliable (people sometimes lie about their suburb name in Australia because some suburbs are "posher" than others).
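A minimal sketch of that canonicalisation step (Python; the suffix table here is a made-up fragment for illustration, not the real list the program used, which would cover the full set of standard abbreviations):

```python
# Hypothetical abbreviation table; a real system would load the complete
# standard street-suffix list rather than hard-code a few entries.
SUFFIX = {"av": "avenue", "ave": "avenue", "avenue": "avenue",
          "st": "street", "str": "street", "street": "street",
          "rd": "road", "road": "road"}

def canonicalize(street):
    """Lowercase, strip punctuation, map suffix abbreviations to one form."""
    cleaned = "".join(c for c in street.lower()
                      if c.isalnum() or c.isspace())
    return " ".join(SUFFIX.get(w, w) for w in cleaned.split())

print(canonicalize("123 Broadway Av"))      # 123 broadway avenue
print(canonicalize("123 Broadway Avenue"))  # 123 broadway avenue - same form
```

Comparing canonical forms instead of raw text makes "Av" vs "Avenue" a non-issue before any fuzzy matching even starts.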
There was also a user-configurable "number of transcription errors" allowed; if you allowed one transcription error, then "Broadway" and "Brodway" above would be considered a match. There was another parameter for transposition errors (e.g. "Gergory" versus "Gregory") and many other heuristics, like "Flat 6" versus "Flat 6A" versus "Apartment 6".
Update: Perhaps a better technique, which I didn't know about at the time and which metaperl suggests below, is Jaro-Winkler distance. I remember trying classical Knuth Soundex and finding it not useful in this domain.
One of the many imperfect and sometimes amusing heuristics I remember was "Captain Cook Cruises": is this a gentleman with a title of Captain, first name Cook, last name Cruises, or is it a company? Sydneysiders will know the likely answer :-). How about "Mr William and Mrs Melinda Gates": is this "Mr William" and "Mrs Melinda Gates" or "Mr William Gates" and "Mrs Melinda Gates"?
Update: No matter what (imperfect) heuristics you use for the address parsing and matching, you still face the infeasibility of comparing each address with every other one, as n becomes large. An obvious way to tackle that combinatorial explosion is to partition the addresses into "buckets" and compare only addresses within the same bucket. While various "bucketing" schemes are possible, the best/simplest I found was to bucket on postcode.
| [reply] |
Re: Efficient Fuzzy Matching Of An Address
by jimX11 (Friar) on Aug 20, 2008 at 05:32 UTC
|
We geocode all addresses, then use a radius as the first filter, then apply other heuristics to filter after that.
Geocoding raises other issues. For example, some geocoding engines return multiple lat/long pairs for a single address.
We deal with only about 2,500 addresses from 4 data sources with quarterly updates and about a 10% turnover rate.
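The radius first-filter amounts to a great-circle distance cutoff. A sketch in Python (the row layout and the radius value are made up for illustration):

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/long points, in kilometres."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def within_radius(candidate, db_points, radius_km):
    """First-pass filter: keep DB rows whose geocode is near the candidate."""
    lat, lon = candidate
    return [p for p in db_points
            if haversine_km(lat, lon, p[0], p[1]) <= radius_km]

# Everything inside ~100 km survives to the slower heuristic filters.
print(within_radius((0.0, 0.0), [(0.0, 0.5), (0.0, 5.0)], 100))
```

With only a few thousand rows, even a linear scan like this is cheap; at larger scale the same cutoff would be pushed into a spatial index.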
We use human review. We set a strict "similarity" level then generate a list of match candidates. Humans review this list and evaluate as a match or non-match. We then loosen the similarity requirement and generate another candidate list (but filter any non-matches found by the first filter). That candidate list is evaluated by humans. Loosen, (rinse) and repeat.
After matching comes more fun: merging. Put another way, once you have a match that is not exact, how do you reconcile the differences? Given two different values for the same field, which is right?
| [reply] |
|
Yeah, our geocoding process involves human review. Properties that fail to geocode are reviewed; most often the address is dirty and needs human cleaning. Some geocoding engines assign what I view as a geocoding confidence level; properties that fall below a certain threshold are also reviewed.
The candidate lists come from SQL statements generated by Perl code. Just for kicks, here's an example of the SQL (one that doesn't use a lat/long filter):
insert into match_candidates (row1_id, row2_id, strat_name, status)
select t1.row_id,
       t2.row_id,
       'name_address_city_zip_county_units_weak_improved_trim_v5',
       case
         when t1.shim_id = t2.shim_id then 'good'
         else 'bad'
       end
from merge_list t1, merge_list t2
where t1.row_id < t2.row_id
  -- looking for matches from the AHI-Doug source to some other source
  and (
    (t1.source = 'AHI-Doug' or t2.source = 'AHI-Doug')
    and t1.source != t2.source
  )
  -- no duplicate matches from known phase things;
  -- this can be weakened to show no previous matches from _any_ strategy
  and not exists (
    select *
    from match_candidates m
    where t1.row_id = m.row1_id
      and t2.row_id = m.row2_id
  )
  -- strategy layers start here (assume at least one)
  -- similar name (weak)
  and t1.name is not null
  and t2.name is not null
  and (  regexp_replace(lower(t1.name), '[^a-z ]+', '', 'g')
       ~ regexp_replace(lower(t2.name), '[^a-z ]+', '', 'g')
      or regexp_replace(lower(t2.name), '[^a-z ]+', '', 'g')
       ~ regexp_replace(lower(t1.name), '[^a-z ]+', '', 'g')
  )
  -- similar address (weak)
  and t1.address is not null
  and t2.address is not null
  and (  regexp_replace(lower(t1.address), '[^a-z ]+', '', 'g')
       ~ regexp_replace(lower(t2.address), '[^a-z ]+', '', 'g')
      or regexp_replace(lower(t2.address), '[^a-z ]+', '', 'g')
       ~ regexp_replace(lower(t1.address), '[^a-z ]+', '', 'g')
  )
  -- similar city (weak)
  and t1.city is not null
  and t2.city is not null
  and (  regexp_replace(lower(t1.city), '[^a-z ]+', '', 'g')
       ~ regexp_replace(lower(t2.city), '[^a-z ]+', '', 'g')
      or regexp_replace(lower(t2.city), '[^a-z ]+', '', 'g')
       ~ regexp_replace(lower(t1.city), '[^a-z ]+', '', 'g')
  )
| [reply] [d/l] |
Re: Efficient Fuzzy Matching Of An Address
by eyepopslikeamosquito (Bishop) on Aug 19, 2008 at 21:51 UTC
|
| [reply] |
Re: Efficient Fuzzy Matching Of An Address
by Krambambuli (Curate) on Aug 20, 2008 at 10:32 UTC
|
| [reply] |
Re: Efficient Fuzzy Matching Of An Address
by metaperl (Curate) on Aug 20, 2008 at 13:58 UTC
|
Some comments:
- The general term for what you are doing overall is record linkage.
- The particular phase you are trying to optimize might be called candidate record pruning (often just "blocking").
- I learned a lot by reading the docs for FEBRL. They used Markov modelling to prune the search space. There are also some good approximate matching methods there (such as Jaro-Winkler) that have no pure-Perl implementation.
| [reply] |
|
Look at www.matchlogics.com
| [reply] |
Re: Efficient Fuzzy Matching Of An Address
by Anonymous Monk on Aug 20, 2008 at 04:55 UTC
|
Add GPS coordinates for each address, then do some math? | [reply] |
|
Anonymous Monk,
Determining the long/lat coordinates for addresses in the DB is possible. The issue is that the submitted address will not contain a coordinate (that's not under my control). I could look it up, but that gets to the crux of the problem: if the address isn't valid (typo, transcription problem, etc.), I won't have anything to do math on.
| [reply] |
|
Throw in some of that Google Cloud computing stuff for speed
| [reply] |