Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:
All,
Assume you have a database full of addresses.
123 Main St.
Somewhere, USA 12345
Assume that the address is stored in a number of forms:
- Exactly as entered
- Standardized via Lingua::EN::AddressParse
- Encoded as soundex via Text::Soundex (exact/standardized)
- Encoded as metaphone via Text::Metaphone (exact/standardized)
- Broken out into individual pieces (street, city, zip, etc)
- Possibly each individual piece encoded as described above (exact/standardized)
The problem I am trying to solve is this: given an address as input, find any "similar" addresses in the DB without comparing the input against every address (via String::Approx or Text::Levenshtein, for instance). Using the encoding routines alone seems to produce a lot of false positives and false negatives.
Does anyone with expertise in this area have some advice? I am sorry, I don't have any real data I can share.
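To make the false-positive/false-negative problem concrete, here is a pure-Python re-implementation of the classic (Knuth) Soundex algorithm that Text::Soundex computes. It happily collapses the typo "Brodway" into "Broadway" (good), but it also collapses distinct words like "Main" and "Maine" into the same code (a false positive):

```python
def soundex(word):
    """Classic American Soundex: first letter plus three digits."""
    codes = {c: d
             for d, letters in enumerate(
                 ["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1)
             for c in letters}
    word = "".join(c for c in word.lower() if c.isalpha())
    if not word:
        return ""
    out = word[0].upper()
    prev = codes.get(word[0])
    for c in word[1:]:
        d = codes.get(c)
        if d and d != prev:
            out += str(d)
        if c not in "hw":          # h/w are skipped without resetting prev
            prev = d               # vowels reset prev (d is None for them)
    return (out + "000")[:4]

print(soundex("Broadway"), soundex("Brodway"))  # both B630 - typo caught
print(soundex("Main"), soundex("Maine"))        # both M500 - false positive
```

This is why the encodings work better as a coarse bucketing key (to cut down the candidate set) than as the final similarity test.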
Re: Efficient Fuzzy Matching Of An Address
by BrowserUk (Patriarch) on Aug 19, 2008 at 18:10 UTC
|
I think the mechanism I describe in the subthread starting at Re^3: Comparing text documents could be adapted to this purpose depending upon what your actual goal is?
When you say "Given an address as input, find any "similar" addresses in the DB", you're not trying to (for example) locate next-door neighbours, but rather locate duplicates with minor typos or transcription errors?
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] |
|
BrowserUk,
When you say "Given an address as input, find any "similar" addresses in the DB", you're not trying to (for example) locate next-door neighbours, but rather locate duplicates with minor typos or transcription errors?
The latter (typos and transcription errors), with a caveat. If the input is '123 Main St' and the DB has a record of '356 Main Street', I would hope that the search wouldn't return "no results found". On the other hand, if the input was '12 Elm Ave' (two houses over), then the search should definitely not find a match. Ultimately, I would like to return the best N matches ordered by degree of similarity.
Thanks for the pointer to Re^3: Comparing text documents. If I get really motivated, I will try it out.
| [reply] |
Re: Efficient Fuzzy Matching Of An Address
by eyepopslikeamosquito (Bishop) on Aug 19, 2008 at 18:34 UTC
|
I faced a similar problem years ago (the program was written in C) when processing millions of addresses to eliminate duplicates for a mailing house. It wasn't feasible to compare every address with every other address. What I ended up doing was partitioning the addresses into "buckets" by postcode and comparing all addresses with all other addresses in the same bucket (i.e. with the same postcode) only. To ensure reasonable reliability of postcodes, I had another program to compare suburb names with postcodes, looking for errors.
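The bucketing idea can be sketched like this (Python for brevity; `SequenceMatcher` stands in for whatever pairwise comparison you prefer, and the 0.9 threshold is made up). The key point is that the quadratic comparison only happens within a bucket, never across the whole file:

```python
from collections import defaultdict
from difflib import SequenceMatcher

def find_duplicates(addresses, threshold=0.9):
    """Partition (street, postcode) pairs into buckets by postcode,
    then compare addresses only within the same bucket."""
    buckets = defaultdict(list)
    for street, postcode in addresses:
        buckets[postcode].append(street)

    dups = []
    for postcode, streets in buckets.items():
        for i in range(len(streets)):
            for j in range(i + 1, len(streets)):
                sim = SequenceMatcher(None, streets[i].lower(),
                                      streets[j].lower()).ratio()
                if sim >= threshold:
                    dups.append((streets[i], streets[j], postcode))
    return dups

pairs = find_duplicates([("123 Main St",  "12345"),
                         ("123 Main St.", "12345"),
                         ("123 Main St",  "99999")])
print(pairs)  # only the two records sharing postcode 12345 are compared
```

The obvious weakness, as the follow-up below notes, is that a typo in the postcode itself puts the record in the wrong bucket, which is why the postcodes were cleaned first.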
| [reply] |
|
eyepopslikeamosquito,
What I ended up doing was partitioning the addresses into "buckets" by postcode and comparing all addresses with all other addresses in the same bucket only. To ensure reasonable reliability of postcodes, I had another program to compare suburb names with postcodes, looking for errors.
Let's say my input address is:
123 Broadway Av
Brownville, ME 04415
Let's say the correct best DB address to match is
123 Brodway Avenue
Brownville Junction, ME 04451
How would your approach work?
| [reply] |
|
It's a long time ago, but I'll try to remember how this address de-duplication program operated.
The addresses were first parsed into an "internal (canonical) form": for example, "Av" and "Avenue" above would parse to the same internal number. The "internal forms" were compared, not the original text. Addresses that were too weird to parse were rejected for inspection by a human. As part of the parsing, suburbs were checked against an Australian postcode database and errors reported, so the postcodes were cleaned (sometimes by a human). Suburbs were then ignored; postcodes were considered more reliable (people sometimes lie about their suburb name in Australia because some suburbs are "posher" than others).
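A minimal sketch of that canonicalisation step (Python; the suffix table here is a made-up fragment for illustration, not the real list the program used, which would cover the full set of standard abbreviations):

```python
# Hypothetical abbreviation table; a real system would load the complete
# standard street-suffix list rather than hard-code a few entries.
SUFFIX = {"av": "avenue", "ave": "avenue", "avenue": "avenue",
          "st": "street", "str": "street", "street": "street",
          "rd": "road", "road": "road"}

def canonicalize(street):
    """Lowercase, strip punctuation, map suffix abbreviations to one form."""
    cleaned = "".join(c for c in street.lower()
                      if c.isalnum() or c.isspace())
    return " ".join(SUFFIX.get(w, w) for w in cleaned.split())

print(canonicalize("123 Broadway Av"))      # 123 broadway avenue
print(canonicalize("123 Broadway Avenue"))  # 123 broadway avenue - same form
```

Comparing canonical forms instead of raw text makes "Av" vs "Avenue" a non-issue before any fuzzy matching even starts.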
There was also a user-configurable "number of transcription errors" allowed; if you allowed one transcription error, then "Broadway" and "Brodway" above would be considered a match. There was another parameter for transposition errors (e.g. "Gergory" versus "Gregory") and many other heuristics, like "Flat 6" versus "Flat 6A" versus "Apartment 6".
Update: Perhaps a better technique, which I didn't know about at the time and which metaperl suggests below, is Jaro-Winkler distance. I remember trying classical Knuth Soundex and finding it not useful in this domain.
One of the many imperfect and sometimes amusing heuristics I remember was "Captain Cook Cruises": is this a gentleman with a title of Captain, first name Cook, last name Cruises, or is it a company? Sydneysiders will know the likely answer :-). How about "Mr William and Mrs Melinda Gates": is this "Mr William" and "Mrs Melinda Gates" or "Mr William Gates" and "Mrs Melinda Gates"?
Update: No matter what (imperfect) heuristics you use for the address parsing and matching, you still face the infeasibility of comparing each address with every other one, as n becomes large. An obvious way to tackle that combinatorial explosion is to partition the addresses into "buckets" and compare only addresses within the same bucket. While various "bucketing" schemes are possible, the best/simplest I found was to bucket on postcode.
| [reply] |
Re: Efficient Fuzzy Matching Of An Address
by jimX11 (Friar) on Aug 20, 2008 at 05:32 UTC
|
We geocode all addresses, then use a radius as the first filter, then apply other heuristics to filter after that.
Geocoding raises other issues. For example, some geocoding engines return multiple lat/long pairs for a single address.
We deal with only about 2,500 addresses from 4 data sources with quarterly updates and about a 10% turnover rate.
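The radius first-filter amounts to a great-circle distance cutoff. A sketch in Python (the row layout and the radius value are made up for illustration):

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/long points, in kilometres."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def within_radius(candidate, db_points, radius_km):
    """First-pass filter: keep DB rows whose geocode is near the candidate."""
    lat, lon = candidate
    return [p for p in db_points
            if haversine_km(lat, lon, p[0], p[1]) <= radius_km]

# Everything inside ~100 km survives to the slower heuristic filters.
print(within_radius((0.0, 0.0), [(0.0, 0.5), (0.0, 5.0)], 100))
```

With only a few thousand rows, even a linear scan like this is cheap; at larger scale the same cutoff would be pushed into a spatial index.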
We use human review. We set a strict "similarity" level then generate a list of match candidates. Humans review this list and evaluate as a match or non-match. We then loosen the similarity requirement and generate another candidate list (but filter any non-matches found by the first filter). That candidate list is evaluated by humans. Loosen, (rinse) and repeat.
After matching comes more fun: merging. Put another way, once you have a match that is not exact, how do you reconcile the differences? Given two different values for the same field, which is right?
| [reply] |
|
Yeah, our geocoding process involves human review. Properties that fail to geocode are reviewed; most often the address is dirty and needs human cleaning. Some geocoding engines assign what I view as a geocoding confidence level; properties that fall below a certain threshold are also reviewed.
The candidate lists come from SQL statements generated by Perl code. Just for kicks, here's an example of the SQL (one that doesn't use a lat/long filter):
insert into match_candidates (row1_id, row2_id, strat_name, status)
select t1.row_id,
       t2.row_id,
       'name_address_city_zip_county_units_weak_improved_trim_v5',
       case
         when t1.shim_id = t2.shim_id then 'good'
         else 'bad'
       end
from merge_list t1, merge_list t2
where t1.row_id < t2.row_id
  -- looking for matches from the AHI-Doug source to some other source
  and (
    (t1.source = 'AHI-Doug' or t2.source = 'AHI-Doug')
    and t1.source != t2.source
  )
  -- no duplicate matches from known phase things;
  -- this can be weakened to show no previous matches from _any_ strategy
  and not exists (
    select *
    from match_candidates m
    where t1.row_id = m.row1_id
      and t2.row_id = m.row2_id
  )
  -- strategy layers start here (assume at least one)
  -- similar name (weak)
  and t1.name is not null
  and t2.name is not null
  and (  regexp_replace(lower(t1.name), '[^a-z ]+', '', 'g')
       ~ regexp_replace(lower(t2.name), '[^a-z ]+', '', 'g')
      or regexp_replace(lower(t2.name), '[^a-z ]+', '', 'g')
       ~ regexp_replace(lower(t1.name), '[^a-z ]+', '', 'g')
  )
  -- similar address (weak)
  and t1.address is not null
  and t2.address is not null
  and (  regexp_replace(lower(t1.address), '[^a-z ]+', '', 'g')
       ~ regexp_replace(lower(t2.address), '[^a-z ]+', '', 'g')
      or regexp_replace(lower(t2.address), '[^a-z ]+', '', 'g')
       ~ regexp_replace(lower(t1.address), '[^a-z ]+', '', 'g')
  )
  -- similar city (weak)
  and t1.city is not null
  and t2.city is not null
  and (  regexp_replace(lower(t1.city), '[^a-z ]+', '', 'g')
       ~ regexp_replace(lower(t2.city), '[^a-z ]+', '', 'g')
      or regexp_replace(lower(t2.city), '[^a-z ]+', '', 'g')
       ~ regexp_replace(lower(t1.city), '[^a-z ]+', '', 'g')
  )
| [reply] [d/l] |
Re: Efficient Fuzzy Matching Of An Address
by eyepopslikeamosquito (Bishop) on Aug 19, 2008 at 21:51 UTC
|
| [reply] |
Re: Efficient Fuzzy Matching Of An Address
by Krambambuli (Curate) on Aug 20, 2008 at 10:32 UTC
|
| [reply] |
Re: Efficient Fuzzy Matching Of An Address
by metaperl (Curate) on Aug 20, 2008 at 13:58 UTC
|
Some comments:
- The general term for what you are doing overall is record linkage.
- The particular phase you are trying to optimize might be called candidate record pruning (often just "blocking").
- I learned a lot by reading the docs for FEBRL. They used Markov modelling to prune the search space. There are also some good approximate matching methods there (such as Jaro-Winkler) that have no pure-Perl implementation.
| [reply] |
|
Look at www.matchlogics.com
| [reply] |
Re: Efficient Fuzzy Matching Of An Address
by Anonymous Monk on Aug 20, 2008 at 04:55 UTC
|
Add GPS coordinates for each address, then do some math? | [reply] |
|
Anonymous Monk,
Determining the long/lat coordinates for addresses in the DB is possible. The issue is that the submitted address will not contain a coordinate (that's not under my control). I could look it up, but that gets to the crux of the problem: if the address isn't valid (typo, transcription problem, etc.), I won't have anything to do math on.
| [reply] |
|
Throw in some of that Google Cloud computing stuff for speed
| [reply] |