Beefy Boxes and Bandwidth Generously Provided by pair Networks
Do you know where your variables are?
 
PerlMonks  

Re: Efficient Fuzzy Matching Of An Address

by jimX11 (Friar)
on Aug 20, 2008 at 05:32 UTC ( [id://705413]=note: print w/replies, xml ) Need Help??


in reply to Efficient Fuzzy Matching Of An Address

We geo code all addresses then use a radius as the first filter then apply other heuristics to filter after that.

Geo coding raises other issues. For example, some geo coding engines return multiple lat/long pairs (for one address).

We deal with only about 2,500 addresses from 4 data sources with quarterly updates and about a 10% turnover rate. We use human review. We set a strict "similarity" level then generate a list of match candidates. Humans review this list and evaluate as a match or non-match. We then loosen the similarity requirement and generate another candidate list (but filter any non-matches found by the first filter). That candidate list is evaluated by humans. Loosen, (rinse) and repeat.

After matching comes more fun, merging. Put another way, once you have a match that is not exact, how do you reconcile the differences? Given two differences, which is right?

  • Comment on Re: Efficient Fuzzy Matching Of An Address

Replies are listed 'Best First'.
Re^2: Efficient Fuzzy Matching Of An Address
by Limbic~Region (Chancellor) on Aug 20, 2008 at 12:32 UTC
    jimX11,
    I assume that if your geo coding fails, you have a human do review? In other words, someone provides an invalid address due to a typo (for instance).

    That is exactly what I am doing. I am just trying to make the human's job easier. I want them to pull out the most likely addresses that the input should have been.

    Cheers - L~R

      Yea, our geocoding process involves human review. Properties that fail to geocode are reviewed and most often the address is dirty and needs human cleaning. Some geocoding engines assign what I view as a geo-coding confidence level. Properties that fall below a certain threshold are reviewed also.

      The candidate lists come from sql statements generated by perl code.

      Just for kicks, here's an example of the sql (that doesn't use a lat/long filter):

      insert into match_candidates (row1_id,row2_id,strat_name,status) select t1.row_id, t2.row_id, 'name_address_city_zip_county_units_weak_improved_trim_v5', case when t1.shim_id = t2.shim_id then 'good' else 'bad' end from merge_list t1, merge_list t2 where t1.row_id < t2.row_id -- looking for matches from the AHI-Doug source to some -- other source and ( (t1.source = 'AHI-Doug' or t2.source = 'AHI-Doug') and (t1.sourc +e != t2.source) ) -- no duplicate matches from known phase things -- this can be weakened to show no previous -- matches from _any_ stratagy and ( not exists ( select * from match_candidates m where t1.row_id = m.row1_id and t2.row_id = m.row2_id ) ) -- strategy layers start here (assume at least one) AND -- similar name (weak) t1.name is not null and t2.name is not null and ( regexp_replace(lower(t1.name),'[^a-z ]+','','g') ~ regexp_repla +ce(lower(t2.name),'[^a-z ]+','','g') or regexp_replace(lower(t2.name),'[^a-z ]+','','g') ~ regexp_repla +ce(lower(t1.name),'[^a-z ]+','','g') ) AND -- similar address (weak) t1.address is not null and t2.address is not null and ( regexp_replace(lower(t1.address),'[^a-z ]+','','g') ~ regexp_re +place(lower(t2.address),'[^a-z ]+','','g') or regexp_replace(lower(t2.address),'[^a-z ]+','','g') ~ regexp_re +place(lower(t1.address),'[^a-z ]+','','g') ) AND -- similar city (weak) t1.city is not null and t2.city is not null and ( regexp_replace(lower(t1.city),'[^a-z ]+','','g') ~ regexp_repla +ce(lower(t2.city),'[^a-z ]+','','g') or regexp_replace(lower(t2.city),'[^a-z ]+','','g') ~ regexp_repla +ce(lower(t1.city),'[^a-z ]+','','g') )

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://705413]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others sharing their wisdom with the Monastery: (4)
As of 2024-04-25 10:28 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found