PerlMonks  

Re: De Duping Street Addresses Fuzzily

by Limbic~Region (Chancellor)
on Feb 01, 2005 at 01:01 UTC ( [id://426776] )


in reply to De Duping Street Addresses Fuzzily

patrickrock,
I am surprised no one has mentioned Lingua::EN::AddressParse yet. It will not be a 100% solution; as indicated elsewhere in this thread, commercial products such as Group 1 are really good at this. Using the module, though, you can probably reduce the amount of work that needs to be done by hand to about 10%. You could also use something like Geo::Coder::US to help determine whether the address you have is actually valid. A bit more research on CPAN might turn up even more goodies.

Cheers - L~R

On further review of this module, it appears the Parse::RecDescent grammar for US addresses could use some TLC. The author, Kim Ryan, appears to be from down under, and complex US addresses don't seem to get parsed correctly. I bet someone here can improve it though ;-)

Replies are listed 'Best First'.
Re^2: De Duping Street Addresses Fuzzily
by Anonymous Monk on Feb 01, 2005 at 01:46 UTC
    Ditto for the suggestion of using Lingua::EN::AddressParse. Even if the high level methods don't suit your exact problem space, you may find the back-end code and normalisations useful in putting your data in a more consistent format before sorting/processing. It also handles a number of special cases that occur in the real world.
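To make the "normalise first, then sort/process" idea concrete, here is a minimal core-Perl sketch. It does not use Lingua::EN::AddressParse or reproduce its internals: the %ABBREV table and the normalise_address and group_duplicates subs are hypothetical names made up for illustration, and the table is far too small for real data. The idea is only that records which collapse to the same normalised key are candidate duplicates.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical abbreviation table (an assumption for this sketch);
# a real one would be much longer and locale specific.
my %ABBREV = (
    street    => 'ST',   st   => 'ST',
    avenue    => 'AVE',  ave  => 'AVE',  av => 'AVE',
    road      => 'RD',   rd   => 'RD',
    boulevard => 'BLVD', blvd => 'BLVD',
    drive     => 'DR',   dr   => 'DR',
    north     => 'N',    south => 'S',
    east      => 'E',    west  => 'W',
    apartment => 'APT',  apt  => 'APT',
    suite     => 'STE',  ste  => 'STE',
);

# Reduce a raw address line to a canonical upper-case key.
sub normalise_address {
    my ($raw) = @_;
    my $addr = lc $raw;
    $addr =~ s/[.,#]/ /g;    # treat common punctuation as whitespace
    my @words = map { exists $ABBREV{$_} ? $ABBREV{$_} : uc $_ }
                split ' ', $addr;    # split ' ' also trims and squeezes spaces
    return join ' ', @words;
}

# Bucket the original records under their normalised key.
sub group_duplicates {
    my @addresses = @_;
    my %groups;
    push @{ $groups{ normalise_address($_) } }, $_ for @addresses;
    return \%groups;
}

my @sample = (
    '123 North Main Street',
    '123 N. Main St.',
    '123 n main st',
    '45 Oak Avenue, Apt 2',
    '45 Oak Ave Apartment 2',
);

my $groups = group_duplicates(@sample);
for my $key ( sort keys %$groups ) {
    printf "%-25s %d record(s)\n", $key, scalar @{ $groups->{$key} };
}
```

A plain lookup table like this will happily turn "St" in "St Louis" into a street suffix, which is the sort of special case a real grammar is meant to handle, so treat the buckets as candidates for review rather than confirmed duplicates.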
