All,

Assume you have a database full of addresses, for example:
123 Main St.
Somewhere, USA 12345

Assume that the address is stored in a number of forms:

Exactly as entered
Standardized via Lingua::EN::AddressParse
Encoded as soundex via Text::Soundex (exact/standardized)
Encoded as metaphone via Text::Metaphone (exact/standardized)
Broken out into individual pieces (street, city, ZIP code, etc.)
Possibly each individual piece encoded as described above (exact/standardized)

The problem I am trying to solve is this: given an address as input, find any "similar" addresses in the DB without comparing the input against every stored address (via String::Approx or Text::Levenshtein, for instance). Using the encoding routines alone seems to produce a lot of false positives and false negatives.
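
To make the goal concrete, here is a minimal sketch of one way "find similar addresses without comparing every address" could look: an inverted index over character trigrams, so that only addresses sharing at least one trigram with the input are ever scored. This is an illustration under assumptions, not a recommendation from anyone with real data. It is written in Python rather than Perl purely for brevity, the names `trigrams` and `TrigramIndex` are hypothetical, the normalization is deliberately crude (a real system would use the standardized form), and the 0.5 threshold is an arbitrary placeholder that would need tuning.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of character trigrams of a crudely normalized string."""
    # Lowercase, drop periods and commas, collapse runs of whitespace.
    norm = " ".join(text.lower().replace(".", "").replace(",", "").split())
    padded = f"  {norm} "  # padding so the start and end form trigrams too
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


class TrigramIndex:
    """Hypothetical inverted index: trigram -> ids of addresses containing it."""

    def __init__(self):
        self.postings = defaultdict(set)  # trigram -> set of address ids
        self.grams = {}                   # address id -> its trigram set
        self.addresses = {}               # address id -> original text

    def add(self, addr_id, address):
        grams = trigrams(address)
        self.grams[addr_id] = grams
        self.addresses[addr_id] = address
        for gram in grams:
            self.postings[gram].add(addr_id)

    def search(self, query, threshold=0.5):
        """Return (score, address) pairs with Jaccard similarity >= threshold.

        Only addresses sharing at least one trigram with the query are
        scored; everything else in the index is never touched.
        """
        query_grams = trigrams(query)
        shared = defaultdict(int)  # address id -> number of shared trigrams
        for gram in query_grams:
            for addr_id in self.postings.get(gram, ()):
                shared[addr_id] += 1

        results = []
        for addr_id, common in shared.items():
            union = len(query_grams) + len(self.grams[addr_id]) - common
            score = common / union  # Jaccard similarity of the trigram sets
            if score >= threshold:
                results.append((score, self.addresses[addr_id]))
        return sorted(results, reverse=True)
```

The surviving candidates could then be passed to a more expensive comparison (an edit distance, say) to cut down the false positives, which is the step that is too slow to run against the whole table.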

Does anyone with expertise in this area have some advice? I am sorry that I don't have any real data I can share.

Cheers - L~R

In reply to Efficient Fuzzy Matching Of An Address by Limbic~Region



Welcome to the Monastery
	PerlMonks