Honestly, I think a large part of this problem can be solved by making the database do as much of the work as possible. I am hoping that you are using a standard DB such as Oracle, MySQL, etc. If so, the records you want to search can be sorted with an "ORDER BY" clause on the address number and street name. That gives you a sorted list in which most of the dupes end up next to each other.
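To make the sorting step concrete, here is a minimal sketch in Python using an in-memory SQLite database as a stand-in for whatever DB you actually have. The table name `addresses` and the columns `id`, `street_number` and `street_name` are assumptions for illustration only, since the original question does not show a schema.

```python
import sqlite3

# In-memory SQLite stands in for the real database (Oracle, MySQL, ...).
conn = sqlite3.connect(":memory:")

# Hypothetical schema: the real table and column names will differ.
conn.execute(
    "CREATE TABLE addresses ("
    "id INTEGER PRIMARY KEY, street_number INTEGER, street_name TEXT)"
)

# Made-up sample rows containing two pairs of likely duplicates.
SAMPLE_ROWS = [
    (1, 42, "Main St."),
    (2, 7, "Oak Blvd"),
    (3, 42, "Main Street"),
    (4, 7, "Oak Boulevard"),
]
conn.executemany("INSERT INTO addresses VALUES (?, ?, ?)", SAMPLE_ROWS)


def sorted_addresses(connection):
    """Return all rows ordered by house number, then street name."""
    cursor = connection.execute(
        "SELECT id, street_number, street_name "
        "FROM addresses "
        "ORDER BY street_number, street_name"
    )
    return cursor.fetchall()


SORTED_ROWS = sorted_addresses(conn)
for row in SORTED_ROWS:
    print(row)
```

Note that sorting only brings candidates close together: "Main St." and "Main Street" sort next to each other here, but they are still different strings, so a spelling-normalization pass is needed before they compare equal.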

As for the rest, all you need to do is match on the address (number + street name) and build a hash or some other data structure that maps the accepted spellings of common street suffixes and other address conventions (such as st., blvd., etcetera) to a single canonical form. Have the program run every address through this hash and transform it to that common form; duplicates will then show up as identical output with different primary keys.
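The hash idea could be sketched roughly like this. It is written in Python, where a dict plays the role of the hash. The `SUFFIXES` table, the function names and the sample rows are all made up for illustration, and the table is nowhere near complete; a real one would need many more entries.

```python
import re
from collections import defaultdict

# Hypothetical lookup table: accepted spellings -> one canonical form.
SUFFIXES = {
    "st": "street", "str": "street", "street": "street",
    "blvd": "boulevard", "boulevard": "boulevard",
    "ave": "avenue", "avenue": "avenue",
    "rd": "road", "road": "road",
}


def normalize(number, street):
    """Build a comparison key from a house number and a street name."""
    # Lowercase and strip punctuation so "St." and "st" look the same.
    words = re.sub(r"[^\w\s]", "", street.lower()).split()
    if words:
        # Only the last word is treated as a suffix, so a leading "St"
        # (as in "St James Rd") is not wrongly expanded to "street".
        words[-1] = SUFFIXES.get(words[-1], words[-1])
    return f"{number} {' '.join(words)}"


def find_duplicates(rows):
    """Group primary keys by normalized address; keep groups of 2+."""
    groups = defaultdict(list)
    for row_id, number, street in rows:
        groups[normalize(number, street)].append(row_id)
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


ROWS = [
    (1, 42, "Main St."),
    (2, 7, "Oak Blvd"),
    (3, 42, "Main Street"),
    (4, 7, "Oak Boulevard"),
    (5, 9, "Elm Rd"),
]
DUPES = find_duplicates(ROWS)
print(DUPES)
```

Each key in the result is a normalized address and each value is the list of primary keys that collapsed onto it, which is the "duplicate output, different primary keys" situation described above. Misspellings and transposed words will not be caught by a lookup table alone; those are among the edge cases.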

This is only a general overview of your problem. There are obviously going to be some edge cases that come up when you tackle it. It is quite a large problem, but I think that if you can get the DB to work for you, it will simplify things tremendously.

In reply to Re: De Duping Street Addresses Fuzzily by Grundle
in thread De Duping Street Addresses Fuzzily by patrickrock
