Honestly I think that a huge part of this problem could be solved by using the Database as effectively as possible. I am hoping that you are using a standard DB such as Oracle, MySQL, etc. The info, then, that you want to search through can be sorted using an "ORDER BY" statement on the address number and street name. This will give a sorted list that will basically group all the dupes next to each other.
As for the rest, all you need to do is match address (number + street name) and make a hash or some other data structure to store the accepted spellings of common street names or other address conventions. (such ast st. blvd. etcetera). You can have the program run through this hash and transform them to a common output and you should get duplicate output, with different primary keys.
This is a general overview of your problem. There are obviously going to be some edge cases come up when you tackle this problem. This is quite a large problem, but I think that if you are able to have the DB work for you it will simplify things tremendously.
-
Are you posting in the right place? Check out Where do I post X? to know for sure.
-
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big>
<blockquote> <br /> <dd>
<dl> <dt> <em> <font>
<h1> <h2> <h3> <h4>
<h5> <h6> <hr /> <i>
<li> <nbsp> <ol> <p>
<small> <strike> <strong>
<sub> <sup> <table>
<td> <th> <tr> <tt>
<u> <ul>
-
Snippets of code should be wrapped in
<code> tags not
<pre> tags. In fact, <pre>
tags should generally be avoided. If they must
be used, extreme care should be
taken to ensure that their contents do not
have long lines (<70 chars), in order to prevent
horizontal scrolling (and possible janitor
intervention).
-
Want more info? How to link
or How to display code and escape characters
are good places to start.
|