Beefy Boxes and Bandwidth Generously Provided by pair Networks
XP is just a number

Algorythym for searching closest neighbor

by belize (Deacon)
on Apr 03, 2001 at 16:15 UTC ( #69272=perlquestion: print w/replies, xml ) Need Help??

belize has asked for the wisdom of the Perl Monks concerning the following question:

I am in the process of designing a search mechanism to find the closest matching ZIP code from a database. I've searched this site but have found nothing relating to ZIP codes.

The only way I can figure to date to accomplish this is:

  • First search for a exact match of all five numbers
  • If nothing is returned, search for an exact match of the first 4 numbers
  • Finally, if nothing is returned, try for an exact match of the first 3 numbers
This may be too elementary, but just wondering if there is a more elegant solution.
  • Comment on Algorythym for searching closest neighbor

Replies are listed 'Best First'.
(Ovid - zip code distance) Re: Algorythym for searching closest neighbor
by Ovid (Cardinal) on Apr 03, 2001 at 16:42 UTC

    If you're looking for the closest neighboring zip code (as opposed to two zip codes that closely match), you really have no guarantee that two zips that are close to one another numerically are going to suit your needs.

    This thread has info about determining how far apart two zip codes are. There's even some sample code for you.


    Join the Perlmonks Setiathome Group or just click on the the link and check out our stats.

(boo)Re: Algorythym for searching closest neighbor
by boo_radley (Parson) on Apr 03, 2001 at 18:26 UTC
    As usual, investigating something relatively mundane yields some fascinating information. Any zip code database you come across in the next few months/years will probably be out of date -- zip codes are offically "mapped" by the census, and the Census Bureau hasn't completed last year's -- and it seems like the results of the survey are accurate to a county level.

    The post office, naturally has their own interface which you can automate through LWP trickery. They indicate that the longest their data will rot is 2 months

    If this is for big commercial use type work, maybe you wanna buy the TIGER/ZIP+4 database the USPS offer -- it cross-references ZIP code to geographic data.

    The odd thing is that every database ( including the unfortunately named MAGGOT) warns that any data you get from any source is probably inaccurate to a (varying) degree.

    Anyway, with the wide variety of possible ways to manipulate a ZIP code, here's one way to fulfill your original request :

    #usr/bin/perl -w use strict; my $zip = "12345"; my @accepted = qw (12390 12351 12345 01241); foreach my $thiszip (@accepted){ print "Matched ", cmpZIP ($zip, $thiszip), "\n"; } sub cmpZIP { my ($zip_a,$zip_b) = @_; if ($zip_a eq $zip_b){return 5} my ($ichi, $ni, $san)=unpack ("a3aa",$zip_a); my ($ein, $zwei ,$drei)=unpack ("a3aa",$zip_b); if (($ichi eq $ein)&&($zwei eq $ni)) {return 4} if ($ichi eq $ein) {return 3} return 0; }
    Although, after reading up on the subject, you may want to change this to a2a3 or a2aaa to reflect the state/county encoding in the ZIP.
Re: Algorythym for searching closest neighbor
by THRAK (Monk) on Apr 03, 2001 at 16:28 UTC
    I don't necessarily see this as a better idea, but possibly an alternative. Do the 5 digit search first and if not found then build a 5 digit list of first 3 digit matches. Why? Divide an conqueor. Then parse the 3 digit match list for 4 and 2 digit matches at the same time. When you get to the end, you now have a list of 2, 3 and 4 digit matches (if there were any). Another approach would be to do this sort of thing in a single pass and match 2 - 5 digits and find the closest match from those. It's a trade off between speed (going to take longer to do all the possibilities at once) versus the odds that someone will enter number that has zero or few (1-2) beginning matches. I'm sure somebody else will have a better answer for you, so take this for what you will.

(tye)Re: Algorythym for searching closest neighbor
by tye (Sage) on Apr 03, 2001 at 20:00 UTC

    Well, if you really want closest as in "the fewest miles away", then you need to store some X,Y coordinates for every zip code (not just the ones currently in your database). Zip codes are organized such that ones that are numerically close are also fairly close physically but it is impossible to organize them such that those that are physically close are also always numerically close.

    Anyway, once you have X,Y coordinates, efficiently finding the closest match can be tricky. One trick I developed long ago (which I'm curious if others have used since I've never run into it elsewhere) is to build a B-Tree key on int($X/$W),$Y (for example, on pack("NN",int($X/$W),$Y) or sprintf("%08d.%08d",$X/$W,$Y)) where $W is a "width" that is a little more than the distance you expect nearest matches to be from each other.

    This sorts your data into bands of width $W. Then finding the closest match can be done with a few very fast queries. The exact algorythm is rather complex (if you really want to maximize the speed and if you can't guarantee a reasonable upper bound on the distance to the nearest match) but if you have a potentially huge number of points to search, this can be a big win.

            - tye (but my friends call me "Tye")
      Hey tye,

      You know I -love- to nitpick, discuss and generally rant on alogrithms (which never looks spelt correctly). So don't take anything below as an attack on you :) (which I know you aren't, I'm just saving myself from --'s! lol).

      Anyway the nearest neighbor problem can be an interesting one! There are two methods I'd like to mention for their various merits.

      The first is to use a Voronoi diagram. Voronoi diagrams just creates a list of connected verti that represent the furthest distance between two points. The simplest way to implement the construction of one of these, is to start with two items and draw a perpendicular line at the midpoint of the line they create. Now add an additional point randomly and follow the same method.

      The advantage here, is construction takes about n*log n time (though with random, you don't get nearly as good of performance as some other methods available, which are still n*log n though). Once constructed, the Voronoi diagram gives you lots of fun info. To determine the nearest neighbor, you simply have to find which 'cell' (enclosing vertex lines) you're in. To determine a point of maximal distance, look at the verti themselves.

      The other method is fairly simple. It's like your band-width method in approach. You keep the points in sorted order. You then find the closest two points left and right of the target. Find the distance between these two points and the target. Take the smaller and construct a square bounding box around the target of width twice that distance. Now linearly search through all points within that space. Quick, simple, and it works well with most data. (btw keeping points in sorted order makes for the fast lookup).

      I don't like the band-width method because it's just as complicated as the above method to get the correct answer (you don't get the correct answer just looking in that one band, you have to look in several most of the time), and not much faster except under weird circumstances. That is, spliting into band-widths helps decrease some of the search time, but it is also a more crude search since it isn't as accurate, and so you waste time removing the cruft returned (or blinding processing extra points).

      Welp, I'm ranting, tired and allergic.

      Gryn :)

        Thanks. I mentioned it because I wanted feedback and I really appreciate yours.

        Note that I was forced to invent the banding method single-handedly about 12 years ago at the job I had at that time. It was chosen to work well with a relational database. I was quite happy with it.

        I was going to describe my approach in more detail but found that new details just kept cropping up so I just glossed over it. For example, you have to deal with the case when there are no other points in your band. You can also choose to use a bounding circle rather than square in order to trade a few floating point calculations (and more complex code) in hopes of having to read fewer records. My coordinates were integers so I used a bounding square but I I adjusted it as I went (whenever I found a new closest point). And it takes some careful factoring of the problem to make the code for this elegant.

        Anyway, I didn't understand your brief explanation of Voronoi so I hit and came up with a Java applet that draws Voronoi diagrams. That made the basic idea quite clear rather quickly. I'm still pretty fuzzy on the details of the algorithm, but I didn't have the time to study it. (:

        It looks like Voronoi would present quite a challange for a relational database. The added complexity for the banding method is "add one index" (for Voronoi I'd think you'd probably need three new tables). The update method for the banding method is "insert one record" (for Voronoi you'd have to update all surrounding polygons).

        Thanks for the info on Voronoi diagrams. I'll certainly remember that and will dig into them more in the future.

        Another aspect of my problem that I solved with the banding method (since at that point I had the code for it written and working) was to find what "zone" a point fell in. Divide a 2-D space into polygons (each called a "zone") and then, given a point, find which zone it belongs to. Again, to make things easy to implement in a relational database, I broke each polygon into rectangles and triangles with all but the hypotenuses being either horizontal or veritcal. Then I built an index on int(log(max(width,height))).int(x/w).y and used the banding method iterating over int(log(size)) until I found a shape that contained the point.

        I wasn't as happy with this approach but it work pretty well. Actually, the one table contained multiple pavings for different zone "types" and the algorithm would search for the matching zone of each of several types at once.

        But I don't have a point; just "stream of consciousness" on the problem. (:

                - tye (but my friends call me "Tye")
Re: Algorythm for searching closest neighbor
by one4k4 (Hermit) on Apr 03, 2001 at 16:27 UTC
    Well, if you're using a DB (hopefully SQL, at least for this code) you could use a LIKE statement in there.

    SELECT FROM zipdb zdb WHERE LIKE '%@query%'
    I belive this may help. The %'s round the @query say match this in the beginning, middle, or end. I could be wrong, so look it up. (sybase is what I use)
    Just my .02

    _14k4 - (
      Well your select will find any number which contains your @query. So you will find
      @query=54321 654321 or 543210 but you will not find 54320 for example
      but when it's only numerical zip maybe the following will work:
      select min(abs(zip-@query)) from zipcodetable
      ----------------------------------- --the good, the bad and the physi-- -----------------------------------
Re: Algorythym for searching closest neighbor
by dws (Chancellor) on Apr 04, 2001 at 02:38 UTC
    Some commercial zipcode databases include longitude and lattitude. If you get your hands on one of these, you'll want to preprocess it build a short list of closest zipcodes for each zipcode.

    For what (little) research I did on this a few years back, is seems that using a partial match on the zipcode is a good, but not perfect, heuristic for finding close zipcodes. Were I to do this now, I'd use longitude and lattitude to map zipcodes to "hash" zipcodes into a grid that holds an array of zipcode entries. Then you can limit the search to that grid cell plus nearby adjoining grid cells.

Re: Algorythm for searching closest neighbor
by gryng (Hermit) on Apr 04, 2001 at 02:31 UTC
    I'm not sure what you want -- closest digits, or closest location? But for closest digits, in a left right precedence, all you need is a sorted list of zip codes. Then you do a binary search for the number. If it succeeds, good. If not, then where it ends will either be the closest number or else the one above or below it will be.

    If you need closest location then that can be done quickly too, but you need the locations first.


Log In?

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://69272]
Approved by root
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others chanting in the Monastery: (2)
As of 2021-10-27 01:51 GMT
Find Nodes?
    Voting Booth?
    My first memorable Perl project was:

    Results (91 votes). Check out past polls.