Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl-Sensitive Sunglasses
 
PerlMonks  

(tye)Re: Algorythym for searching closest neighbor

by tye (Sage)
on Apr 03, 2001 at 20:00 UTC ( [id://69346]=note: print w/replies, xml ) Need Help??


in reply to Algorythym for searching closest neighbor

Well, if you really want closest as in "the fewest miles away", then you need to store some X,Y coordinates for every zip code (not just the ones currently in your database). Zip codes are organized such that ones that are numerically close are also fairly close physically but it is impossible to organize them such that those that are physically close are also always numerically close.

Anyway, once you have X,Y coordinates, efficiently finding the closest match can be tricky. One trick I developed long ago (which I'm curious if others have used since I've never run into it elsewhere) is to build a B-Tree key on int($X/$W),$Y (for example, on pack("NN",int($X/$W),$Y) or sprintf("%08d.%08d",$X/$W,$Y)) where $W is a "width" that is a little more than the distance you expect nearest matches to be from each other.

This sorts your data into bands of width $W. Then finding the closest match can be done with a few very fast queries. The exact algorythm is rather complex (if you really want to maximize the speed and if you can't guarantee a reasonable upper bound on the distance to the nearest match) but if you have a potentially huge number of points to search, this can be a big win.

        - tye (but my friends call me "Tye")

Replies are listed 'Best First'.
Re: (tye)Re: Algorythym for searching closest neighbor
by gryng (Hermit) on Apr 04, 2001 at 08:10 UTC
    Hey tye,

    You know I -love- to nitpick, discuss and generally rant on alogrithms (which never looks spelt correctly). So don't take anything below as an attack on you :) (which I know you aren't, I'm just saving myself from --'s! lol).

    Anyway the nearest neighbor problem can be an interesting one! There are two methods I'd like to mention for their various merits.

    The first is to use a Voronoi diagram. Voronoi diagrams just creates a list of connected verti that represent the furthest distance between two points. The simplest way to implement the construction of one of these, is to start with two items and draw a perpendicular line at the midpoint of the line they create. Now add an additional point randomly and follow the same method.

    The advantage here, is construction takes about n*log n time (though with random, you don't get nearly as good of performance as some other methods available, which are still n*log n though). Once constructed, the Voronoi diagram gives you lots of fun info. To determine the nearest neighbor, you simply have to find which 'cell' (enclosing vertex lines) you're in. To determine a point of maximal distance, look at the verti themselves.

    The other method is fairly simple. It's like your band-width method in approach. You keep the points in sorted order. You then find the closest two points left and right of the target. Find the distance between these two points and the target. Take the smaller and construct a square bounding box around the target of width twice that distance. Now linearly search through all points within that space. Quick, simple, and it works well with most data. (btw keeping points in sorted order makes for the fast lookup).

    I don't like the band-width method because it's just as complicated as the above method to get the correct answer (you don't get the correct answer just looking in that one band, you have to look in several most of the time), and not much faster except under weird circumstances. That is, spliting into band-widths helps decrease some of the search time, but it is also a more crude search since it isn't as accurate, and so you waste time removing the cruft returned (or blinding processing extra points).

    Welp, I'm ranting, tired and allergic.

    Ciao,
    Gryn :)

      Thanks. I mentioned it because I wanted feedback and I really appreciate yours.

      Note that I was forced to invent the banding method single-handedly about 12 years ago at the job I had at that time. It was chosen to work well with a relational database. I was quite happy with it.

      I was going to describe my approach in more detail but found that new details just kept cropping up so I just glossed over it. For example, you have to deal with the case when there are no other points in your band. You can also choose to use a bounding circle rather than square in order to trade a few floating point calculations (and more complex code) in hopes of having to read fewer records. My coordinates were integers so I used a bounding square but I I adjusted it as I went (whenever I found a new closest point). And it takes some careful factoring of the problem to make the code for this elegant.

      Anyway, I didn't understand your brief explanation of Voronoi so I hit http://www.google.com/ and came up with a Java applet that draws Voronoi diagrams. That made the basic idea quite clear rather quickly. I'm still pretty fuzzy on the details of the algorithm, but I didn't have the time to study it. (:

      It looks like Voronoi would present quite a challange for a relational database. The added complexity for the banding method is "add one index" (for Voronoi I'd think you'd probably need three new tables). The update method for the banding method is "insert one record" (for Voronoi you'd have to update all surrounding polygons).

      Thanks for the info on Voronoi diagrams. I'll certainly remember that and will dig into them more in the future.

      Another aspect of my problem that I solved with the banding method (since at that point I had the code for it written and working) was to find what "zone" a point fell in. Divide a 2-D space into polygons (each called a "zone") and then, given a point, find which zone it belongs to. Again, to make things easy to implement in a relational database, I broke each polygon into rectangles and triangles with all but the hypotenuses being either horizontal or veritcal. Then I built an index on int(log(max(width,height))).int(x/w).y and used the banding method iterating over int(log(size)) until I found a shape that contained the point.

      I wasn't as happy with this approach but it work pretty well. Actually, the one table contained multiple pavings for different zone "types" and the algorithm would search for the matching zone of each of several types at once.

      But I don't have a point; just "stream of consciousness" on the problem. (:

              - tye (but my friends call me "Tye")
        Hey!

        Well I'm glad I helped out at least a little. I reread last nights post (I'm feeling -much- better now after 8 hours of sleep) and I agree, my description was mostly incomprehensible. I think I needed a picture or a phD last night to do better :) , sorry.

        I definitely agree, that a voronoi diagram isn't suited for a relational database. At least, you shouldn't put your code in SQL or some such pseudo-language (hmm, that -wasn't- flame bait, I'm having to do a website at work, for the second time, in SQL ...shudder...). But I think it's easy enough to put the voronoi data in the database, and then use something secondary (hmm, perl? :) ) to process it.

        Anyway the other method I described seems to be halfway between yours and the voronoi. But, your method seems -very- complicated. It seems that you -might- have been too worried about trimming tests than doing tests (I don't know for certain since I don't have the actual code in front of me :) ). With the simple method you do a -very- quick test, and you've limited the search space to, hopefully, much less than a hunderd or so points (probably just ten or so). If you want to guarantee even less points, do a up-down as well as a left-right initial search. Anyway, you do some quick tests, and you can limit the tests to a reasonable amount. Then you test, you can't be afraid of testing, or you might spend more time avoiding tests than what you save :) .

        Time for work!

        Ciao,
        Gryn

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://69346]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others taking refuge in the Monastery: (5)
As of 2024-03-28 22:10 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found