Re: Speeding up ... (O(N) determination of point in polygon regardless of complexity)

by BrowserUk (Patriarch)
on Aug 28, 2006 at 20:05 UTC ( [id://570038] )


in reply to Speeding up point-in-polygon -- take two

In Speeding up point-in-polygon -- take two, punkish was looking for ways of speeding up point-in-polygon calculations. Various ordering and searching methods come to mind that improve performance by preselecting the polygons against which each point is compared. Combined with performing the math in C, these techniques can reasonably be expected to improve performance by up to an order of magnitude.

However, I learned long ago that the quickest way to do something is to avoid doing it at all. And in the case of computer math, if you can avoid doing any math at all, you'll definitely speed things up. A common technique for avoiding math, especially time-expensive math, is to trade memory for speed and use a lookup table. E.g. a lookup table of logarithms, as those of us who used log tables at school will remember, is much faster than evaluating the polynomial expression required to calculate them.

For the point-in-polygon problem of determining which zip-code a particular point falls in, the simplest way of doing this is to construct a 2-dimensional lookup table over the complex polygons representing each zip-code area and store the zip-code (or a numeric value that represents the zip-code) in every cell covered by that polygon. Determining the zip-code for a point then becomes a simple task of retrieving the value indexed by the (X,Y) pair of the point in question.
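For illustration only, here is a minimal sketch of that grid idea in plain Perl; the grid size, the packed-string storage and the zip-code indices are assumptions, not part of the actual solution below:

#! perl -slw
use strict;

my( $width, $height ) = ( 1000, 1000 );          # grid cells covering the mapped area (assumed)

# One 16-bit slot per cell; each slot holds a small integer standing in for a zip-code.
my $grid = "\0" x ( $width * $height * 2 );

sub setCell {
    my( $x, $y, $zipIndex ) = @_;
    substr( $grid, ( $y * $width + $x ) * 2, 2 ) = pack 'n', $zipIndex;
}

sub lookup {
    my( $x, $y ) = @_;
    return unpack 'n', substr( $grid, ( $y * $width + $x ) * 2, 2 );
}

setCell( 123, 456, 789 );                        # "rasterise" one cell of one polygon
print lookup( 123, 456 );                        # prints 789; constant time per point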

My solution to this problem recognises that an image file in which each of the polygons has been plotted, each using a unique colour, is simply a large-scale map. It makes the lookup a simple process of loading the image and doing a getPixel( x, y ) at the relevant point. The colour (index) returned is a direct representation of the zip-code at that point on the map.

The nice thing about this process is that it does not matter how complex your polygons are: the lookup is still O(N) in the number of points to be looked up, i.e. constant time per point. Obviously, for a map the size of the US, some scaling is required to achieve the required accuracy.

Note: Given accurate enough image plotting, this even caters for the case of a single street (red line in the purple area), having a different zip-code to the area surrounding it on all sides.

Accuracy & scale

Depending upon the required accuracy, a single image file sufficient to cover the entire USA may be impractical to load into memory. In this case, it may be necessary to split the map into 4, 16, 64, or some other convenient number of smaller images. You would then pre-order the points so as to minimise the need to switch images. I.e. you load the first image and only look up points that fall within its boundaries before loading the second image.
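A sketch of one way to do that batching (the tile size, the tile file naming and the "x:y" point format are assumptions here):

#! perl -slw
use strict;
use GD;

my $tileSize = 5000;                             # pixels per tile edge (assumed)

# Bucket the points by the tile they fall in, so each sub-image is loaded exactly once.
my %batch;
while( my $line = <STDIN> ) {                    # points supplied as "x:y", one per line
    chomp $line;
    my( $x, $y ) = split ':', $line;
    my $tile = join ':', int( $x / $tileSize ), int( $y / $tileSize );
    push @{ $batch{ $tile } }, [ $x, $y ];
}

for my $tile ( sort keys %batch ) {
    my( $tx, $ty ) = split ':', $tile;
    my $map = GD::Image->newFromPng( "map-$tx-$ty.png", 1 )    # hypothetical tile naming
        or next;
    for my $point ( @{ $batch{ $tile } } ) {
        my( $x, $y ) = @$point;
        my $poly = $map->getPixel( $x - $tx * $tileSize, $y - $ty * $tileSize );
        # ... record $poly for ( $x, $y ) here ...
    }
}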

My proof-of-concept code consisted of looking up 5.25e6 randomly generated points in a single 5000x5000 image containing 2.5e5 polygons. It did this in around 35 seconds on my machine, which makes it 50,000 times faster than the OP's method. Even if 256 separate images have to be loaded, it will still be at least two orders of magnitude faster.

For ease of generation, the 'polygons' I used were regularly spaced, equal-sized circles, but as noted earlier, once plotted, neither the number nor the complexity of the polygons has any effect upon the performance of the method.

Proof-of-concept code

Image generator (polygon plotter)

I've incremented the colour used for each polygon by 64 (256**3 / 250,000 = 67.1) in an attempt to produce a visually pleasing map, but in reality the map does not need to be visually pleasing, so a standard by-1 increment could be used, which would allow a 24-bit image to handle up to 16 million polygons with no effect on performance.

#! perl -slw
use strict;
use GD;

my $map = GD::Image->new( 5000, 5000, 1 ) or die $!;

my $index = 0;
my $bg = $map->colorAllocate( 0, 0, 0 );
$map->filledRectangle( 0, 0, 4999, 4999, $bg );

for my $y ( map{ $_ * 10 + 5 } 0 .. 499 ) {
    for my $x ( map{ $_ * 10 + 5 } 0 .. 499 ) {
        my $color = $map->colorAllocate( unpack 'xCCC', pack 'N', $index += 64 );
#        print $color;
        $map->filledEllipse( $x, $y, 10, 10, $color );
    }
}

open IMG, '>:raw', '569929.png' or die $!;
print IMG $map->png;
close IMG;
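For reference, the reverse mapping: on a truecolor image, getPixel() returns the packed 24-bit value that colorAllocate() produced, so dividing by the step recovers the polygon number. A tiny sketch of that, not part of the plotter above:

my $index = 64 * 3;                                   # say, the third polygon
my( $r, $g, $b ) = unpack 'xCCC', pack 'N', $index;   # same packing as the plotter uses
my $packed = ( $r << 16 ) | ( $g << 8 ) | $b;         # what getPixel() returns for that pixel
print $packed / 64;                                   # prints 3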

Random point generator

#! perl -slw
use strict;

open POINTS, '>', '569929.points' or die $!;
printf POINTS "%d:%d\n", int( rand 5000 ), int( rand 5000 ) for 1 .. 5.25e6;
close POINTS;

Lookup process benchmark

#! perl -slw
use strict;
use Benchmark::Timer;
use GD;

my $map = GD::Image->newFromPng( '569929.png', 1 );

open POINTS, '<', '569929.points' or die $!;

my $T = new Benchmark::Timer;
$T->start( '5.25e6 points in 2.5e5 polys' );

while( <POINTS> ) {
    chomp;
    my( $x, $y ) = split ':';
    my $poly = $map->getPixel( $x, $y );
#    printf "Point ( %4.4d, %4.4d ) found in poly: %d\n", $x, $y, $poly / 64;
}

$T->stop( '5.25e6 points in 2.5e5 polys' );
$T->report;

close POINTS;

Timings of several runs

c:\test>569929-b
1 trial of 5.25e6 points in 2.5e5 polys  (33.094s total)

c:\test>569929-b
1 trial of 5.25e6 points in 2.5e5 polys  (36.656s total)

c:\test>569929-b
1 trial of 5.25e6 points in 2.5e5 polys  (33.250s total)

c:\test>569929-b
1 trial of 5.25e6 points in 2.5e5 polys  (35.094s total)

c:\test>569929-b
1 trial of 5.25e6 points in 2.5e5 polys  (33.734s total)

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re^2: Speeding up ... (O(N) determination of point in polygon regardless of complexity)
by punkish (Priest) on Aug 29, 2006 at 03:16 UTC
    I have been thinking about BrowserUk's proposed solution, and have concluded this to be one of the most innovative strategies I have seen in a long time. Using a PNG essentially as a lookup table... even though I haven't tried the solution yet, I have to say "wow! Bravo!"

    Well, I have been thinking about it, and there are a few quibbles, and possible mods --

    As noted in the solution, a single image to cover the entire country would be simply too big, and would have to be chopped up. That creates the problem of juggling images in memory, as there is really no easy way to sort the points and process them in image-based batches.

    The pixels in the image can be mapped to the smallest unit of the coordinate, but fractional units for a point like 1.24435, 455.22452 get all fudged up.

    If I could figure out how to generate all the pixel coords for a polygon, I wouldn't even need GD. I could simply store each pixel (something akin to an INTEGER UNIQUE NOT NULL would work just fine) and its associated attribute in a Berkeley DB hash. The advantage of BDB would be that I wouldn't have to worry about image file sizes and chunking up images as in the GD approach.

    So, does anyone know of a way to generate all the coord pairs for a given polygon and a resolution?

    By the way, I tested my approach posted in the OP that started this thread. Not including the data prep, it took me about 3.25 hours to update 5.25 million points against 210k polys. Nevertheless, compliments to BrowserUk for a truly innovative solution.

    --

    when small people start casting long shadows, it is time to go to bed
      as there is really no easy way to sort the points and process them in image-based batches.

      Assuming your zipcode polygons are obtained from somewhere like here, each of the .e00 files contains the polygons for the zipcodes in each state. By building one image per state you will have 48/50 images and 48/50 bounding boxes. Pre-selecting or ordering the points by those bounding boxes (much as you currently do using the bounding box of the individual polys) is simple, and as there are many fewer states than zips, much quicker.

      If the points originate in a DB, then something like

      select x,y from points where x >= state.minx and x <= state.maxx and y >= state.miny and y <= state.maxy;

      would allow you to load the state images one at a time and select only the points (roughly) within that state for lookup.
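      Sketching that in Perl with DBI (the table and column names, and the one-image-per-state file naming, are assumptions; scaling from lon/lat to pixels is omitted):

      #! perl -slw
      use strict;
      use DBI;
      use GD;

      my $dbh = DBI->connect( 'dbi:SQLite:dbname=points.db', '', '', { RaiseError => 1 } );

      # One bounding box per state, from a hypothetical 'states' table.
      my $states = $dbh->selectall_arrayref(
          'select name, minx, maxx, miny, maxy from states', { Slice => {} } );

      for my $state ( @$states ) {
          my $map = GD::Image->newFromPng( "$state->{name}.png", 1 ) or next;   # one image per state
          my $sth = $dbh->prepare(
              'select x, y from points where x between ? and ? and y between ? and ?' );
          $sth->execute( @{ $state }{ qw( minx maxx miny maxy ) } );
          while( my( $x, $y ) = $sth->fetchrow_array ) {
              my $poly = $map->getPixel( $x, $y );
              # ... record $poly for ( $x, $y ) here ...
          }
      }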

      The pixels in the image can be mapped to the smallest unit of the coordinate, but fractional units for a point like 1.24435, 455.22452 get all fudged up.

      Taking Texas as an example (guessing that it is the largest state from looking at a map), its width covers 13 degrees of longitude and 773 miles. That means that each arc minute represents ~1 mile and each arc second ~88 feet. In decimal degrees, each 0.0001 of a degree represents ~32 feet. (Check my math, it's getting late!)

      So, if you represent the biggest state by an image of 13000x11000 pixels, and multiply all the coordinates in your polys by 10,000 and truncate, each pixel will represent ~32 ft x 32 ft. The image takes under 600MB when loaded in memory. If you have fully populated hardware, you could perhaps increase your resolution by a factor of 10 if need be. Remember that the majority of states are considerably smaller.
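      A quick check of that arithmetic (the figures are only as good as the 773 mile / 13 degree estimate):

      #! perl -slw
      use strict;

      my $milesPerDegree = 773 / 13;                                     # ~59.5 miles
      printf "arc minute: %.2f miles\n", $milesPerDegree / 60;           # ~0.99 miles
      printf "arc second: %.1f feet\n",  $milesPerDegree * 5280 / 3600;  # ~87 feet
      printf "0.0001 deg: %.1f feet\n",  $milesPerDegree * 5280 / 1e4;   # ~31.4 feet
      printf "image size: %d MB\n",      13_000 * 11_000 * 4 / 1e6;      # ~572 MB truecolor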

      There will obviously be some overlap between bounding boxes, which means some points will be looked up in 2 or 3 maps, but it is convenient that, in large part, most state boundaries follow lines of longitude and latitude.

      There is even a module Geo::E00 for processing these files :)


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

      Converting a polygon to pixels is something that graphics libraries are highly optimized at. So I'd go with the graphics library approach.
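      For example, GD will happily do the rasterising, so enumerating the pixel coordinates of a single polygon could look something like this sketch (the vertex list and the scratch-image size are made up):

      #! perl -slw
      use strict;
      use GD;

      my @vertices = ( [ 10, 10 ], [ 90, 20 ], [ 70, 80 ], [ 20, 60 ] );  # made-up polygon

      my $img  = GD::Image->new( 100, 100 );             # scratch image sized to the bounding box
      my $bg   = $img->colorAllocate( 255, 255, 255 );   # first allocated colour is the background
      my $fill = $img->colorAllocate( 0, 0, 0 );

      my $poly = GD::Polygon->new;
      $poly->addPt( @$_ ) for @vertices;
      $img->filledPolygon( $poly, $fill );

      my @coords;
      for my $y ( 0 .. 99 ) {
          for my $x ( 0 .. 99 ) {
              push @coords, [ $x, $y ] if $img->getPixel( $x, $y ) == $fill;
          }
      }
      print scalar @coords, ' pixel coordinates inside the polygon';     # each pair could become a BDB key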

      It may be as simple as drawing a border around each polygon in white. If your lookup finds a white pixel, then look at nearby pixels to find the colors of polygons to test the 'long' way. Then you don't have to have a really high-resolution image.

      If you've got very thin strips, then that may require special handling. Assigning one more color to each polygon that has such should cover it, though (that color means try that polygon but also try polygons represented by the color of nearby pixels).

      An alternate approach would simply be to draw the polygons without borders (being sure to draw any really thin strips last) and always look at the 9 closest pixels instead of just 1 pixel.
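      A sketch of that 9-pixel lookup (the fall-back exact test is left as a stub):

      # Collect the distinct polygon colours in the 3x3 neighbourhood of a point; one
      # candidate means the answer is certain, several mean each must be tested the 'long' way.
      sub candidatePolys {
          my( $map, $x, $y ) = @_;                       # $map is a truecolor GD::Image
          my %seen;
          for my $dy ( -1 .. 1 ) {
              for my $dx ( -1 .. 1 ) {
                  my( $px, $py ) = ( $x + $dx, $y + $dy );
                  next if $px < 0 or $py < 0
                       or $px >= $map->width or $py >= $map->height;
                  $seen{ $map->getPixel( $px, $py ) } = 1;
              }
          }
          return keys %seen;
      }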

      - tye        
