Re: Speeding up point-in-polygon -- take two
by merlyn (Sage) on Aug 28, 2006 at 05:17 UTC
|
Use the right tool for the right job. Postgres contains a built-in function that computes whether a geometric object contains another geometric object. Put your data into a Postgres database, and ask the right questions, and you'll be doing it as fast as possible.
| [reply] |
|
>Postgres contains a built-in function that
>computes whether a geometric object contains
>another geometric object.
PostGres/PostGIS can't be used because of project constraints (no complex db servers, etc.). Nevertheless, been there, done that. To really use the PostGIS functions (I am on that mailing list), I have to take account of other issues such as correct projections. That makes the task more complex than I need it to be. And it is not an easily portable solution.
In any case, my testing shows that speed is comparable to what I have above.
--
when small people start casting long shadows, it is time to go to bed
| [reply] |
Re: Speeding up ... (O(N) determination of point in polygon regardless of complexity)
by BrowserUk (Patriarch) on Aug 28, 2006 at 20:05 UTC
|
In Speeding up point-in-polygon -- take two, punkish was looking for ways of speeding up point in polygon calculations. Various ordering and searching methods come to mind which improve the performance by preselecting the polygons against which each point is compared. Combined with performing the math in C, these techniques can reasonably be expected to improve the performance by up to an order of magnitude.
However, I learned long ago that the quickest way to do something is to avoid doing it at all. And in the case of computer math, if you can avoid doing any math at all, you'll definitely speed things up. A common technique for avoiding math, especially time-expensive math, is to trade memory for speed and use a lookup table. E.g. a lookup table of logarithms, as those of us who used log tables at school will remember, is much faster than evaluating the polynomial expression required to calculate them.
For the point in polygon problem of determining which zip-code a particular point falls in, the simplest way of doing this is to construct a 2-dimensional lookup table of the complex polygons representing each zip-code area, storing the zip-code (or a numeric value that represents it) at each pair of coordinates it covers. Determining the zip-code for a point then becomes the simple task of retrieving the value indexed by the (X,Y) pair of the point in question.
My solution to this problem recognises that an image-file in which each of the polygons has been plotted, each using a unique colour, is simply a large-scale map. It makes the lookup a simple process of loading the image and doing a getPixel( x, y ) of the relevant point. The color (index) returned is a direct representation of the zip-code at that point on the map.
The nice thing about this process is that it does not matter how complex your polygons are, the lookup is still O(n) in the number of points to be looked up. Obviously, for a map the size of the US, some scaling is required to achieve the required accuracy.
Note: Given accurate enough image plotting, this even caters for the case of a single street (red line in the purple area), having a different zip-code to the area surrounding it on all sides.
Accuracy & scale
Depending upon the required accuracy, a single image file sufficient to cover the entire USA may be impractical to load into memory. In this case, it may be necessary to split the map into 4 or 16 or 64 or some other convenient number of smaller images. You would then pre-order the points so as to minimise the need to switch images. I.e. you load the first image and only look up points that fall within its boundaries before loading the second image.
My proof-of-concept code consisted of looking up 5.25e6 randomly generated points in a single 5000x5000 image containing 2.5e5 polygons. It did this in around 35 seconds on my machine, which makes it 50,000 times faster than the OP's method. Even if 256 separate images have to be loaded, it will still be at least two orders of magnitude faster.
For ease of generation, the 'polygons' I used were regularly spaced, equal-sized circles, but as noted earlier, once plotted, neither the number nor the complexity of the polygons has any effect upon the performance of the method.
Proof-of-concept code
Image generator (polygon plotter)
I've incremented the color used for each polygon by 64 (256**3 / 250,000 = 67.1) in an attempt to produce a visually pleasing map, but in reality the map does not need to be visually pleasing, so a standard by-1 increment could be used, which would allow a 24-bit image to handle up to 16 million polygons with no effect on performance.
#! perl -slw
use strict;
use GD;

my $map = GD::Image->new( 5000, 5000, 1 ) or die $!;
my $index = 0;
my $bg = $map->colorAllocate( 0, 0, 0 );
$map->filledRectangle( 0, 0, 4999, 4999, $bg );

for my $y ( map{ $_ * 10 + 5 } 0 .. 499 ) {
    for my $x ( map{ $_ * 10 + 5 } 0 .. 499 ) {
        my $color = $map->colorAllocate( unpack 'xCCC', pack 'N', $index += 64 );
        # print $color;
        $map->filledEllipse( $x, $y, 10, 10, $color );
    }
}

open IMG, '>:raw', '569929.png' or die $!;
print IMG $map->png;
close IMG;
Random point generator

#! perl -slw
use strict;

open POINTS, '>', '569929.points' or die $!;
printf POINTS "%d:%d\n", int( rand 5000 ), int( rand 5000 ) for 1 .. 5.25e6;
close POINTS;
Lookup process benchmark
#! perl -slw
use strict;
use Benchmark::Timer;
use GD;

my $map = GD::Image->newFromPng( '569929.png', 1 );
open POINTS, '<', '569929.points' or die $!;

my $T = new Benchmark::Timer;
$T->start( '5.25e6 points in 2.5e5 polys' );
while( <POINTS> ) {
    chomp;
    my( $x, $y ) = split ':';
    my $poly = $map->getPixel( $x, $y );
    # printf "Point ( %4.4d, %4.4d ) found in poly: %d\n", $x, $y, $poly / 64;
}
$T->stop( '5.25e6 points in 2.5e5 polys' );
$T->report;

close POINTS;
Timings of several runs

c:\test>569929-b
1 trial of 5.25e6 points in 2.5e5 polys (33.094s total)
c:\test>569929-b
1 trial of 5.25e6 points in 2.5e5 polys (36.656s total)
c:\test>569929-b
1 trial of 5.25e6 points in 2.5e5 polys (33.250s total)
c:\test>569929-b
1 trial of 5.25e6 points in 2.5e5 polys (35.094s total)
c:\test>569929-b
1 trial of 5.25e6 points in 2.5e5 polys (33.734s total)
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] [select] |
|
I have been thinking about BrowserUk's proposed solution, and have concluded this to be one of the most innovative strategies I have seen in a long time. Using a png essentially as a lookup table... even though I haven't tried the solution yet, I have to say "wow! Bravo!"
Well, I have been thinking about it, and there are a few quibbles, and possible mods --
As noted in the solution, a single image to cover the entire country would simply be too big, and would have to be chopped up. That creates a problem of juggling images in memory, as there is really no easy way to sort the points and process them in image-based batches.
The pixels in the image can be mapped to the smallest unit of the coordinate, but fractional units for a point like 1.24435, 455.22452 get all fudged up.
If I could figure out how to generate all the pixel coords for a polygon, I wouldn't even need GD. I could simply store each pixel (something akin to an INTEGER UNIQUE NOT NULL would work just fine) and its associated attribute in a Berkeley DB hash. The advantage of BDB would be that I wouldn't have to worry about image file sizes and chunking up images as in the GD approach.
So, does anyone know of a way to generate all the coord pairs for a given polygon and a resolution?
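One way to do this without GD is a classic scanline fill: for each pixel row inside the polygon's bounding box, intersect the row with every edge, sort the crossings, and emit the cells between successive pairs. A sketch, assuming the polygon arrives as a list of [x, y] vertex pairs and $res is the size of one cell in coordinate units (both names are mine, not from any module):

```perl
#! perl -slw
use strict;
use POSIX qw( ceil floor );

# Scanline fill: return an arrayref of all integer (x, y) cells whose
# centres fall inside the polygon. Sampling each row at y + 0.5 dodges
# the awkward case of a scanline passing exactly through a vertex.
sub polygon_pixels {
    my( $poly, $res ) = @_;
    my @v = map { [ $_->[0] / $res, $_->[1] / $res ] } @$poly;

    my( $miny, $maxy ) = ( $v[0][1], $v[0][1] );
    for ( @v ) {
        $miny = $_->[1] if $_->[1] < $miny;
        $maxy = $_->[1] if $_->[1] > $maxy;
    }

    my @pixels;
    for my $row ( floor( $miny ) .. ceil( $maxy ) ) {
        my $y = $row + 0.5;
        my @cross;
        for my $i ( 0 .. $#v ) {
            my( $a, $b ) = ( $v[$i], $v[ ($i+1) % @v ] );
            # half-open test: an edge spanning this row crosses it once
            next unless ( $a->[1] <= $y ) != ( $b->[1] <= $y );
            push @cross, $a->[0] + ( $y - $a->[1] )
                * ( $b->[0] - $a->[0] ) / ( $b->[1] - $a->[1] );
        }
        @cross = sort { $a <=> $b } @cross;
        while( @cross >= 2 ) {
            my( $x0, $x1 ) = splice @cross, 0, 2;
            push @pixels, map { [ $_, $row ] }
                ceil( $x0 - 0.5 ) .. floor( $x1 - 0.5 );
        }
    }
    return \@pixels;
}
```

Each (x, y) pair plus its zip attribute could then go straight into the BDB hash, with no image files at all.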
By the way, I tested my approach posted in the OP that started this thread. Not including the data prep, it took me about 3.25 hours to update 5.25 million points against 210k polys. Nevertheless, compliments to BrowserUk for a truly innovative solution.
--
when small people start casting long shadows, it is time to go to bed
| [reply] |
|
as there is really no easy way to sort the points and process them in image-based batches.
Assuming your zipcode polygons are obtained from somewhere like here, each of the .e00 files contains the polygons for the zipcodes in one state. By building one image per state you will have 48/50 images and 48/50 bounding boxes. Pre-selecting or ordering the points by those bounding boxes (much as you currently do using the bounding boxes of the individual polys) is simple, and as there are many fewer states than zips, much quicker.
If the points originate in a DB, then something like
select x,y from points
where x >= state.minx and x <= state.maxx
and y >= state.miny and y <= state.maxy;
would allow you to load the state images one at a time and select only the points (roughly) within state for lookup.
The pixels in the image can be mapped to the smallest unit of the coordinate, but fractional units for a point like 1.24435, 455.22452 get all fudged up.
Taking Texas as an example (guessing that it is the largest state from looking at a map), width covers 13 degrees of Longitude and 773 miles. That means that each arc minute represents ~1 mile and each arc second ~88 feet. In decimal degrees, each 0.0001 of a degree represents ~32 feet. (Check my math, it's getting late!)
So, if you represent the biggest state by an image 13000x11000 pixels, and multiply all the coordinates in your polys by 10,000 and truncate, each pixel will represent ~32 x 32 feet. The image takes under 600MB when loaded in memory. If you have fully populated hardware, you could perhaps increase your resolution by a factor of 10 if need be. Remember that the majority of states are considerably smaller.
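Since the node above invites a math check, a few lines of Perl confirm the estimates, using the figures as stated (773 miles across 13 degrees of longitude):

```perl
use strict;
use warnings;

# Sanity-check the scale arithmetic quoted above for Texas.
my $miles_per_degree = 773 / 13;                           # ~59.5 miles
my $miles_per_arcmin = $miles_per_degree / 60;             # ~0.99: about a mile
my $feet_per_arcsec  = $miles_per_arcmin * 5280 / 60;      # ~87: close to the ~88ft claimed
my $feet_per_0_0001  = $miles_per_degree * 5280 * 0.0001;  # ~31: close to the ~32ft claimed

printf "%.1f mi/deg, %.2f mi/arcmin, %.1f ft/arcsec, %.1f ft per 0.0001 deg\n",
    $miles_per_degree, $miles_per_arcmin, $feet_per_arcsec, $feet_per_0_0001;
```

So the math above holds up: roughly a mile per arc minute, and each 0.0001 degree is just over 31 feet.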
There will obviously be some overlap between bounding boxes, which means some points will be looked up in 2 or 3 maps, but it is convenient that, in large part, most state boundaries follow lines of longitude and latitude.
There is even a module Geo::E00 for processing these files :)
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] |
|
Converting a polygon to pixels is something that graphics libraries are highly optimized at. So I'd go with the graphics library approach.
It may be as simple as drawing a border around each polygon in white. If your lookup finds a white pixel, then look at nearby pixels to find the colors of polygons to test the 'long' way. Then you don't have to have a really high-resolution image.
If you've got very thin strips, then that may require special handling. Assigning one more color to each polygon that has such should cover it, though (that color means try that polygon but also try polygons represented by the color of nearby pixels).
An alternate approach would simply be to draw the polygons without borders (being sure to draw any really thin strips last) and always look at the 9 closest pixels instead of just 1 pixel.
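A sketch of that closest-pixels lookup, written against a plain coderef so it isn't tied to GD ($getpixel would be sub { $map->getPixel( @_ ) } in the GD case; the function name and the reserved-border-colour convention here are mine):

```perl
use strict;
use warnings;

# Look at the 3x3 block around ( $x, $y ) and return the most frequent
# colour, skipping the reserved border colour if one is given. The
# winner is only a candidate polygon: as described above, it still gets
# verified the 'long' way when a border pixel was hit.
sub candidate_poly {
    my( $getpixel, $x, $y, $border ) = @_;
    my %count;
    for my $dy ( -1 .. 1 ) {
        for my $dx ( -1 .. 1 ) {
            my $c = $getpixel->( $x + $dx, $y + $dy );
            next if defined $border and $c == $border;
            $count{ $c }++;
        }
    }
    my( $best ) = sort { $count{$b} <=> $count{$a} } keys %count;
    return $best;
}
```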
| [reply] |
Re: Speeding up point-in-polygon -- take two
by roboticus (Chancellor) on Aug 28, 2006 at 14:20 UTC
|
If this is a task you're going to do repeatedly, and your polygon list is relatively static, then you should consider redefining the problem. I suspect that the loop and indexing operations may be consuming a considerable amount of your time. So maybe you can change the problem to unroll the loops.
For example, instead of storing the polygons as they're defined, break the polygons into horizontal strips, with P[0]=lower-left, P[1]=upper-left, P[2]=upper-right, and P[3]=lower-right. In the case that the top or bottom edge is a point, you'll have the same thing, where P[1]==P[2] or P[3]==P[0], respectively. Then you'll insert these simpler polygons (quadrilaterals, nominally) into the database, so you can use a simpler test. Thus, we break this polygon:
*--------------------*
\ A /
* *-----* /
/ \ \ /
/ \ \ /
*--------* \ /
*
into these three:
*--------------------*
\ A /
*--*-----*-------*
/ \ \ A(3)/
/ A(2) \ \ /
*--------* \ /
*
Since we have quadrilaterals (and triangles, sort-of), we can simplify your subroutine into something resembling:
sub _pointIsInQuadrilateral {
    my ($a_point, $a_x, $a_y) = @_;
    # point coords
    my ($x, $y) = ($a_point->[0], $a_point->[1]);
    # poly coords: P[0]=lower-left, P[1]=upper-left,
    # P[2]=upper-right, P[3]=lower-right
    my @x = @$a_x;    # x-coordinates of the four vertices
    my @y = @$a_y;    # y-coordinates of the four vertices
    if ($x < ($x[1]-$x[0]) * ($y-$y[0]) / ($y[1]-$y[0]) + $x[0]) {
        # $x is left of the leftmost edge, so it's not in polygon
        return 0;
    }
    if ($x > ($x[3]-$x[2]) * ($y-$y[2]) / ($y[3]-$y[2]) + $x[2]) {
        # $x is right of the rightmost edge, so it's not in polygon
        return 0;
    }
    # Point is inside quadrilateral!
    return 1;
}
As you can see, by redefining the problem and using quadrilaterals with parallel top and bottom edges, we get the following benefits:
We omit one of the parameters ($n)
We eliminate the Y test altogether
We also unroll the X coordinate tests to a simple pair of left-right tests
Your polygon structure in the database can be simpler (see also note below).
So we're simplifying the code by increasing our polygon count. Even with the increased number of shapes in your table, I'm guessing that the select statement will probably be about the same speed. Yes, there'll be more polygons, but they'll have smaller bounding boxes, so there'll be fewer false hits. I also think (without testing!) that _pointIsInQuadrilateral should be considerably faster.
This is only one way you could redefine the problem. After studying, you might find a different way to redefine the problem to speed things up. If you try this approach, please let me know how it turns out!
Note: One other advantage is that with a fixed number of points per polygon, you can store the point coordinates in table directly as numbers, rather than having a list of coordinates in text. Omitting the parsing of text to numbers may provide another speedup....
--roboticus | [reply] [d/l] [select] |
|
Another thought--
If the aforementioned suggestion helps, then it may be possible to improve performance a bit more by adding even more polygons: You might break your polygons into rectangles, left triangles and right triangles, where a rectangle is 'definitely inside' a polygon, and the triangles are pasted to the left and right side of your set of rectangles to build the polygons you desire.
With the rectangles, you can avoid the subroutine call altogether. If your select statement returns one, you have the answer "Yep, it's in polygon A". If it's a triangle, then you only have one edge to check. (For a left triangle, your point is inside the polygon if $x is to the right of the edge; and for a right triangle, your point is inside if $x is to the left of the edge.)
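The one-edge triangle test might look like this sketch ($p and $q are the endpoints of the slanted edge, and "left triangle" follows the naming above; the function name is mine):

```perl
use strict;
use warnings;

# For a 'left triangle', the point is inside iff $x lies on or to the
# right of the slanted edge at height $y. The rectangle case needs no
# subroutine at all: the SELECT's bounding-box comparison is the test.
sub in_left_triangle {
    my( $x, $y, $p, $q ) = @_;
    my $edge_x = $p->[0] + ( $y - $p->[1] )
               * ( $q->[0] - $p->[0] ) / ( $q->[1] - $p->[1] );
    return $x >= $edge_x ? 1 : 0;
}
```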
By chewing on the problem, I'm certain you'll come up with a few other things to try...
--roboticus
| [reply] |
|
Okay--Yet another thought (and then I promise I'll shut up for a while!).
How important are exact results? If somewhat "fuzzy" edges are acceptable, you could use a quadtree, and just insert enough nodes to get the precision you want. Then you'd only need a select statement to determine whether your point is internal to the polygon.
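A lookup against such a quadtree is just a descent from the root, something like this sketch (the node layout, split points, and labels here are invented for illustration, not taken from any module):

```perl
use strict;
use warnings;

# Leaves carry the answer; internal nodes carry a split point and four
# children ordered [ SW, SE, NW, NE ]. Lookup cost is O(depth), with no
# math against the polygons at all.
sub qt_lookup {
    my( $node, $x, $y ) = @_;
    while( exists $node->{kids} ) {
        my( $mx, $my ) = @{ $node->{split} };
        my $i = ( $x >= $mx ? 1 : 0 ) + ( $y >= $my ? 2 : 0 );
        $node = $node->{kids}[ $i ];
    }
    return $node->{label};
}
```

The "fuzziness" lives entirely in how deep you subdivide mixed nodes before declaring them a leaf.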
--roboticus
Who intends to stop thinking about this question for at least an hour!
| [reply] |
|
Total Elapsed Time = 251.1784 Seconds
User+System Time = 189.3034 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
72.3 136.9 136.91 220726 0.0006 0.0006 main::_pointIsInPolygon
9.09 17.21 17.216 5001 0.0034 0.0034 DBI::st::fetchall_arrayref
6.25 11.82 11.825 13887 0.0009 0.0009 DBI::st::execute
4.84 9.162 178.99 1 9.1622 178.99 main::updatePoints
1.82 3.442 3.442 3885 0.0009 0.0009 DBI::db::commit
0.16 0.298 0.298 5001 0.0001 0.0001 DBI::st::fetchrow_array
0.07 0.141 0.141 3887 0 0 DBI::st::finish
0.01 0.01 0.01 3 0.0033 0.0033 IO::Handle::new
0 0 0.01 5 0 0.002 Geo::ShapeFile::get_bytes
0 0 0 6 0 0 DBI::_new_handle
0 0 0.01 5 0 0.002 Geo::ShapeFile::get_handle
0 0 0 1 0 0 DBI::connect
0 0 0 4 0 0 DBI::db::prepare
0 0 0 4 0 0 DBI::_new_sth
0 0 0 3 0 0 Geo::ShapeFile::dbf_handle
What I really have to figure out is to reduce those tests. I have an idea which I have to now figure out how to implement. Basically, it is like so --
Once I get a bunch of potential points within the rect of a poly, I should find the 4 outermost points in each dimension. Once I determine those four definitively, all points within those 4 would also be within that poly, and I wouldn't have to do the checks for them.
Now the hard part -- realize the above hypothesis.
--
when small people start casting long shadows, it is time to go to bed
| [reply] |
|
_________
| . ./
| /
| /x
| \
| \
| . .\
---------
If all of your polygons were guaranteed to have internal angles > 90 degrees that might work, but you only said that they were "irregular". I had a similar idea yesterday (except mine was finding the maximum internal bounding rectangle for the polygon rather than using the query points themselves), and discarded it for this reason. :-)
| [reply] [d/l] |
|
|
punkish:
What I really have to figure out is to reduce those tests.
That's why I made my original suggestion. There are two ways that breaking the polygons into simple shapes will help you:
- With known shapes, your _pointIsInPolygon function is simpler, so you can use quick exits and/or the compiler can optimize it, and
- you can increase your density of polygon vs. bounding boxes.
In your original _pointIsInPolygon, you use variable indices into your vertex arrays, as well as adding a loop with its inherent termination condition. By breaking your polygons into triangles or quadrilaterals, you can use fixed indices and omit the loop. That may significantly increase the subroutine's speed. Also, your algorithm requires that you examine every edge in the polygon. By enforcing an order to the edges, you may be able to return early from the routine by taking advantage of that knowledge.
Regarding the 'density of polygon vs. bounding boxes': A rectangle aligned with your X and Y axes has a 100% chance of containing a point returned by your SELECT statement. An arbitrary triangle has a 50% chance of containing a point in the bounding rectangle. So breaking your polygons into triangles and (aligned) rectangles will have a worst case coverage of 50%. And a higher coverage density directly correlates to a lower false-positive rate. (I.e., with a higher coverage, you'll need fewer tests because your SELECT will hit fewer bounding boxes.) With arbitrary polygons, you can have nearly *any* hit rate:
*--------------------* *--------------------* *--------------------*
| | \ / |*------------------*|
| | \ / || ||
| ** \ / || ||
*-------------------* *-------------* ** **
        ~99%                   ~90%                  ~32%
While I still think my original suggestion would be interesting, you state:
Yes, the polys are relatively static, although the points are not so. Nevertheless, I do indeed unroll both polys and points, and move them into the SQLite db. That way I avoid going back to the files repeatedly.
By this, I'm assuming you mean that the points are in the same rough relative position in the polygons, meaning that (1) the points are connected in the same order, and (2) the edges will never cross. For example, the shape on the left could morph into the shape in the center, but not the one on the right:
a----g g g
/ | / \ / \
/ | / \ / \
b | == / \ != / \
\ e---f / \ / \
\ \ a-b f a c e
c---d / / \/| /
c---d / /\| /
\ / / b /
e d-----f
In that case, breaking the polygons into X-aligned trapezoids (my original suggestion) may be a bit too restrictive. Perhaps using arbitrary quadrilaterals (or triangles) would give you enough coverage density; i.e. just breaking the shape into quadrilaterals instead of trapezoids may simplify things enough. You'd increase your coverage, and by breaking the polygons into triangles, you could still simplify your _pointIsInPolygon subroutine to eliminate the loops and variable indices.
--roboticus | [reply] [d/l] [select] |
Re: Speeding up point-in-polygon -- take two (Under 40 seconds!)
by BrowserUk (Patriarch) on Aug 28, 2006 at 13:22 UTC
|
The following shows the output from several runs of a benchmark looking up 5.25 million randomly generated points in a coordinate space of 5000x5000 that contains 1/4 million polygons:
c:\test>569929-b
1 trial of 5.25e6 points in 2.5e5 polys (33.094s total)
c:\test>569929-b
1 trial of 5.25e6 points in 2.5e5 polys (36.656s total)
c:\test>569929-b
1 trial of 5.25e6 points in 2.5e5 polys (33.250s total)
c:\test>569929-b
1 trial of 5.25e6 points in 2.5e5 polys (35.094s total)
The program runs in under 40 seconds and requires ~100MB. No database.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] |
|
> The following shows the output from several runs
> of a benchmark looking up 5.25 million randomly
> generated points in a coordinate space of 5000x5000
> that contains 1/4 million polygons:
Ya but, are your polys regular? If so, that changes the whole game. It is because my polys are irregular that I have to do a definitive test of every point against all the vertices of every poly.
Please post your code as it may really give me some great ideas.
--
when small people start casting long shadows, it is time to go to bed
| [reply] |
|
Ya but, are your polys regular?
Using my method, this doesn't affect the performance.
What does affect it is the size of the coordinate space and the minimum granularity of points.
Update: I guess someone doesn't believe me ;)
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] |
|
Re: Speeding up point-in-polygon -- take two
by hardburn (Abbot) on Aug 28, 2006 at 15:04 UTC
|
Using a bounding box is a good start, but a bounding circle should be even better. Where the box needs four comparisons, a circle needs only one:
if( $circle->distance_from_center( $x, $y ) <= $circle->radius ) {
# Point ($x, $y) is in circle
}
I've also found that dropping some of the calculations into C can speed it up quite a bit without being terribly complex to interface back to Perl.
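One refinement worth noting: comparing squared distances avoids the square root inside the distance calculation entirely. A minimal sketch (the function name and argument layout are illustrative, not from any module):

```perl
use strict;
use warnings;

# Bounding-circle pre-test with no sqrt: a point is inside the circle
# centred at ( $cx, $cy ) with radius $r iff its squared distance from
# the centre is at most $r squared.
sub maybe_in_circle {
    my( $x, $y, $cx, $cy, $r ) = @_;
    my( $dx, $dy ) = ( $x - $cx, $y - $cy );
    return ( $dx * $dx + $dy * $dy ) <= $r * $r ? 1 : 0;
}
```

As with the bounding box, a hit only means the point might be in the polygon; a miss definitively rules it out.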
"There is no shame in being self-taught, only in not trying to learn in the first place." -- Atrus, Myst: The Book of D'ni.
| [reply] [d/l] |
Re: Speeding up point-in-polygon -- take two
by BrowserUk (Patriarch) on Aug 28, 2006 at 11:49 UTC
|
Could you post an example of your polygons; and state the bounds for the overall coordinate space please.
I believe that there is a way to do this one or even two orders of magnitude faster.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] |
|
> Could you post an example of your polygons; and
> state the bounds for the overall coordinate space
> please.
Think of zipcodes for the entire US... in lambert conformal projection, if that helps (although that doesn't matter, because both points and polys are in the same proj).
--
when small people start casting long shadows, it is time to go to bed
| [reply] |
|
I guess that I could go off, find a map of the US and work out the maximum coordinate space. I could then try and find a source of zipcode polygons somewhere on the web and so determine the precision with which they are marked.
But presumably you already know what you are dealing with?
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] |
Re: Speeding up point-in-polygon -- take two
by explorer (Chaplain) on Aug 28, 2006 at 16:35 UTC
|
if ( $y[$i] <= $y ) {
    if ( $y < $y[$j] ) {
        # process ...
    }
}
elsif ( $y >= $y[$j] ) {
    # the same process ...
}
Or a simple calculation:
if ( ( $y[$i] - $y ) * ( $y - $y[$j] ) > 0 ) {
    # process
}
Remember: Memoize is your friend. | [reply] [d/l] [select] |
Re: Speeding up point-in-polygon -- take two
by eric256 (Parson) on Aug 30, 2006 at 16:45 UTC
|
3d renderers do this all the time, and normally pretty darn fast. Just polygons + a raycast to determine which polygon you hit. Why not break the polygon down into triangles and then sort those with a b-tree? Isn't that still how 3d games handle the process? Alternately, just use a 3d engine, load all the polygons in, and raytrace ;) but that seems silly.
| [reply] |
|
Years later...
I don't think that the ray projection will work without a method to handle the case where the ray passes exactly through one of the vertices of the polygon.
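For the record, the standard even-odd ray cast handles the through-a-vertex case by treating each edge as half-open in y, so a vertex is counted by exactly one of the two edges meeting there. A sketch in pure Perl (the polygon is a list of [x, y] vertex pairs; the function name is mine):

```perl
use strict;
use warnings;

# Classic even-odd test: cast a horizontal ray leftward from ( $x, $y )
# and count edge crossings. The half-open comparison ( y1 <= y ) !=
# ( y2 <= y ) ensures a ray through a vertex toggles exactly once.
sub point_in_polygon {
    my( $x, $y, $poly ) = @_;
    my $inside = 0;
    my $n = @$poly;
    for my $i ( 0 .. $n - 1 ) {
        my( $x1, $y1 ) = @{ $poly->[ $i ] };
        my( $x2, $y2 ) = @{ $poly->[ ( $i + 1 ) % $n ] };
        next unless ( $y1 <= $y ) != ( $y2 <= $y );   # edge straddles the ray
        my $xint = $x1 + ( $y - $y1 ) * ( $x2 - $x1 ) / ( $y2 - $y1 );
        $inside = !$inside if $x < $xint;
    }
    return $inside ? 1 : 0;
}
```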
-
lartnec
| [reply] |