Perl Monk, Perl Meditation PerlMonks

Sorting hashes...

 on May 10, 2007 at 18:50 UTC Need Help??

fiddler42 has asked for the wisdom of the Perl Monks concerning the following question:

Kind, Gentle Folk,

I have a need to sort a bunch of geometric boxes in two ways, and I'm not sure how to do it in perl efficiently. Here are the starting conditions of this problem:-

1. I have a text string (the name of a box)

2. I have the lower left coordinates of a box

3. I have the upper right coordinates of a box

4. I have the area of a box

I have several million overlapping boxes within the area of one large, master box. Here is what I need to do:-

1. Sort all of the areas, from largest to smallest

2. Once the sort completes, I need the N largest boxes that do not overlap at all.

I am not looking for a detailed solution, just a rough idea on the best approach. Can someone help fill in these blanks?

```%NetAreaHash = ();
\$LL = join (" ",\$MinX,\$MinY); #\$MinX and \$MinY are decimals
\$UR = join (" ",\$MaxX,\$MaxY); #\$MaxX and \$MaxY are decimals
\$Area = abs(\$MaxX - \$MinX) * abs(\$MaxY - \$MinY);
\$NetAreaHash{\$Area} = join (" ",\$NetName,\$LL,\$UR,\$Area); #\$NetName is
+a string

#how do I numerically sort \$NetAreaHash by \$Area?

#how do I take the sorted \$NetAreaHash and identify the N largest area
+s that do not overlap?  N will usually be 2, 4 or 8.

Again, I'm just looking for some guidance here. The numerical sort of the hash probably won't be that tough, but identifying the N largest, non-overlapping boxes has me stumped at the moment!

Thanks,

Fiddler42

Replies are listed 'Best First'.
Re: Sorting hashes...
by xdg (Monsignor) on May 10, 2007 at 19:22 UTC

A couple thoughts:

• You might have rectangles with the same area, so your hash should key on the coordinates, not the area. E.g. \$NetAreaHash{\$MinX,\$MinY,\$MaxX,\$MaxY} = \$Area (the commas in the hash key combine to make a unique key).

• For checking overlap, if any vertex of rectangle A is within rectangle B then they overlap. If no vertex of A is within B, B might still be entirely contained within A, so check one vertex of B to see if it's within A.

-xdg

Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.

For checking overlap, if any vertex of rectangle A is within rectangle B then they overlap. If no vertex of A is within B, B might still be entirely contained within A, so check one vertex of B to see if it's within A.

Actually, two rectangles could intersect without any vertex of either box lying inside the other. Picture two rectangles, perpendicular to each other, arranged as a cross.

Re: Sorting hashes...
by lin0 (Curate) on May 10, 2007 at 21:03 UTC

Hi fiddler42,

One alternative would be using The Perl Data Language (PDL). You could have a look at the online book and the installation instructions to have an idea of how long it would take for you to start using PDL. If you decided to go with PDL, you could use the following approach:

1. Create three piddles (piddles are the name for the data structures that hold matrices in PDL). One to hold the string Box Name (this one has to be of Char piddle type so you need to add use PDL::Char). The second one to hold the MinX, MinY, MaxX, MaxY values. The last one to hold the Areas
2. Sort the piddle of Areas using qsorti to have access to the indices of elements in ascending order. In this way, you could use the indices to sort all the piddles.
3. To determine if two boxes overlap, you could create two piddles of the size of the larger box (using zeroes, for example) and then put special values (you could use "ones" or any other value you wish) in the elements that are between the respective [MinX, MaxX] and [MinY, MaxY]. For this you would need to use which and index. Finally, you find the intersection between the two piddles using intersect. If the intersection is not empty then the two piddles overlap.

By the way, I recommend you to have a look at RFC: Getting Started with PDL (the Perl Data Language) to see how those functions work.

Cheers,

lin0
Re: Sorting hashes...
by FunkyMonk (Chancellor) on May 10, 2007 at 20:13 UTC
If you're using \$Area as the key to your hash

my @sorted_areas = sort { \$a <=> \$b } keys %NetAreaHash

will sort them in ascending order (swap \$a and \$b if you prefer descending).

If you're going to follow xdg's advice and have \$Area as the value of the hash, use

my @sorted_areas = sort { \$NetAreaHash{\$a} <=> \$NetAreaHash{\$b} } keys %NetAreaHash

Again, swap \$a and \$b for descending.

As to your main problem, I have discovered a truly remarkable solution which this textbox is too small to contain (tm) :-)

Re: Sorting hashes...
by salva (Canon) on May 11, 2007 at 09:07 UTC
A hash is not the right structure for this problem. You should use an array:
```my @box;
for (...) {
my (\$NetName, \$MinX, \$MinY, \$MaxX, \$MaxY) = ...;
push @box, [\$NetName,
\$MinX, \$MinY, \$MaxX, \$MaxY,
(\$MaxX - \$MinX) * (\$MaxY - \$MinY)];
}
and you can sort it as:
```@box = sort { \$a->[5] <=> \$b->[5] } @box;
or if that sort results too slow:
```use Sort::Key qw(nkeysort_inplace);

nkeysort_inplace { \$_->[5] } @box;
Then, you have to look for the N largest areas. It can be done recursively: get the biggest rectangle and look for the N-1 largest, non-overlapping areas.

BTW, "the N largest boxes" is not very precise. Which is larger, a set of rectangles with areas (9, 1) or another with (7, 6)?

I suppose you want to maximize the total area:

```
# untested!

my (\$area, @best) = n_largest_boxes(\@box, \$n);

sub overlap {
my (\$box1, \$box2) = @_;
return !( \$box1->[3] < \$box2->[1] or
\$box1->[1] > \$box2->[3] or
\$box1->[4] < \$box2->[2] or
\$box1->[2] > \$box2->[4] )
}

sub n_largest_boxes {
my (\$boxes, \$n, \$start, @acu) = @_;
my \$max = 0; # area for the biggest set of boxes found
my @max;     # rectangles on the biggest set
for (\$i = \$start || 0;
\$i + \$n <= @\$boxes and \$boxes->[\$i][5] * \$n > \$max;
\$i++) {
my \$box = \$boxes->[\$i];
next if grep overlap(\$_, \$box), @acu;
if (\$n > 1) {
my \$max1, @max1 = n_largest_boxes(\$boxes, \$i+1, \$n-1,
@acu, \$box);
if (\$max1 and \$max1 + \$box->[5] > \$max) {
\$max = \$max1 + \$box->[5];
@max = (\$box, @max1);
}
}
else {
return (\$box->[5], \$box);
}
}
return (\$max, @max);
}
Re: Sorting hashes...
by halley (Prior) on May 10, 2007 at 20:16 UTC
identifying the N largest, non-overlapping boxes

This sounds like a "clustering" problem to me... a google search on clustering techniques might do you some good.

--
[ e d @ h a l l e y . c c ]

Re: Sorting hashes...
by wjw (Priest) on May 11, 2007 at 04:57 UTC
You might try a google search on "adjacency" as well. This sounds a lot like what I have run into in the IC packaging and printed circuit board industry, so those might be filters on your search as well.

Seems to me I did something like this a long time ago when I was munging test point data. I seem to recall pre-filtering to reduce the work load by finding the center of my features (boxes) and identifying those that could not possibly overlap first. It has been a long time... sorry I did not hang on to that old code. Best of luck with your project.

• ...the majority is always wrong, and always the last to know about it...
• The Spice must flow...
• ..by my will, and by will alone.. I set my mind in motion
postgresql geometric datatypes (was Re: Sorting hashes...)
by doom (Deacon) on May 11, 2007 at 07:52 UTC
This sounds like a job for the postgresql database to me. It has a number of geometric data features (that I'm always looking for an excuse to play with some time).

There's a "box" datatype, defined by the two corners, e.g. box '((9,10), (11,13))', and there's an "area" function, e.g. area( box '((9,10), (11,13))' ), and there's also an "&&" operator you can use to see if two boxes intersect (which could be used with the logical NOT, I would presume, to find non-intersecting boxes).

Off the top of my head I don't see an easy way to get what you want all in one query, but I would think you could write a query to give you the next smaller box that doesn't intersect the already selected boxes... then you'd just need to repeat that process N times.

Re: Sorting hashes...
by dk (Chaplain) on May 11, 2007 at 13:24 UTC
Sorting problem aside, which is more-less straighforward, selecting the largest non-overlapping boxes is actually more interesting task. A couple of thoughts:

The "largest" actually means that you have to select 1st largest box, and then try to fit another moving down the sorted list, and so on, until you have N selected boxes. If a box fits, it is added to the selection. The problem arrears when you cannot fit the required N boxes, so you have to backtrack, remove the last box from the set of selected boxes and replace it with the next, lesser one, and continue. The whole backtracking process can easily take forever, so some kind of smart caching along the way is a must, but I cannot think of any.

Hope this helps

Re: Sorting hashes...
by mattr (Curate) on May 14, 2007 at 07:24 UTC
Math::Geometry::Planar has
```\$polygon->isinside(\$point);

Returns true if point is inside the polygon/contour (a point is inside
+ a contour if it is inside the outer polygon and not inside a hole).

\$polygon->area;

Returns the signed area of the polygon/contour (positive if the points
+ are in counter clockwise order). The area of a contour is the area o
+f the outer shape minus the sum of the area of the holes.

But collision detection is an interesting area of research and there are a lot of algorithms out there. Much depends on your application in terms of how sparse your model is, whether these boxes represent objects moving in time, are they close to uniform size, etc. This is a basic area of computer graphics programming and squares or oblong boxes are often used as bounding boxes to test whether objects bump each other or not.

Also, millions of boxes creates memory and time constraints that may make many algorithms difficult to use. I recommend reading this paper on collision detection for large scale environments which describes methods used to overcome the exponential time requirements for pairwise comparisons.

One good technique seems to be to project the objects onto each axis so you get an interval on X and an interval on Y, and sorting separately these lists (projection is trivial if your rectangles are never rotated at an angle). Two rectangles can only intersect if their X and Y intervals intersect (though the intersection might be a point or line, not a rectangle, so you have to define whether a point touching is an intersection in your system).

It is also possible to create an "interval tree" (a range tree) that can be queried to find intersections. By sweeping a vertical line across the X axis you can look for intersections and prune. Perhaps creating such a tree is a bit overkill for you.

I haven't read enough of the paper to tell if the interval tree is in fact the same thing as a a hierarchy of bounding boxes to reduce the number of elements to sort. Also, if you have mostly similarly sized objects then the technique of dividing the space into equal volumes and checking for collisions within them may also be useful. Since you have a million objects, efficient collision detection and sorting are very important.

Also regarding sorting, you may be interested to know that there are more advanced ways to sort than just the standard sort function. Advanced sorting techniques like GRT and the Orcish maneuver, which cache keys to reduce the number of sorts needed, are discussed by Uri Guttman and Larry Rosler and available in Guttman's Sort::Maker module (along with our merlyn's Schwartzian Transform! Also from Perl 5.8.8 there is a sort pragma mentioned here that might be tweakable to improve performance even of the plain sort function. I'd like to hear how it turns out.

At least in the above type algorithms, collision detection is based actually on sorting. However it seems to me that the actual order of implementation and which method you choose depends on your data. Certainly if you have all positive integer coordinates you can tell the GRT so and it will be more efficient (it only checks the most significant bytes first). If you know the lower left coordinate is guaranteed to actually be the lower left one and not backwards, then you can eliminate huge swaths of the space by only sorting the x coordinate of that point. If you did indeed have a list sorted by area you could arbitrarily discard the lower 90% of your rectangles.. but sorting a million of anything will probably take too long, so it is best if you can make your first sweep through the data do something easy like partitioning it, or using some trick like pack() to discard for example anything with less digits, etc. For a real implementation I'd have to know more about just what you are trying to do and see some of the data.

Re: Sorting hashes...
by petdance (Parson) on May 11, 2007 at 18:29 UTC
Please note that you cannot "sort a hash", because a hash has no order. You can sort the keys of a hash like any other list, however.

xoxo,
Andy

Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://614734]
Approved by xdg
Front-paged by liverpole
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others exploiting the Monastery: (2)
As of 2022-05-28 16:23 GMT
Sections?
Information?
Find Nodes?
Leftovers?
Voting Booth?
Do you prefer to work remotely?

Results (99 votes). Check out past polls.

Notices?