http://qs321.pair.com?node_id=248282

blakem has asked for the wisdom of the Perl Monks concerning the following question:

I have a situation where I need to find the smallest N elements from a very large dataset, and I think a Heap sort work wonderfully. I tried Heap.pm but it is rather old, has mixed reviews, and seems to choke on the example code in the POD. Before I dig further into it I am seeking the wisdom of other monks...

Has anyone played with Heap? Any recommendations for other heap implementations that allow arbritray comparison functions?

-Blake

Replies are listed 'Best First'.
Re: Heap sorting in perl
by Ovid (Cardinal) on Apr 05, 2003 at 15:08 UTC

    Can you give some examples where it's having problems? Perhaps we can figure out what's going on. To be fair, I haven't used Heap.pm, but I did notice that it's tests are not terribly comprehensive, so I'm not surprised. I confess a desire to see what's not working and start writing tests for it :)

    If you happen to have a copy of "Mastering Algorithms with Perl", by O'Reilly, they have a heap implementation in there along with an excellent discussion. Perhaps rolling your own my help?

    Cheers,
    Ovid

    New address of my CGI Course.
    Silence is Evil (feel free to copy and distribute widely - note copyright text)

      Thanks for the book tip... After skimming through the "Mastering Algorithms" section on heaps and reading through the Heap.pm source, I was able to build a small script that seems to work:
      #!/usr/bin/perl -wT use strict; use Heap::Binary; use Heap::Elem::Num; my $heap = Heap::Binary->new; for (25,33,107,5,42,29) { my $elem = Heap::Elem::Num->new($_); $heap->add( $elem ); } my $min = $heap->extract_minimum(); print "Min: " . $min->val . "\n"; # prints 5
      Unfortunately this doesn't exactly match the usage explained in the POD. It could be just a documentation bug, but it worries me a bit.

      -Blake

      I just attempted to download and install Heap. Its example broke for me upon attempting call add. (Not autoloaded.) Perl 5.6.1 on Linux here.

      Is anyone maintaining it? Is it worthwhile to file a bug report?

        I got the same error with Perl 5.8 on Linux. After a quick search turned up mixed reviews for the module, I decided to post here before making my next move. At the very least, I'll email the author of Heap with a bug report.

        -Blake

Re: Heap sorting in perl
by blakem (Monsignor) on Apr 05, 2003 at 19:16 UTC
    The responses so far are very helpful, but I probably should have given a bit more context.....

    The resource I am optimizing for here is memory. I have a mod_perl process that slurps up a large dataset from a database, sorts it, then displays the first N items. This happens to balloon the memory consumption of the apache process to an unacceptable level.

    One approach to this, of course, is to sort the data in the database, but in this case I need the ability to define the sort function as arbritrary perl code.

    Heap sorting is attractive because selecting the smallest M elements from a dataset of size N (N being much larger than M) requires storing only M elements in memory at any one time. This would be a huge win for me.

    A heap is a datastructure that makes two operations very efficient: inserting a new element, and selecting (or removing) the largest element. The pseudo-code for my algorithm would look something like this:

    # select the smallest $M elements; while ($new_element = $dbh->fetchrow) { if ($heap->size < $M) { # build a heap of size M $heap->insert($new_element); } elsif ($element < $heap->largest) { # maintain a heap containing # the smallest M elements # seen so far $heap->remove_largest; $heap->insert($element); } }
    At the end of this loop we will have a $heap containing the M smallest elements, and never have to store more than M elements in memory at once.

    Therefore, I'm looking for a decent heap implementation and my first impression of Heap.pm was somewhat discouraging. I have several paths I can take from here, but before I go hacking up Heap.pm I was wondering if there might be a better place to start.

    -Blake

      I have a mod_perl process that slurps up a large dataset from a database, sorts it, then displays the first N items. This happens to balloon the memory consumption of the apache process to an unacceptable level.

      An obvious question, perhaps, but have you ruled out getting the database to do the work for you? That'd be a no-brainer if the column you're interested in is indexed. I'm guessing it isn't. Still, having the database do the sort might still be faster than scanning the entire table, pulling the data across a process boundary a row at a time, and then filtering it in the Apache process.

      Assuming you have a fairly large table, here's an ugly heuristic you might try: Probe the first 2*M rows of the table, sort, and find the M'th largest value. Then run a query that does a sort on the database side, subsetting the data before sorting by adding a WHERE clause to reject any row whose target column has a value greater than the M'th value you found the first time through. You can tune this by trying 3*M rows, 4*M rows, etc. The idea is to reduce the set that the database needs to sort.

        Letting the datbase do the sort is a good approach, but I was hoping for a more general, memory efficient solution that allows for arbitrary perl comparison routines.

        Its a common enough situation that I would like to create a subroutine that returns the rows I want and takes a database handle, a sorting coderef, M, and an optional offset. I'm envisioning something like:

        my $sortedrows = db_heap_sort($dbh, $coderef, $M, $offset);
        The question is, is there a perl heap implementation that would be suitable for implementing the above sub. Heap.pm seems like the obvious choice, but I have doubts about its current status.

        I have emailed the Heap.pm author, and will follow up with any new information I find out.

        -Blake

        If those are rows in a database, I agree with dws, just let database do it.

        If that particular column is not indexed, I suggest index it, as you need it.

        "order by" is usually not too expansive, especially when it only involves indexed range scan.

        Depends on your database, some database directly support syntax that allows to retrieve first N rows.
      Isn't there a solution posted already with working code which solves exactly this memory problem? It could readily be sped up substantially, and needs more testing though. But consider it proof of concept...
Re: Heap sorting in perl
by adrianh (Chancellor) on Apr 05, 2003 at 14:49 UTC

    No comments on Heap not having ever used it.

    However, unless you are dynamically adding and removing items to your dataset wouldn't something like:

    sub smallest_n (&\@$) { my ($cmp, $arrayref, $n) = @_; return unless $n && @$arrayref; $n = @$arrayref if @$arrayref < $n; my @results = sort $cmp @$arrayref[0..$n-1]; local ($a, $b); $a = pop @results; for (my $i = $n; $i < @$arrayref; $i++) { $b = $arrayref->[$i]; if ($cmp->() == 1) { @results = sort $cmp (@results, $b); $a = pop @results; }; }; return (@results, $a); }; use Test::More tests => 1; use List::Util qw(shuffle); my @a = shuffle (1..100000); my @b = smallest_n {$a <=> $b} @a, 5; is_deeply [@b], [1,2,3,4,5];

    be easier? The support for the heap datastructure will add a fair bit of overhead with your large dataset, so I wouldn't use one unless you need it.

    Update: totally misready what blakem was proposing. D'oh.

      The reason that someone with a large dataset might want a heap is that they don't want to have very much of the dataset in memory at once. With a Heap you can add an element at a time until you have your limit, and from there on you can add one and remove the biggest each time.

      When you are done you can then just extract off all of the elements in the heap, and you have the smallest N of them from largest to smallest. Without excessive memory usage.

        Erm. No :-)

        You have to add all the items from the dataset to the heap before you can remove the N lowest (or highest, depending on the direction your grow your tree).

        Yes, you could have the heap as an out-of-memory structure. However, if the dataset is on disk you can just read it in element by element and use the algorithm I proposed. It's still going to be less expensive in time and space than creating a heap.

        Unless you are goin to be adding and removing entries from the data set and need to keep it ordered a heap is overkill for the problem as stated.

      Your algorithm runs in Omega (N k log k), where N is the size of the set, and you are interested in the k smallest elements. Which is pretty lousy. With k for instance N / 100, your algorithm is worse than doing a bubble sort of the entire set, and getting the k smallest from the sorted set.

      Abigail

Re: Heap sorting in perl
by Anonymous Monk on Apr 05, 2003 at 15:29 UTC
    Rolling your own can be educational. I did it because I thought that it would be trivial to do. By the time I finished..well I was educated. ;-)

    Note that I used the quicksort rather than heapsort because I assumed that the issue is memory usage, and quicksort is easier to do.

Re: Heap sorting in perl
by pg (Canon) on Apr 05, 2003 at 15:16 UTC
    It seems not neccessary to have a separate sort step.

    You can do this:
    1. First take N elements from the begining of the unsorted array, and form a new array.
    2. Sort the new array, which would be much fast, assuming N is much smaller than the original array size.
    3. Go thru the rest array element in the original array, insert a element into the sorted array, if it is smaller than the largest element of the new array, and does not exist in the new array, and also get rid of the largest element in the new array.
    4. This algorithm goes thru the entire original array, but only for comparing, and most of the time, no extra effort (inserting) other than comparing is needed.
    (strike out per adrianh's comment)If you are willing to spend a little bit more effort, then I sugegst this improvement:

    In step 3, instead of going thru the entire original array, you can chop the original array into pieces, say each piece contains 10 * N element (tweak with this 10, it could be 5, could be 50...). Sort each piece (we are sorting some smaller arrays), then only take the first N elements of the sorted piece, and go thru them.
      A heap is a specialized datastructure that can be thought of as an "automatically sorted array" given a somewhat scaled down definition of "sorted". Sorted in this case means that the largest element is always easy to find and remove, and inserting new elements is also easy.

      Given that, our steps to find the smallest M elements would be:

      1. Build a heap from our first M items
      2. Compare next item to largest element in heap
      3. Replace largest with new if new < largest
      4. Repeat steps 2 and 3 for all remaining items

      As you can see, this is very similar to the strategy you proposed. In fact, you could view heaps as a datastructure designed specifically to implement this algorithm efficiently.

      -Blake

      In step 3, instead of going thru the entire original array, you can chop the original array into pieces, say each piece contains 10 * N element (tweak with this 10, it could be 5, could be 50...). Sort each piece (we are sorting some smaller arrays), then only take the first N elements of the soted piece, and go thru them.

      How is this an improvement (he asks curiously ;-) I don't see how creating a sorting several subsets of the data can be cheaper in time or space than a single pass through everything.

        adrianh is obviously right about this.

Re: Heap sorting in perl
by Abigail-II (Bishop) on Apr 06, 2003 at 21:51 UTC
    #!/usr/bin/perl use strict; use warnings; my @heap; sub heapify; sub heapify { my $idx = shift; my $min = $idx; for my $try (2 * $idx + 1, 2 * $idx + 2) { $min = $try if $try < @heap && $heap [$try] < $heap [$min] } return if $min == $idx; @heap [$idx, $min] = @heap [$min, $idx]; heapify $min; } sub extract_min { return unless @heap; my $min = $heap [0]; my $tmp = pop @heap; if (@heap) { $heap [0] = $tmp; heapify 0; } return $min; } # # First argument the number of smallest elements to be returned. # Then the list to find the elements in. # sub smallest { my $k = shift; @heap = @_; for (my $i = int (@heap / 2); $i --;) { heapify $i; } my @result; push @result => extract_min while @heap && $k --; @result; } __END__

    The running time of this is O (N + k log N), where N is the size of the set, and k the number of elements to be returned.

    Abigail

      Doesn't this require the whole list to be in memory? And doesn't that create a problem with the "from a very large dataset"?

        The original question didn't specify how the data would be available, so I'm just going to assume it fits in memory. Note that a heapsort only requires a constant amount of memory additional to the data to be sorted.

        If the data doesn't fit in memory, then that's a different problem. One can always tie the used array to whatever solution will keep the data to be worked on out of memory.

        Abigail

Re: Heap sorting in perl
by pg (Canon) on Apr 06, 2003 at 03:43 UTC
    Read that Heap.pm, and its default implementation Heap::Fibonacci. To be frank, don't have a good impression.

    The biggest problem I have with it is the storage it used. It used a tree based on hash, instead of a tree based on N-sequential array. A hash uses more memory than n-sequential array, and slower in indexing as "N-sequential" is not utilized.

    A N-sequential array is physically just a conventional array, but its element are specially positioned within the array, and can be easily viewed/used as a tree (in this case a heap).
Re: Heap sorting in perl
by aquarium (Curate) on Apr 07, 2003 at 10:24 UTC
    No magic from me..sorry :) Typical space vs speed problem. The faster (all in memory) based algorithms will eat up RAM; whilst slower (bubblesort, quicksort: even via temporary files if you need to) consume less memory but are slow. It's more of a question of striking the right balance of space use and speed attained. If we had any idea as to the sorting required, then perhaps someone could pull a rabbit out of a hat for this one--alas without knowing what SORT you need, no efficient sorting algorithm can be devised. Chris