A heap of medians: efficiency vs. speed

by kyle (Abbot)
on Jun 03, 2008 at 12:39 UTC ( [id://689865] )

A brief description of heaps

A heap can be thought of as a binary tree, but it's stored as an array (so it's very compact). The defining property of a heap is that in any subtree, the largest item in that subtree is at its root, which makes a heap partially sorted. A full description of heaps is beyond the scope of this meditation, but for my purposes, the important properties of a heap are as follows.

  1. A heap that's been broken by adding or removing one element from the "top" can be made back into a heap in O(log n) time.
  2. An unordered list can be made into a heap in O(n log n) time by walking from end to end doing the above O(log n) operation—so this is a relatively cheap O(n log n).
  3. Removing the top (largest) element over and over—an O(log n) operation each time—produces the elements in sorted order.

Heaps are interesting for other reasons, and I'd encourage any monk to look into them further. They're especially good for a priority queue in which you're always interested in the biggest thing.

(Note that "heap" can also refer to an area of memory used for dynamic memory allocation.)

Half sorting

Here's my relatively efficient way to find the median of a set. (I say "my" way because I thought of it independently, though certainly not first.)

  1. Put the set into a heap.
  2. Pull elements off the top until half the set is sorted.
  3. The median is on top.

Compare this to a more mundane approach to medians: sort the entire set and then pull out the relevant value. By sorting only half the list, it should be possible to obtain the median more efficiently than with a full O(n log n) sort.
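
Here's a minimal sketch of that recipe using Heap::Simple (the module used in the tests below). With the module's default order, extract_top returns the smallest element rather than the largest, so this version pulls off the lower half; for brevity it only handles an odd-sized list.

use strict;
use warnings;
use Heap::Simple;
use List::Util 'shuffle';

my @data = shuffle 0 .. 100;    # odd count; the median is 50

# 1. Put the set into a heap.
my $heap = Heap::Simple->new();
$heap->insert( @data );

# 2. Pull elements off the top until half the set has come out.
$heap->extract_top for 1 .. int( @data / 2 );

# 3. The median is on top.
print $heap->extract_top, "\n";    # prints 50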

Testing the supposed efficiency gain

In the code below, I create a class for tied scalars that act like real scalars except that every FETCH() increments a counter. I feed lists of these scalars into median-finding functions based on both sort and Heap::Simple and compare the number of accesses required by each. The assumption is that every two accesses correspond to one comparison performed by the algorithm, so this shows which method does more comparisons.

use strict;
use warnings;

package Tie::Scalar::Count;

use vars qw( $Count );
$Count = 0;

sub TIESCALAR { bless \my $x, shift }
sub STORE     { ${ $_[0] } = $_[1] }

sub FETCH {
    $Count++;
    return ${ $_[0] };
}

package main;

use Test::More 'no_plan';
use List::Util 'shuffle';
use Heap::Simple;

sub scalars2counters {
    foreach my $m ( @_ ) {
        my $x = $m;
        tie $m, 'Tie::Scalar::Count';
        $m = $x;
    }
}

sub median_sort {
    my @s = sort { $a <=> $b } @_;
    my $i = scalar @s / 2;

    return ( $i == int $i )
        ? ( $s[ $i - 1 ] + $s[ $i ] ) / 2
        : $s[ int $i ];
}

sub median_heap {
    my $heap;
    my $tied = !!( tied $_[0] );

    if ( $tied ) {
        $heap = Heap::Simple->new( elements => [ 'Method' => 'FETCH' ] );
        $heap->insert( map { tied $_ } @_ );
    }
    else {
        $heap = Heap::Simple->new();
        $heap->insert( @_ );
    }

    my $i = scalar @_ / 2;

    for ( 0 .. int $i - 2 ) {
        $heap->extract_top;
    }

    if ( $i == int $i ) {
        my $v1 = $heap->extract_top;    # $i - 1
        my $v2 = $heap->extract_top;    # $i

        if ( $tied ) {
            $v1 = $v1->FETCH();
            $v2 = $v2->FETCH();
        }

        return ( $v1 + $v2 ) / 2;
    }
    else {
        $heap->extract_top;             # $i - 1
        return $tied ? $heap->extract_top->FETCH() : $heap->extract_top;
    }
}

# Confirm that Tie::Scalar::Count works as advertised
{
    tie my $foo, 'Tie::Scalar::Count';
    is( $Tie::Scalar::Count::Count, 0, 'count is zero initially' );
    $foo = 10;
    is( $Tie::Scalar::Count::Count, 0, 'count is zero after STORE' );
    is( $foo, 10, 'FETCH works' );
    is( $Tie::Scalar::Count::Count, 1, 'count is 1 after FETCH' );
}

# Confirm that median_sort and median_heap get the right medians
{
    my @odd  = ( 1, 2, 3 );
    my @even = ( 2, 4 );

    is( median_sort( @odd ),  2, 'odd median_sort is correct' );
    is( median_sort( @even ), 3, 'even median_sort is correct' );

    TODO: {
        local $TODO = 'heap method not ready for a short list';
        is( median_heap( @odd ), 2, 'odd median_heap is correct' );
    }
    is( median_heap( @even ), 3, 'even median_heap is correct' );

    my @bonus1 = shuffle 0 .. 100;
    my @bonus2 = shuffle 1 .. 100;

    is( median_heap( @bonus1 ), median_sort( @bonus1 ),
        'larger odd list is correct' );
    is( median_heap( @bonus2 ), median_sort( @bonus2 ),
        'larger even list is correct' );
}

# Check that scalars2counters works
{
    my @scalars = 0 .. 10;
    scalars2counters( @scalars );
    my $tied = grep { tied $_ } @scalars;
    is( $tied, scalar @scalars, 'all elements tied' );
}

# Confirm that counting works when using median_sort
{
    $Tie::Scalar::Count::Count = 0;
    my @m = shuffle 0 .. 10;
    scalars2counters( @m );

    is( median_sort( @m ), 5, 'median_sort is correct' );
    ok( $Tie::Scalar::Count::Count > scalar @m,
        "count ($Tie::Scalar::Count::Count) is > @{[ scalar @m ]}" );
}

# Confirm that counting works when using median_heap
{
    $Tie::Scalar::Count::Count = 0;
    my @m = shuffle 0 .. 10;
    scalars2counters( @m );

    is( median_heap( @m ), 5, 'median_heap is correct' );
    ok( $Tie::Scalar::Count::Count > scalar @m,
        "count ($Tie::Scalar::Count::Count) is > @{[ scalar @m ]}" );
}

{
    $Tie::Scalar::Count::Count = 0;
    my @m = shuffle 0 .. 100;
    scalars2counters( @m );

    my $heap_median   = median_heap( @m );
    my $heap_compares = $Tie::Scalar::Count::Count;

    $Tie::Scalar::Count::Count = 0;
    my $sort_median   = median_sort( @m );
    my $sort_compares = $Tie::Scalar::Count::Count;

    is( $heap_median, $sort_median, 'Heap answer same as sort' );
    ok( $heap_compares < $sort_compares,
        "fewer heap compares ($heap_compares < $sort_compares)" );
}

Efficiency is not speed

The heap method wins. In my tests, it typically does the job with about 65% of the comparisons required by a full sort. This seems like a big win, but there's a catch.

Running Benchmark on these functions, I find the sort-based median comes out about three times faster for a list with millions of elements. The efficiency advantage of the heap can't beat the optimization advantage of sort. In my testing, I ran out of memory before I found a list long enough for the heap's efficiency to win out.
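
For reference, the timing was done along these lines (a sketch, not the exact script; it assumes the median_sort and median_heap functions from the code above and a list of plain, untied numbers):

use Benchmark 'cmpthese';
use List::Util 'shuffle';

my @big = shuffle 1 .. 1_000_001;    # plain scalars, no tie overhead

cmpthese( -3, {
    sort => sub { median_sort( @big ) },
    heap => sub { median_heap( @big ) },
} );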

Anecdote

Years ago, I converted some sort-based median-finding code to use a heap instead. That was in C++, so the different implementations were on more of an equal footing. In arrays of two million floats, I got a performance gain, but it was small enough that it could have been a measurement error (I was using a stop watch).

Implementation matters

I tried this first with Heap::Binary instead of Heap::Simple, and it was actually less efficient than a simple sort. I never looked into why, but clearly not all heaps are created equal.

The heap method could probably be made faster if it were tailor-made for the task at hand. Using the CPAN module, I was actually removing elements on the way to the median, but it's not necessary to do this. A custom implementation could walk down the array without having to shorten it. The custom implementation also wouldn't have the baggage of supporting the tied scalars I was using for testing. Finally, and perhaps most importantly, building a heap from unordered data can actually be done faster than O(n log n) using Tarjan's algorithm (I didn't know this before I started looking around Wikipedia for this writeup).

Last thoughts (for lack of a conclusion)

I wrote this up mostly as (what I hope is) an interesting way to look at a common problem. Because sort is so much faster, this won't be of much use to Perl programmers.

I also find this an interesting way to look at the efficiency of a piece of code rather than its speed. The code in question can stay a black box, and we can still get some idea of how much work it does on the data involved. Most of the time we'll be more interested in speed than efficiency, but the latter may be worthwhile for estimating how an algorithm will scale.

Replies are listed 'Best First'.
Re: A heap of medians: efficiency vs. speed
by salva (Canon) on Jun 03, 2008 at 14:13 UTC
    There are better algorithms to find the median (or, in general, the element at any given position n). The one you are proposing is still O(N log N), and it can be done in O(N). See the Selection algorithm entry on Wikipedia.

    In Perl, you can use my module Sort::Key::Top to select the top n elements from a list. For instance, to get the median from a list of numbers:

    use List::Util qw(shuffle max);
    use Sort::Key::Top qw(ntop);

    my @data = shuffle (0..100);

    my $median = ntop -1, ntop 1 + @data / 2, @data;   # O(N)
    # or
    # $median = max ntop 1 + @data / 2, @data;         # O(N)
    # $median = (ntopsort 1 + @data / 2, @data)[-1];   # O(NlogN)!

    print "median: $median\n";
    Note that ntop returns the n top elements unsorted and has O(N) complexity.

    ntopsort returns them sorted, but it has O(NlogN) complexity.

    update: BTW, the n in ntop doesn't stand for nth but for numeric!
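
    For the curious, the randomized variant of that selection algorithm is only a few lines of pure Perl. This is just an illustrative sketch (the kth_smallest name is made up, and this is not how Sort::Key::Top works internally):

    use strict;
    use warnings;
    use List::Util qw(shuffle);

    # Return the $k-th smallest element (0-based) of @$list in O(N)
    # expected time by partitioning around a random pivot.
    sub kth_smallest {
        my ( $list, $k ) = @_;
        my $pivot = $list->[ rand @$list ];
        my ( @lt, @eq, @gt );
        for my $x ( @$list ) {
            if    ( $x < $pivot ) { push @lt, $x }
            elsif ( $x > $pivot ) { push @gt, $x }
            else                  { push @eq, $x }
        }
        return kth_smallest( \@lt, $k ) if $k < @lt;
        return $pivot                   if $k < @lt + @eq;
        return kth_smallest( \@gt, $k - @lt - @eq );
    }

    my @data = shuffle 0 .. 100;
    print kth_smallest( \@data, int( @data / 2 ) ), "\n";    # prints 50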

Re: A heap of medians: efficiency vs. speed
by ambrus (Abbot) on Jun 03, 2008 at 14:26 UTC

    Note how you can find the median asymptotically faster than you can do a half-sort. Indeed, the C++ standard library has functions both for half-sorting and for finding the median.

    If you want to find the median fast, see Re: Puzzle: The Ham Cheese Sandwich cut. and other nodes in its thread (that thread has an easy-to-remember title). If you want to do a half-sort with a heap, see Binary heap, but it also works to just find the median, partition the elements into those greater than it and those less than or equal to it, and then sort one of the halves.

Re: A heap of medians: efficiency vs. speed
by moritz (Cardinal) on Jun 03, 2008 at 12:56 UTC
    Nice meditation kyle.

    While reading it I was reminded of a rather interesting thread on p5p regarding heap sort.

    Short summary: some research folks found that they could reduce the number of comparisons in heapsort. John P. Linderman answered that it's all fine and good, but in real life heapsort suffers from bad caching characteristics, and thus can't really compete with mergesort and quicksort.

    Still it's a very nice technique, and I could imagine that heapsort beats other algorithms when you only need to sort half of the list. (Maybe you could also optimize quicksort to only find the median - any thoughts on that?)

Re: A heap of medians: efficiency vs. speed
by amarquis (Curate) on Jun 03, 2008 at 13:58 UTC

    I'd like to say that, as a perl guy without a comp sci background, the only places I get this sort of information are Mastering Algorithms in Perl and nodes like these here, so I really appreciate them. Off to Wikipedia to read more, I suppose.

    "(Note that "heap" can also refer to an area of memory used for dynamic memory allocation.)" - This was useful because I came in here expecting databases stored in memory, and you were starting to weird me out talking about btrees :).

      I added that remark about the other meaning of "heap" after feedback from moritz, who was among several monks who helpfully reviewed this meditation before I posted it. A big thanks also should go to tye.

Re: A heap of medians: efficiency vs. speed
by thospel (Hermit) on Jun 11, 2008 at 13:58 UTC
    As author of Heap::Simple I'd like to add the following observations:
    • Make sure Heap::Simple::XS is used if you care about speed for this kind of operation.
    • Use the (default) scalar element type if that is good enough. It's the fastest of the available element types.
    • Use the max_count option if you know in advance that you only need the last m elements. This will save a lot of compares, and in this case it also makes the extract loop unnecessary.
    With these three changes, getting the median from a heap of 1e6 integers comes close in speed. Sort is still faster though.

    You can also add the dirty option. Heap::Simple::XS will then use direct numeric compares instead of calling Perl's > operator. This breaks overload and ties though. That way the Heap::Simple solution is about twice as fast as sort. The break-even is somewhere in the tens of thousands of elements.

    The resulting code looks something like this:

    use List::Util qw(shuffle);
    use POSIX qw(ceil);
    use Heap::Simple::XS;

    my @list = shuffle(0..1e6);
    my $h = Heap::Simple::XS->new(max_count => ceil(@list/2), dirty => 1);
    $h->insert(@list);
    print $h->extract_top;

    This doesn't change the fact that for even quite big lists sort is just faster for this kind of problem. Heaps are mainly for when you need to know the smallest element at many points in time while the set of elements keeps changing, or for some cases where a comparison is very expensive, but even then there are better ways to do it with fewer compares if you just want the median.
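
    To illustrate that "changing set" use case, here is a tiny sketch of Heap::Simple used as a running-minimum priority queue (the numbers are made up):

    use Heap::Simple;

    # Keep a changing set of pending costs and always pull the cheapest next.
    my $queue = Heap::Simple->new();    # default order: smallest first
    $queue->insert( 12, 5, 9 );
    print $queue->extract_top, "\n";    # 5
    $queue->insert( 3, 7 );
    print $queue->extract_top, "\n";    # 3
    print $queue->extract_top, "\n";    # 7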

    PS: Heap::Simple already uses special code to make an insert of many elements at once faster than repeatedly inserting one element. You seem to hint at an even faster algorithm, but I don't see how to use Tarjan's algorithm for that. Can you give me a reference?

      Thanks for your comments!

      I'm sorry to say I don't know any more about building heaps faster than the little bit I said in the article. I got what I wrote about Tarjan's algorithm from the Wikipedia article, Heap (data structure), but I haven't looked into it myself at all. Sorry I can't be more help.

Re: A heap of medians: efficiency vs. speed
by Anonymous Monk on Jan 13, 2009 at 17:57 UTC
    "An unordered list can be made into a heap in O(n log n) time by walking from end to end doing the above O(log n) operation—so this is a relatively cheap O(n log n)."

    Actually there is a better bound. Namely, you can turn an unordered list into a heap in just O(n) instead of O(n log n).
      Sorry, didn't read till the end of the article before posting the above. Anyway, there is no need for Tarjan's algorithm, since a tighter analysis of a simple heapify-up/heapify-down algorithm shows that it works in O(n).
      As for finding the median, this can always be done in O(n) using the so-called selection algorithm (a particular variant, randomized selection, works very fast in practice).
      The only interesting problem is having a structure such that several (many) extractions of the median are O(log n) each, with as little overhead as possible for maintaining such a structure (under e.g. insertions of new elements).
Re: A heap of medians: efficiency vs. speed
by hbm (Hermit) on Jan 13, 2009 at 19:11 UTC

    Another recent topic here led me to this article: www.sysarch.com/Perl/sort_paper.html. It shows that of two seemingly identical sorts:

    @fast = sort @unsorted;
    @slow = sort { $a <=> $b } @unsorted;

    The first one is faster because it runs entirely in C in the Perl core, while the second is slower because the sortsub executes in Perl code.

    I was sorting tens of thousands of files by creation time in reverse chronological order, and was delighted to find the first one below faster than the second:

    @fast = reverse sort @unsorted;
    @slow = sort { $b <=> $a } @unsorted;

    kyle, I wonder how your median_sort might benchmark if it used my @s = sort @_;?

      I'll start by pointing out that the two sorts you show are not identical.

      use List::Util qw( shuffle );

      my @n = shuffle 5 .. 15;

      print join( q{, }, sort @n ), "\n";
      print join( q{, }, sort { $a <=> $b } @n ), "\n";

      __END__

      10, 11, 12, 13, 14, 15, 5, 6, 7, 8, 9
      5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15

      That's because sort, by default, uses "cmp" rather than "<=>" for comparisons. That's sort of beside your point, however. A sort with the comparison unspecified will be faster than one that's explicit, and there's a convenient special case for "reverse sort". That's all good.

      It's sort of beside the point of my meditation, however. The point here was to compare efficiency and speed. With the tied scalars, I was able to show that the heap method was more efficient—it did fewer comparisons. In spite of that, it was still slower because sort does its comparisons so much faster due to being optimized C.

      In any case, thanks for your comment.
