"Just use a hash": An overworked mantra?

A few days ago on Stackoverflow someone expressed a problem that we see all the time. He had a bunch of items, and he needed to know how many of each unique type he had. Predictably the predominant suggestions were to "Just use a hash." There was a code example given, and even a link to perlfaq4. Click "post". A job well done. Let's go enjoy a cold beverage.

But in this specific question, the devil was in the details. First, the user was determining a count how many times individual integers occurred in a large set of integers with a range of 1 through 999. Ok, it's starting to sound a little more like an array, although we don't really know from the question how thorough the distribution is. For sure if an array of 0..999 is used the first element is wasted. And possibly others. ...a potentially sparse array, once again sounds like a hash.

Oh, one more detail: The set of integers we're testing is 100 million large.

Let's look at that again: 100 million integers in the range of 1 through 999. How many of each value is represented in this set?

The idiomatic approach: Keeping in mind the powers of our language, and the pros of using well-understood idioms, the hash seemed like the most obvious approach. It works for just about any other situation where we need to count how many of each type of item we have. And indeed, it works in this situation too. But look at the size of the data set. 100M integers. Look again at the range about about 1000 "buckets". How long could it take to increment 1 of 1000 buckets 100 million times? Consider this code:

sub count_in_hash {
    my $num_ints = $_[0] // $datasize;
    my %container;
    $container{ rand_ranged_value( 1, 999 ) }++ 
        for( 1 .. $num_ints );
    return \%container;
}
[download]

On the faster of my computers it takes 22.8 seconds (time spent generating random ints from 1 .. 999 included).

If your code is too slow, you must make it faster. If no better algorithm is available, you must trim cycles.
-- Tom "Duff's Device" Duff

The hash approach takes advantage of Perl's wonderful tool set to abstract away complexity. And it's usually pretty efficient. Iterating through 100M integers is an O(n) operation. Incrementing an individual hash element's value is an O(1) operation in the aggregate. You don't get much better than that if the data set isn't well predictable.

But just because hash key lookups/increments are O(1) operations, that doesn't make them computationally cheap. There is a lot of work going on behind the scenes. And that's a significant portion of the time spent in an algorithm that uses a hash to count set items. Big-O notation concerns itself with order of growth as the data set grows. It doesn't concern itself with how much work is being done, only with how that amount of work grows as 'n' grows. Individual hash inserts/lookups do not change significantly as 'n' grows. So we represent those operations as O(1). But we have to keep in the back of our minds that they're not computationally trivial.

As Tom Duff said, "if no better algorithm exists, you must trim cycles." Whether you use a hash or some other container, there's no significantly better algorithm than to look at each item, and increment a counter for that item. So we have to look at our container to find a way to trim cycles.

An array lookup is also O(1), but it turns out it's a much computationally cheaper task to perform these array operations. The reason that the common idiom isn't to use an array to deal with uniqueness or set item counts is because it doesn't adapt well to general cases. But this is a very specific case where the set range is comparatively small, and the data set is comparatively large.

So let's look at an array:

sub count_in_array {
    my $num_ints = $_[0] // $datasize;
    my @container;
    $container[ rand_ranged_value( 1, 999 ) ]++ 
        for( 1 .. $num_ints );
    return {    # Anonymous hash constructor.
        map{ 
                $container[$_] > 0 
                    ? ( $_ => $container[$_] ) 
                    : () 
        } 1 .. $#container
    };
}
[download]

On my system this sub takes about 12.4 seconds to tally 100 million integers in the range of 1 .. 999. That's an 83% improvement in computing speed compared against the hash.

Now we can all go home. We've cut our execution time almost in half by moving from a good "generalized" solution to a better "specific" solution (less general). But it's still taking 12.4 seconds. Can't we do better than that?

If this is part of the "critical 3%", maybe we should dig deeper and ask that question.

Yes, we can do better, but to do significantly better we need to look at another of Perl's strengths; XS. When has anyone ever called XS one of Perl's strengths? Ok, let's call it a tool instead, and the strength is that on CPAN there is an excellent tool designed to make XS easier to wield. Inline::C. Let's look at an implementation of the array approach using Inline::C:

int rand_ranged_value ( int min, int max )
{
    return ( rand() % ( max - min + 1 ) + min );
}

// Accepts smallest and largest bucket, and number of ints 
// to test/produce.
// Returns (on the stack) a list of int/count pairs (pull 
// into a hash).

void ints_per_bucket ( 
    int min_bucket, 
    int max_bucket, 
    int quantity_ints 
)
{
    int range_size = max_bucket + 1;
    int* int_sieve;
    int i;
    Newxz( int_sieve, range_size, int );
    if( !int_sieve ) 
        croak( 
            "ints_per_bucket: Couldn't allocate memory.\n"
        );
    for( i = 0; i < quantity_ints; i++ )
            int_sieve[ rand_ranged_value(min_bucket,max_bucket) ]++;
    Inline_Stack_Vars;
    Inline_Stack_Reset;
    for( i = 0; i < range_size; i++ )
        if( int_sieve[i] > 0 )
        {
            // Key ( integer value counted ).
            Inline_Stack_Push( 
                sv_2mortal( newSViv( i ) ) 
            );
            // Value ( count for integer value ).
            Inline_Stack_Push( 
                sv_2mortal( newSViv( int_sieve[i] ) ) 
            );
        }
    Safefree( int_sieve );
    Inline_Stack_Done;
}
[download]

Newxz() uses a Perl "exception-safe" tool to allocate memory from the heap and initialize it to zero. Safefree() frees it when we're done. All the Inline_Stack_....() calls deal with pushing our results onto the Perl subroutine return stack. The rest is plain old C. And the result is just under 2 seconds to test 100 million integers: better than a 1000% increase over the hash approach, and better than a 500% increase over the Perl array approach.

Here's the full code and trial run:

#!/usr/bin/env perl

use strict;
use warnings;
use Benchmark  qw/ cmpthese /;
use List::Util qw/ max min  /;
use Test::More;
use Readonly;

use Inline C => 'DATA';


Readonly my $datasize  => 100_000_000;
Readonly my $test_size => 500_000;

my %test_subs = (
    array_count     =>  \&count_in_array,
    hash_count      =>  \&count_in_hash,
    c_aray_cnt      =>  \&count_in_array_c,
);

while ( my( $name, $sref ) = each %test_subs ) {
    my $result = $sref->( $test_size );
    is( scalar keys   %{$result}, 999, 
        "$name(): Correct key count in RV."  
    );
    is( scalar values %{$result}, 999, 
        "$name(): Correct value count in RV."
    );
    is( max( keys %{$result} ),   999, 
        "$name(): max integer in range."     
    );
    is( min( keys %{$result} ),     1, 
        "$name(): min integer in range."     
    );
}
done_testing();    

cmpthese( 5, \%test_subs );


sub count_in_array {
    my $num_ints = $_[0] // $datasize;
    my @container;
    $container[ rand_ranged_value( 1, 999 ) ]++ 
        for( 1 .. $num_ints );
    return {    # Anonymous hash constructor.
        map{ 
                $container[$_] > 0 
                    ? ( $_ => $container[$_] ) 
                    : () 
        } 1 .. $#container
    };
}


sub count_in_hash {
    my $num_ints = $_[0] // $datasize;
    my %container;
    $container{ rand_ranged_value( 1, 999 ) }++ 
        for( 1 .. $num_ints );
    return \%container;
}


sub count_in_array_c {
    my $num_ints = $_[0] // $datasize;
    return { ints_per_bucket( 1, 999, $num_ints ) };
}

__DATA__
__C__

// Public functions:

void ints_per_bucket ( 
    int min_bucket, int max_bucket, int quantity_ints 
);
int rand_ranged_value ( int min, int max );


int rand_ranged_value ( int min, int max )
{
    return ( rand() % ( max - min + 1 ) + min );
}


// Accepts smallest and largest bucket, and number of ints 
//to test/produce.
// Returns (on the stack) a list of int/count pairs (pull 
//into a hash).
void ints_per_bucket ( 
    int min_bucket, 
    int max_bucket, 
    int quantity_ints 
)
{
    int range_size = max_bucket + 1;
    int* int_sieve;
    int i;
    Newxz( int_sieve, range_size, int );
    if( !int_sieve ) 
        croak(
            "ints_per_bucket: Couldn't allocate memory.\n" 
        );
    for( i = 0; i < quantity_ints; i++ )
            int_sieve[ rand_ranged_value(min_bucket,max_bucket) ]++;
    Inline_Stack_Vars;
    Inline_Stack_Reset;
    for( i = 0; i < range_size; i++ )
        if( int_sieve[i] > 0 )
        {
            // Key ( integer value counted ).
            Inline_Stack_Push( 
                sv_2mortal( newSViv( i ) ) 
            );
            // Value ( count for integer value ).
            Inline_Stack_Push( 
                sv_2mortal( newSViv( int_sieve[i] ) ) 
            );
        }
    Safefree( int_sieve );
    Inline_Stack_Done;
}
[download]

The trial run:

ok 1 - hash_count(): Correct key count in RV.
ok 2 - hash_count(): Correct value count in RV.
ok 3 - hash_count(): max integer in range.
ok 4 - hash_count(): min integer in range.
ok 5 - array_count(): Correct key count in RV.
ok 6 - array_count(): Correct value count in RV.
ok 7 - array_count(): max integer in range.
ok 8 - array_count(): min integer in range.
ok 9 - c_aray_cnt(): Correct key count in RV.
ok 10 - c_aray_cnt(): Correct value count in RV.
ok 11 - c_aray_cnt(): max integer in range.
ok 12 - c_aray_cnt(): min integer in range.
1..12
            s/iter  hash_count array_count  c_aray_cnt
hash_count    22.8          --        -45%        -92%
array_count   12.4         83%          --        -85%
c_aray_cnt    1.91       1093%        551%          --
[download]

In the interest of factoring out commonality I used the same random number generating code for all three subs. Its cost isn't insignificant, but at least it's similar for all three tests. All subs provide a hash-ref with key/value pairs representing integer buckets and counts.

So what's the point? Where a general solution is needed, where data is not "esoteric", and where simplicity and maintainability are important, by all means we should go on using the idiomatic tool. But we shouldn't do so without at least considering whether our specific problem is better suited to a less idiomatic approach. The Perl "array" approach is considerably faster for this particular data set. However, it is definitely less readable (and consequently maintainable). Perl Best Practices suggests that when cleverness must trump simplicity one should document it well, and keep it confined to a narrow segment of code.

It's no surprise that the C version was faster. I suppose my point in demonstrating it is that while the array approach is great, if our specific needs require something more than Perl provides natively, we shouldn't be afraid to use the tools Perl makes available to us to craft a more ideal solution to our particular problem. I also wanted to illustrate the ease with which we can test embedded C code using Test::More. When there is a need to get closer to the metal, we've got the tools available to us.

So is "Just use a hash." overworked? As long as it is seen as a common idiom to employ when the situation fits, no. Once it becomes a thoughtless mantra, perhaps.

Dave

Back to Meditations