comment on

This has to be the best example of bad testing I've ever seen.

*_count(): Correct key count in RV.
This is a pointless and logically incorrect test.
What if one of the numbers in the range 1 .. 999 was never generated by your random number generator?
*_count(): Correct value count in RV.
You've just tested how many keys are in the hash or array. Do you expect the number of values to be different? Can they be different?
*_count(): max integer in range.
Again, a pointless and incorrect test.
It is perfectly legitimate -- even if unlikely -- for a sequence of random numbers to not contain the largest value possible.
*_count(): min integer in range.
Ditto.

Now for the benchmark.

Incorporating the random generation of the dataset into each of the benchmarks is like incorporating the building of the track into an F1 race. The generation of the test data completely obscures the thing you are trying to test.

This statement: "In the interest of factoring out commonality I used the same random number generating code for all three subs." is completely wrong.

That you are calling subroutines with the same apparent name: rand_ranged_value() completely misses the fact that the Perl code is actually calling a subroutine called XS_main_rand_ranged_value() which does a bunch of stuff both before and after it calls the subroutine that the C code calls directly:

XS(XS_main_rand_ranged_value)
{
#ifdef dVAR
    dVAR; dXSARGS;
#else
    dXSARGS;
#endif
    if (items != 2)
       croak_xs_usage(cv,  "min, max");
    {
    int    min = (int)SvIV(ST(0));
    int    max = (int)SvIV(ST(1));
    int    RETVAL;
    dXSTARG;

    RETVAL = rand_ranged_value(min, max);
    XSprePUSH; PUSHi((IV)RETVAL);
    }
    XSRETURN(1);
}
[download]

So what you are actually benchmarking is the C compilers ability to compile-time optimise (eg.inline) a 100e6 calls to another C subroutine versus Perl's inability to optimise (at runtime) 100e6 calls to an XS subroutine which unwraps some native values from their Perl scalar wrappers before calling (unoptimised) the C subroutine 100e6 times, before wrapping the resultant native integer back into a Perl scalar. The result is far more than 2 x times skewed in favour of the C code.

Which might be alright if that was what you set out to benchmark, but it isn't. The actual process of counting the unique integer values is simply lost in the noise.

If you have need to derive counts of unique values in a huge dataset of ints, then that huge dataset must already exist, so it won't need to be "generated". Which makes adding it into the benchmark ...

Equally significant is the fact that storing 100 million integers in memory (in Perl) would take at least 3.2GB. Which precludes using a 32-bit perl for the job. And of course, those values will need to come from somewhere. Probably a file. And once you add IO into the equation, the roughly 60% speed advantage of using an array over a hash for this purpose:

#! perl -slw
use strict;
use Math::Random::MT qw[ rand srand ];
use Benchmark qw[ cmpthese ];

sub aCount {
    my $aRef = shift;
    my @counts;
    ++$counts[ $aRef->[ $_ ] ] for 0 .. $#{ $aRef };
    return \@counts;
}

sub hCount {
    my $aRef = shift;
    my %counts;
    ++$counts{ $aRef->[ $_ ] } for 0 .. $#{ $aRef };
    return \%counts;
}


our $N //= 1e6;

our @rands; $#rands = $N;
$rands[ $_ ] = 1+ int( rand( 999 ) ) for 0 .. $N;

cmpthese -1, {
    array => q[ my $res = aCount( \@rands ); ],
    hash  => q[ my $res = hCount( \@rands ); ],
};

__END__
C:\test>junk1
        Rate  hash array
hash  3.94/s    --  -36%
array 6.15/s   56%    --

C:\test>junk1
        Rate  hash array
hash  3.88/s    --  -38%
array 6.23/s   61%    --
[download]

Will be completely lost in the noise of the IO anyway. And if the data is in a file, then you don't need to load it all into memory to perform the counting:

perl 
-nlE"++$h{ $_ } }{ say qq[$_ : $h{ $_ }] for sort{ $a <=> $b } keys %h
+" 
< theBigFileOfInts
[download]

So if you had to go through the process of building that 3.2GB array in order to use the C version, the time saved in C would be swamped by the time spent allocating the memory and building the array.

And if you decided to call an XS subroutine for every line of the file to avoid that, then the XS would fare very badly relative to either of the pure Perl methods.

So, to answer your title question "Just use a hash": An overworked mantra?: No, It isn't. Whilst for this specific dataset: a set consisting of a contiguous range of low value integers; and if the dataset is already in memory, using an array may have some performance advantage, for the general case "Use a hash" for this purpose is very good advice.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

In reply to Re: "Just use a hash": An overworked mantra? by BrowserUk
in thread "Just use a hash": An overworked mantra? by davido

Are you posting in the right place? Check out Where do I post X? to know for sure.
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
Want more info? How to link or How to display code and escape characters are good places to start.


Come for the quick hacks, stay for the epiphanies.
	PerlMonks