http://qs321.pair.com?node_id=938456

A few days ago on Stackoverflow someone expressed a problem that we see all the time. He had a bunch of items, and he needed to know how many of each unique type he had. Predictably the predominant suggestions were to "Just use a hash." There was a code example given, and even a link to perlfaq4. Click "post". A job well done. Let's go enjoy a cold beverage.
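
For readers who haven't seen it, the idiom in question is only a few lines. Here's a minimal sketch (the item list is invented purely for illustration):

use strict;
use warnings;

# Tally how many of each unique item we have.
my @items = qw( apple pear apple plum pear apple );

my %count;
$count{$_}++ for @items;    # each distinct item becomes a hash key

print "$_: $count{$_}\n" for sort keys %count;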

But in this specific question, the devil was in the details. First, the user was determining how many times each individual integer occurred in a large set of integers with a range of 1 through 999. Ok, it's starting to sound a little more like an array, although we don't really know from the question how dense the distribution is. For sure, if an array indexed 0..999 is used, the first element is wasted, and possibly others. ...a potentially sparse array, which once again sounds like a hash.

Oh, one more detail: The set of integers we're testing is 100 million large.

Let's look at that again: 100 million integers in the range of 1 through 999. How many times is each value represented in this set?

The idiomatic approach: Keeping in mind the powers of our language, and the pros of using well-understood idioms, the hash seemed like the most obvious approach. It works for just about any other situation where we need to count how many of each type of item we have. And indeed, it works in this situation too. But look at the size of the data set: 100M integers. Look again at the range: about 1000 "buckets". How long could it take to increment 1 of 1000 buckets 100 million times? Consider this code:

sub count_in_hash {
    my $num_ints = $_[0] // $datasize;
    my %container;
    $container{ rand_ranged_value( 1, 999 ) }++ for( 1 .. $num_ints );
    return \%container;
}

On the faster of my computers it takes 22.8 seconds (time spent generating random ints from 1 .. 999 included).

If your code is too slow, you must make it faster. If no better algorithm is available, you must trim cycles.

-- Tom "Duff's Device" Duff

The hash approach takes advantage of Perl's wonderful tool set to abstract away complexity. And it's usually pretty efficient. Iterating through 100M integers is an O(n) operation. Incrementing an individual hash element's value is an O(1) operation in the aggregate. You don't get much better than that unless the data set is highly predictable.

But just because hash key lookups/increments are O(1) operations, that doesn't make them computationally cheap. There is a lot of work going on behind the scenes. And that's a significant portion of the time spent in an algorithm that uses a hash to count set items. Big-O notation concerns itself with order of growth as the data set grows. It doesn't concern itself with how much work is being done, only with how that amount of work grows as 'n' grows. Individual hash inserts/lookups do not change significantly as 'n' grows. So we represent those operations as O(1). But we have to keep in the back of our minds that they're not computationally trivial.

As Tom Duff said, "if no better algorithm exists, you must trim cycles." Whether you use a hash or some other container, there's no significantly better algorithm than to look at each item, and increment a counter for that item. So we have to look at our container to find a way to trim cycles.

An array lookup is also O(1), but it turns out these array operations are computationally much cheaper to perform. The reason the common idiom isn't to use an array to deal with uniqueness or set-item counts is that it doesn't adapt well to general cases. But this is a very specific case where the set range is comparatively small, and the data set is comparatively large.

So let's look at an array:

sub count_in_array {
    my $num_ints = $_[0] // $datasize;
    my @container;
    $container[ rand_ranged_value( 1, 999 ) ]++ for( 1 .. $num_ints );
    return {    # Anonymous hash constructor.
        map { $container[$_] > 0 ? ( $_ => $container[$_] ) : () } 1 .. $#container
    };
}

On my system this sub takes about 12.4 seconds to tally 100 million integers in the range of 1 .. 999. That's an 83% improvement in computing speed compared against the hash.

Now we can all go home. We've cut our execution time almost in half by moving from a good "generalized" solution to a better "specific" solution (less general). But it's still taking 12.4 seconds. Can't we do better than that?

If this is part of the "critical 3%", maybe we should dig deeper and ask that question.

Yes, we can do better, but to do significantly better we need to look at another of Perl's strengths: XS. When has anyone ever called XS one of Perl's strengths? Ok, let's call it a tool instead, and the strength is that on CPAN there is an excellent tool designed to make XS easier to wield: Inline::C. Let's look at an implementation of the array approach using Inline::C:

int rand_ranged_value ( int min, int max ) {
    return ( rand() % ( max - min + 1 ) + min );
}

// Accepts smallest and largest bucket, and number of ints
// to test/produce.
// Returns (on the stack) a list of int/count pairs (pull
// into a hash).
void ints_per_bucket ( int min_bucket, int max_bucket, int quantity_ints ) {
    int  range_size = max_bucket + 1;
    int* int_sieve;
    int  i;
    Newxz( int_sieve, range_size, int );
    if( !int_sieve )
        croak( "ints_per_bucket: Couldn't allocate memory.\n" );
    for( i = 0; i < quantity_ints; i++ )
        int_sieve[ rand_ranged_value( min_bucket, max_bucket ) ]++;
    Inline_Stack_Vars;
    Inline_Stack_Reset;
    for( i = 0; i < range_size; i++ )
        if( int_sieve[i] > 0 ) {
            // Key   ( integer value counted ).
            Inline_Stack_Push( sv_2mortal( newSViv( i ) ) );
            // Value ( count for integer value ).
            Inline_Stack_Push( sv_2mortal( newSViv( int_sieve[i] ) ) );
        }
    Safefree( int_sieve );
    Inline_Stack_Done;
}

Newxz() uses a Perl "exception-safe" tool to allocate memory from the heap and initialize it to zero. Safefree() frees it when we're done. All the Inline_Stack_....() calls deal with pushing our results onto the Perl subroutine return stack. The rest is plain old C. And the result is just under 2 seconds to test 100 million integers: better than a 1000% increase over the hash approach, and better than a 500% increase over the Perl array approach.

Here's the full code and trial run:

#!/usr/bin/env perl

use strict;
use warnings;
use Benchmark qw/ cmpthese /;
use List::Util qw/ max min /;
use Test::More;
use Readonly;

use Inline C => 'DATA';

Readonly my $datasize  => 100_000_000;
Readonly my $test_size => 500_000;

my %test_subs = (
    array_count => \&count_in_array,
    hash_count  => \&count_in_hash,
    c_aray_cnt  => \&count_in_array_c,
);

while ( my( $name, $sref ) = each %test_subs ) {
    my $result = $sref->( $test_size );
    is( scalar keys   %{$result}, 999, "$name(): Correct key count in RV."   );
    is( scalar values %{$result}, 999, "$name(): Correct value count in RV." );
    is( max( keys %{$result} ),   999, "$name(): max integer in range."      );
    is( min( keys %{$result} ),     1, "$name(): min integer in range."      );
}

done_testing();

cmpthese( 5, \%test_subs );

sub count_in_array {
    my $num_ints = $_[0] // $datasize;
    my @container;
    $container[ rand_ranged_value( 1, 999 ) ]++ for( 1 .. $num_ints );
    return {    # Anonymous hash constructor.
        map { $container[$_] > 0 ? ( $_ => $container[$_] ) : () } 1 .. $#container
    };
}

sub count_in_hash {
    my $num_ints = $_[0] // $datasize;
    my %container;
    $container{ rand_ranged_value( 1, 999 ) }++ for( 1 .. $num_ints );
    return \%container;
}

sub count_in_array_c {
    my $num_ints = $_[0] // $datasize;
    return { ints_per_bucket( 1, 999, $num_ints ) };
}

__DATA__
__C__

// Public functions:
void ints_per_bucket ( int min_bucket, int max_bucket, int quantity_ints );
int rand_ranged_value ( int min, int max );

int rand_ranged_value ( int min, int max ) {
    return ( rand() % ( max - min + 1 ) + min );
}

// Accepts smallest and largest bucket, and number of ints
// to test/produce.
// Returns (on the stack) a list of int/count pairs (pull
// into a hash).
void ints_per_bucket ( int min_bucket, int max_bucket, int quantity_ints ) {
    int  range_size = max_bucket + 1;
    int* int_sieve;
    int  i;
    Newxz( int_sieve, range_size, int );
    if( !int_sieve )
        croak( "ints_per_bucket: Couldn't allocate memory.\n" );
    for( i = 0; i < quantity_ints; i++ )
        int_sieve[ rand_ranged_value( min_bucket, max_bucket ) ]++;
    Inline_Stack_Vars;
    Inline_Stack_Reset;
    for( i = 0; i < range_size; i++ )
        if( int_sieve[i] > 0 ) {
            // Key   ( integer value counted ).
            Inline_Stack_Push( sv_2mortal( newSViv( i ) ) );
            // Value ( count for integer value ).
            Inline_Stack_Push( sv_2mortal( newSViv( int_sieve[i] ) ) );
        }
    Safefree( int_sieve );
    Inline_Stack_Done;
}

The trial run:

ok 1 - hash_count(): Correct key count in RV.
ok 2 - hash_count(): Correct value count in RV.
ok 3 - hash_count(): max integer in range.
ok 4 - hash_count(): min integer in range.
ok 5 - array_count(): Correct key count in RV.
ok 6 - array_count(): Correct value count in RV.
ok 7 - array_count(): max integer in range.
ok 8 - array_count(): min integer in range.
ok 9 - c_aray_cnt(): Correct key count in RV.
ok 10 - c_aray_cnt(): Correct value count in RV.
ok 11 - c_aray_cnt(): max integer in range.
ok 12 - c_aray_cnt(): min integer in range.
1..12
             s/iter hash_count array_count c_aray_cnt
hash_count     22.8         --        -45%       -92%
array_count    12.4        83%          --       -85%
c_aray_cnt     1.91      1093%        551%         --

In the interest of factoring out commonality I used the same random number generating code for all three subs. Its cost isn't insignificant, but at least it's similar for all three tests. All subs provide a hash-ref with key/value pairs representing integer buckets and counts.
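
Since all three subs return the same shape of hash-ref, consuming the result is identical no matter which counting strategy produced it. A small usage sketch (the report format here is arbitrary):

my $counts = count_in_array_c( 1_000_000 );    # or count_in_hash(), count_in_array()

for my $int ( sort { $a <=> $b } keys %{$counts} ) {
    printf "%3d occurred %d times\n", $int, $counts->{$int};
}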

So what's the point? Where a general solution is needed, where data is not "esoteric", and where simplicity and maintainability are important, by all means we should go on using the idiomatic tool. But we shouldn't do so without at least considering whether our specific problem is better suited to a less idiomatic approach. The Perl array approach is considerably faster for this particular data set. However, it is definitely less readable (and consequently less maintainable). Perl Best Practices suggests that when cleverness must trump simplicity, one should document it well and keep it confined to a narrow segment of code.

It's no surprise that the C version was faster. I suppose my point in demonstrating it is that while the array approach is great, if our specific needs require something more than Perl provides natively, we shouldn't be afraid to use the tools Perl makes available to us to craft a more ideal solution to our particular problem. I also wanted to illustrate the ease with which we can test embedded C code using Test::More. When there is a need to get closer to the metal, we've got the tools available to us.

So is "Just use a hash." overworked? As long as it is seen as a common idiom to employ when the situation fits, no. Once it becomes a thoughtless mantra, perhaps.


Dave

Re: "Just use a hash": An overworked mantra?
by BrowserUk (Patriarch) on Nov 16, 2011 at 21:49 UTC

    This has to be the best example of bad testing I've ever seen.

    1. *_count(): Correct key count in RV.

      This is a pointless and logically incorrect test.

      What if one of the numbers in the range 1 .. 999 was never generated by your random number generator?

    2. *_count(): Correct value count in RV.

      You've just tested how many keys are in the hash or array. Do you expect the number of values to be different? Can they be different?

    3. *_count(): max integer in range.

      Again, a pointless and incorrect test.

      It is perfectly legitimate -- even if unlikely -- for a sequence of random numbers to not contain the largest value possible.

    4. *_count(): min integer in range.

      Ditto.

    Now for the benchmark.

    Incorporating the random generation of the dataset into each of the benchmarks is like incorporating the building of the track into an F1 race. The generation of the test data completely obscures the thing you are trying to test.

    This statement: "In the interest of factoring out commonality I used the same random number generating code for all three subs." is completely wrong.

    That you are calling subroutines with the same apparent name: rand_ranged_value() completely misses the fact that the Perl code is actually calling a subroutine called XS_main_rand_ranged_value() which does a bunch of stuff both before and after it calls the subroutine that the C code calls directly:

    XS(XS_main_rand_ranged_value)
    {
    #ifdef dVAR
        dVAR; dXSARGS;
    #else
        dXSARGS;
    #endif
        if (items != 2)
            croak_xs_usage(cv, "min, max");
        {
            int min = (int)SvIV(ST(0));
            int max = (int)SvIV(ST(1));
            int RETVAL;
            dXSTARG;

            RETVAL = rand_ranged_value(min, max);
            XSprePUSH; PUSHi((IV)RETVAL);
        }
        XSRETURN(1);
    }

    So what you are actually benchmarking is the C compiler's ability to compile-time optimise (e.g. inline) 100e6 calls to another C subroutine, versus Perl's inability to optimise (at runtime) 100e6 calls to an XS subroutine which unwraps some native values from their Perl scalar wrappers before calling (unoptimised) the C subroutine 100e6 times, then wraps the resultant native integer back into a Perl scalar. The result is far more than 2x skewed in favour of the C code.

    Which might be alright if that was what you set out to benchmark, but it isn't. The actual process of counting the unique integer values is simply lost in the noise.
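
    One rough way to see the size of that per-call penalty in isolation is to benchmark a do-nothing Inline::C function against a do-nothing Perl sub, so that only the call and argument-unwrapping overhead is measured. A sketch of the idea (the function names are invented, and the numbers will vary by machine):

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Benchmark qw[ cmpthese ];

    # A trivial C function: all we measure is the cost of crossing the
    # Perl-to-XS boundary, not any real work.
    use Inline C => q{ int c_noop( int x ) { return x; } };

    sub perl_noop { return $_[0] }

    cmpthese( -3, {
        xs_call   => sub { c_noop( $_ )    for 1 .. 100_000 },
        perl_call => sub { perl_noop( $_ ) for 1 .. 100_000 },
    } );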

    If you have need to derive counts of unique values in a huge dataset of ints, then that huge dataset must already exist, so it won't need to be "generated". Which makes adding it into the benchmark ...

    Equally significant is the fact that storing 100 million integers in memory (in Perl) would take at least 3.2GB. Which precludes using a 32-bit perl for the job. And of course, those values will need to come from somewhere. Probably a file. And once you add IO into the equation, the roughly 60% speed advantage of using an array over a hash for this purpose:

    #! perl -slw
    use strict;
    use Math::Random::MT qw[ rand srand ];
    use Benchmark qw[ cmpthese ];

    sub aCount {
        my $aRef = shift;
        my @counts;
        ++$counts[ $aRef->[ $_ ] ] for 0 .. $#{ $aRef };
        return \@counts;
    }

    sub hCount {
        my $aRef = shift;
        my %counts;
        ++$counts{ $aRef->[ $_ ] } for 0 .. $#{ $aRef };
        return \%counts;
    }

    our $N //= 1e6;
    our @rands; $#rands = $N;
    $rands[ $_ ] = 1+ int( rand( 999 ) ) for 0 .. $N;

    cmpthese -1, {
        array => q[ my $res = aCount( \@rands ); ],
        hash  => q[ my $res = hCount( \@rands ); ],
    };
    __END__
    C:\test>junk1
             Rate  hash array
    hash   3.94/s    --  -36%
    array  6.15/s   56%    --

    C:\test>junk1
             Rate  hash array
    hash   3.88/s    --  -38%
    array  6.23/s   61%    --

    Will be completely lost in the noise of the IO anyway. And if the data is in a file, then you don't need to load it all into memory to perform the counting:

    perl -nlE"++$h{ $_ } }{ say qq[$_ : $h{ $_ }] for sort{ $a <=> $b } keys %h" < theBigFileOfInts

    So if you had to go through the process of building that 3.2GB array in order to use the C version, the time saved in C would be swamped by the time spent allocating the memory and building the array.

    And if you decided to call an XS subroutine for every line of the file to avoid that, then the XS would fare very badly relative to either of the pure Perl methods.

    So, to answer your title question "Just use a hash": An overworked mantra?: No, it isn't. Whilst for this specific dataset (a contiguous range of low-value integers, already in memory) using an array may have some performance advantage, for the general case "Use a hash" for this purpose is very good advice.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      I do appreciate you taking the time to look it over. I've had my own concerns as well, and value your input.

      Let's talk about the random number generator for a moment because I have concerns too. Let me start by saying that I agree 100% that it's an unfortunate compromise, and I would be interested in hearing alternatives. As you pointed out it's unlikely that anyone is holding onto 3.2GB of integers in memory, but I needed a source of data. The original problem stated the integers were held in files. It would probably be worthwhile for me to alter the benchmark to pull in integers from a file instead. But then I'm testing IO, not the algorithm. Nevertheless, the random number generator is a poor alternative.

      Before posting, I was already dissatisfied enough with it that I profiled the code to see what effect it was having on the overall algorithms. The calls to the random number generator were consuming about 1/3rd of the time spent in the Perl array-method comparison sub. In the hash method, a nearly equal number of seconds was spent in the random number generator, so the ratio was lower. The ratio was about 1/3rd of the time for the XS comparison too, and that alone pretty clearly demonstrated that when Perl called the random number generator it was more expensive than when C did. I considered posting profiling data as well, but really a better approach is needed altogether.

      However, what we do have is consistency in the Perl functions. The array method is calling the same random generator, with the same overhead as the hash method. I'm going to create an example here using contrived numbers from my own recollection since I am not looking directly at my original post or the profiling data right now. Let's say that we spend 1/3rd of 12.5 seconds in generating random numbers (that's pretty accurate). That's a little over 4 seconds. In the XS version of the test we spent about 2/3rds of a second. Let's factor that out. That gives us 20 seconds per iteration for the hash method, 8.5 seconds for the array method, and 1.4 seconds for each iteration of the XS method. Add back in whatever constant time you want for disk IO: 60 seconds per test iteration? The hash approach takes 80 seconds now, the array approach takes 72.5, and the XS approach takes 61.4. None is fast, but considering the relative ease with which the quicker approaches can be applied isn't the improvement significant enough to investigate?

      The data set was large enough that it would be a freak of improbability for a well-performing random generator to not at least return some values at the topmost and bottommost positions in the range. I was using the tests to ensure that I had filled my allocated array from 1 to 999 without undershooting or overshooting it. If the tests showed I didn't hit the top and bottom, there were enough data pieces that it would be worth checking whether the generator was flawed. I also wanted to verify that when I hit the top of the array I didn't overshoot it. If I had, rather than the test failing I might have just crashed, but even if that happened, the tests served their purpose in exercising the code at its edges, and would have indicated to me where I was when such an event took place. Bad testing would be no testing at all. This falls way short of good testing, but it was what I needed to check for some off-by-one possibilities. The key/value test was redundant though. And the tests were far from comprehensive. There was a narrow set of issues I wanted to double check.


      Dave

        I'm going to create an example here using contrived numbers from my own recollection since I am not looking directly at my original post or the profiling data right now....

        Sorry, but those projections, even as talking points are so far out as to be almost meaningless.

        A few real statistics:

        ## Gen a file of 100e6 random ints: 1 .. 999
        perl -E"say 1+ int( rand 999 ) for 1 .. 100e6" > rands.dat

        ## Now count them using IO and a hash and write the counts to a file:
        [22:59:34.88] c:\test>perl -nlE"++$h{ $_ } }{ say qq[$_ :: $h{ $_ }] for sort{$a<=>$b} keys %h" rands.dat > hash.res
        [23:00:16.83] c:\test> rem 41.95 seconds

        ## Same thing using an array:
        [23:06:58.13] c:\test>perl -nlE"++$h[ $_ ] }{ say qq[$_ :: $h[ $_ ]] for 0 .. $#h" rands.dat > array.res
        [23:07:38.02] c:\test> rem 39.49 seconds

        The timings of multiple runs are consistent to within 3 or 4 hundredths of a second for both methods, and the 2-second difference on 100e6 data items is negligible. And even some of that difference is accounted for by the need to sort the keys prior to output in order that the result files are the same.

        the XS approach takes

        Sorry again, but you are still missing that to use XS code for this, you either:

        1. need to load the whole dataset into memory, which on my 4GB/64-bit machine takes over 10 minutes:
          [ 0:37:15.07] c:\test>perl -nE"$a[$.-1]=0+$_" rands.dat
          [ 0:47:23.27] c:\test>

          (It moves into swapping). And that's before you even start counting the integers.

        2. Or you need to call into the XS code for every line:

          Which means you are making 100e6 calls from perl to XS, and then 100e6 calls from there into C and results in the XS code taking half as long again as the pure Perl solutions:

          [ 0:32:32.61] c:\test>junk1-ic rands.dat 2>nul
          [ 0:33:28.69] c:\test>rem 56.08 seconds
        3. Or, you move the file handling into the XS also.

          At which point you may as well skip the Perl and write the solution directly in C. And it is still doubtful if you would gain any significant savings on the pure Perl solutions, and would definitely have lost out when you consider the time taken to write and compile the C program, over using a Perl one-liner.

        The data set was large enough that it would be a freak of improbability for a well-performing random generator to not at least return some values at the top most and bottom most positions in the range. I was using the tests to ensure that I had filled my allocated array from 1 to 999 without undershooting or overshooting it.

        In random data, it is unlikely, but not impossible. As a place-holder for the data in the problem you started with, completely possible. Depending on how that data was derived, maybe even likely.

        In effect, you were conflating a test of your random number generator with a test of the implementation of the algorithm.

        If your purpose is to test that generator, far better to test it in isolation, running over a range of say 1..9 for a few thousand iterations and check not only that all the possible values are generated, but they occur with approximately equal frequency.
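
        Something along those lines might look like the following sketch (it reuses the rand_ranged_value() name from the meditation; the iteration count and the 10% tolerance are arbitrary choices):

        use strict;
        use warnings;
        use Test::More;

        my $trials = 90_000;
        my %freq;
        $freq{ rand_ranged_value( 1, 9 ) }++ for 1 .. $trials;

        # Every possible value should have appeared...
        is( scalar keys %freq, 9, 'all values 1..9 were generated' );

        # ...and each with roughly equal frequency (here, within 10% of expected).
        my $expected = $trials / 9;
        cmp_ok( abs( ( $freq{$_} // 0 ) - $expected ), '<', $expected * 0.10,
                "value $_ occurred with roughly the expected frequency" ) for 1 .. 9;

        done_testing();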

        IMO, adding these tests to the code for your meditation smacks of wanting to be seen to "do the right thing" rather than actually serving any real purpose for the meditation itself.

        Don't take my criticisms as all negative.

        I realise -- from your recent posts -- that you are just getting into IC/XS and this probably started out as an experiment in seeing what you could do with IC. And for that I commend both your experiment and the posting of your findings in your meditation. I hope it will encourage others to take the step. This place would be all the better for more IC/XS coverage than it currently gets.

        Basically all I'm saying is that whilst I read your meditation with considerable interest, I find the case for your premise unproven by your methodology.

        That doesn't mean I don't think it is valuable. One of the first lessons you have to learn with IC is when it is worth going there and when not. In that respect, this is a great example of the care that must be taken when making that assessment.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
        The original problem stated the integers were held in files. It would probably be worthwhile for me to alter the benchmark to pull in integers from a file instead. But then I'm testing IO, not the algorithm.

        Not really. You'd at least be testing the real-world scenario. And what you'd find is most likely that the IO so completely swamps everything else once you get into using a hash that further savings will be immaterial. If saving the 20 seconds that the hash costs is important, you could use IO::AIO to read in the data and insert it into the hash - if I read that module right, you should be able to be inserting and reading simultaneously, which now brings you solely to reading speed (whether local disk or over the network). I doubt it'll be important.

        Basically, this is the definition of premature optimisation. You're micro-optimising something that will have no effect on the overall speed because you're ignoring the elephant in the algorithm: disk IO.

        That and the payoff isn't that great even when you do make this optimisation. Dropping from the hash to the array is a huge savings for the cost involved. Dropping to XS is not as good because the development and debug time likely will completely dwarf the savings again :-) For the record, I expect using IO::AIO to be cheaper to implement than your XS version, give a negligibly better savings, and still not be worth it overall :-)

Re: "Just use a hash": An overworked mantra?
by davido (Cardinal) on Nov 17, 2011 at 01:22 UTC

    Well, you've all made some good points. :) Particularly that IO is the constraint.

    A few thoughts...

    Sometimes there is a better algorithm (as opposed to a better optimization). And that probably requires a better understanding of the core problem. For example, did the OP really need an exact tally, or would it have been adequate to sample and extrapolate? If so, what size sample would have been sufficient, and would it have been ok to take the first 1..s ints, or would it have been necessary to take s ints from random positions in the data set. We can't really answer that because we don't know enough about the data set.
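
    For instance, if sampling were acceptable, something like this sketch might have been enough (the sample size, and the assumption of one integer per line in a rands.dat-style file, are both invented for illustration):

    use strict;
    use warnings;

    my $sample_size = 1_000_000;
    my $total_lines = 100_000_000;

    my %sample_count;
    my $seen = 0;

    open my $fh, '<', 'rands.dat' or die "open: $!";
    while ( my $line = <$fh> ) {
        chomp $line;
        $sample_count{$line}++;
        last if ++$seen >= $sample_size;
    }
    close $fh;

    # Scale the sample tallies up to estimates for the full data set.
    my $scale = $total_lines / $seen;
    printf "%3d: ~%.0f\n", $_, $sample_count{$_} * $scale
        for sort { $a <=> $b } keys %sample_count;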

    Another consideration could be maintaining a count as the data is generated. If over 3GB of integers was generated over the course of hours, days, weeks, or months, the extra cost of maintaining a running tally might go un-noticed, whereas a 60-second sudden lag may be a problem.
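
    In code, that might amount to nothing more than bumping a tally at the same point the data is written out; a sketch, assuming we control the producer:

    my %running_tally;

    sub record_value {
        my ( $fh, $value ) = @_;
        print {$fh} "$value\n";       # persist the value as before
        $running_tally{$value}++;     # ...and keep the count current as we go
    }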

    And of course we don't know even if a sudden lag IS a problem. Perhaps the user runs this once, walks away to grab a coffee, and comes back to it later, never to run it again. Or perhaps it is something automated as a cron job for 4:27am. Maybe it's like my Windows computer where the world has already accepted that it takes 4 minutes to boot, and where shutting down may take even longer if some MS-pushed upgrade is waiting to finalize. Quite probably none of this matters to the person with the original question, and that being the case not only is my meditation premature micro-optimization, but any attempt to find a better alternative is still unwarranted.

    Nevertheless, I found it a fun exercise. If I ventured down the wrong tangent, I'll live and learn. In the end the hour or so spent investigating was educational enough I can justify it. And I'm thankful for the input, reminders, and gentle slap-in-the-back-of-the-head from my friends here.


    Dave

Re: "Just use a hash": An overworked mantra?
by RichardK (Parson) on Nov 17, 2011 at 09:36 UTC

    But, if I can quote from Donald Knuth: "Premature optimization is the root of all evil."

    You need to profile the real code to find out where the bottleneck is. As BrowserUk points out, the time to read the file swamps any sort of improvement you might make.

    Using a hash as a default option is still probably the right one: it's simple, understandable, and works when your data is strings, or dates, or anything else. An array on the other hand can only be used when your data is an integer, and you know that the range of the data is small. So in this very specific case an array will be better, but not in the general case.

    Thanks for the interesting post, but to answer your question -- NO it's not! :)

      "An array on the other hand can only be used when your data is an integer"

      I think you meant "maps 1:1 with integers."

        In this case, "data" is a bunch of integers. In moving from a hash to an array, the "keys" do have to be integers. If you're just counting, nothing else matters, but if it is about key-value pairs, that move is still valid provided the key is a (positive) integer. The value(s) in that pair do not have to be.

        Another thing not yet mentioned is that with datasets this large, not only does the data itself put a limit on the available memory footprint, but the overhead of Perl's structures adds to that. Just today I checked what a 1 Mb .csv file grew to when represented internally as an array(ref) of array(ref)s: 10Mb! A hash takes slightly more overhead than an array (most of the overhead goes into converting a single number into a refcounted SV), so when on the verge of swapping, an array might actually be much faster than a hash.
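
        Devel::Size makes that overhead easy to see for yourself. A small sketch (the 1000-row by 10-column table of small integers is illustrative, not the 1 Mb CSV mentioned above):

        use strict;
        use warnings;
        use Devel::Size qw( total_size );

        # A 1000-row by 10-column table of small integers, as an array of arrayrefs.
        my @aoa = map { [ 1 .. 10 ] } 1 .. 1000;

        # The same values as one flat string, roughly what they occupy as CSV text.
        my $flat = join "\n", map { join ',', @{$_} } @aoa;

        printf "flat text: %7d bytes\n", total_size( \$flat );
        printf "AoA:       %7d bytes\n", total_size( \@aoa );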


        Enjoy, Have FUN! H.Merijn
Re: "Just use a hash": An overworked mantra?
by Ratazong (Monsignor) on Nov 16, 2011 at 21:55 UTC

    Now that is a great meditation - davido++! Great content, and great proof that your conclusions are right!

    In my eyes, your main point, "think before using your mantra", is the thing that distinguishes an expert from a wannabe.

    A few weeks ago I had a similar topic with a colleague who had just joined the team. We sometimes have to look up a few values from one tool and put them in an Excel sheet. When being trained on this, his statement was "This needs to be automated. You need a tool for this." A mantra which is probably right in 95% of the cases - but not in this case (extremely low effort for doing things by hand, data structures of the original data often change, highly bureaucratic IT limiting non-GUI access to the tool ...). Unfortunately, trying to persuade that guy that his mantra isn't always right took more time than doing the manual lookups/copies over a period of several months ...

    Unfortunately, sometimes you can't get the mantra out of some peoples head :-(

    Rata
Re: "Just use a hash": An overworked mantra?
by fullermd (Priest) on Nov 18, 2011 at 19:00 UTC

    So is "Just use a hash." overworked? As long as it is seen as a common idiom to employ when the situation fits, no. Once it becomes a thoughtless mantra, perhaps.

    That's unnecessarily overqualified. Any time anything becomes a thoughtless mantra, it's a given that it's over- (and improperly-)worked.

    As I often say, the best thing about {rules, best practices, coding standards, etc} is that they let you do things without having to think about them. The worst thing about them is that they let you do things without having to think about them. (Shoot, there's a whole Meditation in ringing on the corollaries to that ;)

Re: "Just use a hash": An overworked mantra?
by sundialsvc4 (Abbot) on Nov 17, 2011 at 17:24 UTC

    When you are dealing with “huge” amounts of data, everything depends upon ... memory.   Do you have it, or do you not.   (And if it’s “virtual,” you don’t have it.)

    Many computers these days have truly vast amounts of RAM, and there are roomfuls of computers thus equipped.   Under these circumstances, the chances are quite good that a tremendous structure can be built in RAM and that all of those pages will be (and will remain) present.   If that is known to be the case, then “in-memory” solutions work just fine, and yes they do behave very nicely as BrowserUK points out (with his characteristic I-love big-fonts flair ...).

    What becomes truly insidious about “in-memory” solutions, especially those based upon random-access data structures such as hashes, is when virtual-memory is constrained such that the entire block of data cannot fit into available physical RAM without incurring page faults.   A hash data-structure does not exhibit any locality of reference; quite the opposite.   Any reference to that structure could (worst-case) incur a page-fault, which suddenly transforms the entire algorithm from what you think is a fast, virtually I/O-free operation, into one that hammers your paging-device to death and brings the entire system to a screeching halt along with it.

    If you plot the performance curve of a virtual-storage system as the stress which is placed upon it increases, you will observe a line that basically increases in a nice, more-or-less linear fashion u-n-t-i-l it “hits the wall,” the so-called thrash point.   At this instant, the performance curve suddenly becomes exponential.   And that, as I’ve said before (from Ghostbusters), is “real wrath-of-God stuff.”

    BrowserUK is therefore entirely correct as long as you are well away from the thrash-point.   (And today, you might well be able to “throw cheap silicon at it” and thereby avoid the thrash-point entirely.   There is a reason why we have 64-bit systems now; soon to be 128.   Chips are cheap.)   But the punishment that can be inflicted, when and if it happens, is severe because it is exponential.

    In passing ... it is quite interesting that sorting a multi-million record file should take “ten minutes,” which is quite inexcusable.   There are interesting-looking articles here and also here.   Also specifically to our point, A Fresh Look at Efficient Perl Sorting, although it does not concern disk-sorts.

    A similar situation can happen with regard to accessing indexed files.   Once again we are dealing with a random-access data structure which may require some n physical I/O operations to retrieve the data, and which rewards locality-of-reference by virtue of cacheing recently-used index pages in RAM while discarding others.   Once again we have the “thrashing” phenomenon, albeit of a different kind and source.   Plentiful memory tends to mask the problem once again.   (Operating systems will dedicate leftover memory to file-buffering when there is no other competition for the space.)

    When and if you hit a thrash-point problem, you will know.   The difference can be a matter of many hours, or the difference between a job that finishes and one that does not.   “Ten minutes” (or more...) becomes an acceptable price to pay when for example you are talking about a massive runs-through-the-night production batch job.   And those, really, are the kind of situations I am talking about.   Not the size of problem that can be effectively dealt-with by buying more chips.   Obviously, “if you’ve got the RAM, flaunt it.”

      An array to hold 1000 integers requires 32k:

      @a = 1 .. 999;;
      print total_size \@a;;
      32144

      A hash to hold 1000 keys & integer values requires 100k:

      $h{ $_ } = $_ for 1.. 999;;
      print total_size \%h;;
      109055

      So, on a machine with less memory than some musical birthday cards, the hash or array one-liners will perform this task efficiently, and scale linearly for files containing at least 18,446,744,073,709,551,616 lines.

      To put that into perspective, it represents a single data file containing 16 Exabytes. Or approximately a million times more data than Google's entire storage capacity. If the OP can afford the amount of disk required to hold the file, it seems very unlikely that he'll have any trouble affording the 100k of ram.

      In the meantime, it would take a computer running at 10Ghz and able to perform 1 comparison per clock cycle, 3,741 years to sort that file, assuming no other time costs including IO or memory.

      So do the world a favour, and apply a little, the merest modicum, of thought to the problem at hand, before trotting out your olde worlde compooter wisdoms. Regurgitating received knowledge, long since superseded, as a substitute for actually thinking about the problem, does no one any good.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        For this problem, in which the known solution-space is constrained to what can fit into a reasonably sized hash and in which the total number of records and data-streams also fits into memory ... a memory-based solution works just fine, and there is utterly no reason to trundle out n-digit numbers to “prove” your point.

        My original comment, which I said even at that time was ancillary to the original discussion, is that there do exist other classes of problems which for various reasons do not lend themselves well to the “random-access based” (and to “memory-based”) approaches that might occur to you on first-blush.   This might not be one of those cases, but it does not invalidate the fact that such problems do exist.   In those problems, the incremental costs of virtual-memory activity become a death by a thousand cuts.   A fundamental change of approach in those cases transforms a process that runs for days, into one that runs in just a few hours.   I have seen it.   I have done it.   “Batch windows” are a reality for certain common business computing jobs.   Last year I worked on a system that processes more than a terabyte of new data, assimilated from hundreds of switching stations, every single day, and this was the change that gave them their system back.

        I was really, really hoping that in this case you wouldn’t rush out once again to prove how smart you are.   Let alone, as so many times before, publicly and at my expense.   Enough.

      Obviously, “if you’ve got the RAM, flaunt it.”

      Well, if you say so... :-D

      I found it interesting, so I tried some of BrowserUk's test scripts. I populated the rands.dat file - 0m48.948s. Then I loaded it into a hash - 0m37.042s. The array was significantly faster - 0m28.708s. Loading the data into an array took a fair bit of RAM, but since I am only using roughly 7GB of 12GB, I didn't encounter any swapping - 0m59.864s. Going on the first numbers, I must have something faster in my system already. However, not ten times faster. If you don't have the RAM, you may need to get it. :-)

      (with [BrowserUK's] characteristic I-love big-fonts flair ...)

      How very “ironic”, or, as one “path through the wood” might reveal, Very Ironic™.

Re: "Just use a hash": An overworked mantra?
by vkon (Curate) on Dec 23, 2011 at 12:38 UTC
    But just because hash key lookups/increments are O(1) operations, that doesn't make them computationally cheap.
    ......
    An array lookup is also O(1), but it turns out it's a much computationally cheaper task to perform these array operations.

    not correct

    Hash is not O(1).

    Also, this example of 0..999 integers looks too artificial to me, and it is too obvious to use an array here, so - for me - it does not deserve to even start the discussion in the first place :)

    I do appreciate good English of the original post, however, which I, personally, do not possess
    :) :)

        yes. I am sure.

        first, the third link you've pointed to reads:
        Thus—formally—hash tables often run in time O(log n).

        next, if the question is "is it possible to create an algorithm that, given arbitrary strings of finite length, makes associative store/lookup O(1)" - then the answer is "yes", but these are unrealistic algorithms.
        but - back to reality - real-life algorithms are O(log N) in the best case, and I'm afraid Perl's is worse than that
        and - yet - have you ever heard of the so-called "hash complexity attack" that was security-fixed by seed randomization in 5.8.0?
        To inform you, pre-5.8.0 hashes looked like O(1) but actually were O(n) (or maybe even worse in the worst-case scenario)

        next, your "stackoverflow.com" link contains simply wrong information.
        IOW, I do not see any proof there that convinces me.

        And - your last link does not even mention hashes.
        Would you bother to explain what it is doing here? TIA!

        and finally, do you see something strange in your phrase
        hash operations are O(1) operations, where 'n' is the number of elements in the hash
        ?
        I could be a bit joking, but I do not trust phrases like that.
        you're explaining a parameter 'n' that is not present in your formula - what other mistakes/typos have you made?
        :)

Re: "Just use a hash": An overworked mantra?
by sundialsvc4 (Abbot) on Nov 17, 2011 at 00:49 UTC

    One comment that I would make, also, (tangental to the immediate discussion though it be ...) is that unexpectedly-good results can be obtained by using “an old COBOL trick,” namely, use an external disk-sort to sort the file first.   Then, you can simply read the file sequentially.   Every occurrence of every value in every group will be adjacent ... just count ’em up until the value changes (or until end-of-file).   Any gap in the sequence indicates a complete absence of values, anywhere in the original file, that falls into that gap.   The amount of memory required is:   “basically, none.”

    And, yes ... you can sort a file of millions of entries and you’ll be quite pleasantly surprised at how well even a run-of-the-mill disk sorter (or Perl module) can get the job done.   It isn’t memory intensive (although it will efficiently use what memory is available).   It is disk-space intensive, as it will create and discard many temporary spill-files, but moderately so.

    The same technique is also ideally suited to comparing large files, or for merging them, because there is no “searching” to be done at all.   Merely sort all of the files the same way, and the process is once again sequential.
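
    For what it's worth, the counting pass over an already-sorted file is only a handful of lines. A sketch, assuming one integer per line in a pre-sorted file (here called sorted.dat):

    use strict;
    use warnings;

    open my $fh, '<', 'sorted.dat' or die "open: $!";

    my ( $current, $count );
    while ( my $value = <$fh> ) {
        chomp $value;
        if ( defined $current && $value ne $current ) {
            print "$current: $count\n";    # the value changed; emit the finished run
            $count = 0;
        }
        $current = $value;
        $count++;
    }
    print "$current: $count\n" if defined $current;    # flush the final run
    close $fh;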

      unexpectedly-good results can be obtained by using “an old COBOL trick,” namely, use an external disk-sort to sort the file first.

      How many times do you need to be told? No, they cannot! It takes at least 20 times longer!

      Proof.

      1. Using a hash takes 38.54 seconds:
        [ 1:27:09.40] c:\test>wc -l rands.dat
        100000000 rands.dat

        [ 1:27:48.30] c:\test>perl -nlE"++$h[ $_ ]" rands.dat

        [ 1:28:27.24] c:\test>
      2. Just sorting the same file takes almost 10 minutes!:
        [ 1:29:54.32] c:\test>sort -n rands.dat >rands.dat.sorted

        [ 1:39:03.08] c:\test>

        And that's before you run another process to perform the actual counting!

      To anyone with half a brain this is obvious.

      1. Using a hash requires:

        100e6 X the average time taken to read a record (IO).

        100e6 X the time taken to hash (H) the number and increment the value (i).

        ~Total time required: 100e6 * ( IO + H + I )

      2. Using a sort and then count (at least):

        100e6 X the average time taken to read a record (IOR).

        100e6 X log2( 100e6 ) = 2,657,542,476 X the time taken to compare two lines (COMPL).

        100e6 X the average time taken to write a record (IOW).

        100e6 X the average time to read the sorted file (IOR)

        100e6 X the time taken to compare two lines (COMPL) + the time taken to increment a count (I) + the time taken to record that count (R).

        ~Total time required: 200e6*IOR + 100e6*IOW + 2,757e6*COMPL + 100e6*I + 100e6*R

        And that assumes that the whole dataset can be held and sorted in memory thus avoiding the additional, costly spill and merge stages. Which if it could there would be no point in using an external sort.

      And please note: This isn't a personal attack. I will respond in a similar fashion to anyone pushing this outdated piece of bad information. It only seems personal, because you keep on doing it!

      By now, I'm half tempted to believe you are only doing it to incur this response. But I dismiss that notion as it would require me to attribute you with some kind of Machiavellian intent, and I prefer to believe in Hanlon's Razor in these cases.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.