http://qs321.pair.com?node_id=938502


in reply to "Just use a hash": An overworked mantra?

Well, you've all made some good points. :) Particularly that I/O is the constraint.

A few thoughts...

Sometimes there is a better algorithm (as opposed to a better optimization), and finding it usually requires a better understanding of the core problem. For example, did the OP really need an exact tally, or would it have been adequate to sample and extrapolate? If so, what size sample would have been sufficient, and would it have been ok to take the first 1..s ints, or would it have been necessary to take s ints from random positions in the data set? We can't really answer that because we don't know enough about the data set.
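Just to make the sampling idea concrete, here is a rough sketch (not the OP's code): the file name, the one-integer-per-line layout, the sample size, and the total record count are all assumptions on my part.

    use strict;
    use warnings;

    my $file        = 'ints.dat';       # assumed input: one integer per line
    my $sample_size = 1_000_000;        # assumed sample size

    open my $fh, '<', $file or die "open $file: $!";

    # Tally only the first $sample_size integers.
    my %tally;
    my $seen = 0;
    while (my $n = <$fh>) {
        chomp $n;
        $tally{$n}++;
        last if ++$seen >= $sample_size;
    }
    close $fh;
    die "no data sampled\n" unless $seen;

    # Scale the sample counts up to the full data set.  This only works if
    # the sample is representative -- exactly the open question above.
    my $total_ints = 3_000_000_000;     # assumed size of the full data set
    my $scale      = $total_ints / $seen;

    my @top = sort { $tally{$b} <=> $tally{$a} } keys %tally;
    splice @top, 10 if @top > 10;       # show the ten most common values
    printf "%s => ~%.0f\n", $_, $tally{$_} * $scale for @top;

Whether taking the first s ints (as above) is good enough, or whether the s ints must come from random offsets, depends entirely on how the data was laid down.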

Another consideration could be maintaining a count as the data is generated. If over 3GB of integers were generated over the course of hours, days, weeks, or months, the extra cost of maintaining a running tally might go unnoticed, whereas a sudden 60-second lag may be a problem.
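A minimal sketch of what that might look like, assuming the producer can be modified; the emit_int() wrapper, the Storable tally file, and the flush interval are my own inventions, not anything from the original thread.

    use strict;
    use warnings;
    use Storable qw(store retrieve);

    my $tally_file  = 'tally.sto';
    my %tally       = -e $tally_file ? %{ retrieve($tally_file) } : ();
    my $since_flush = 0;

    sub emit_int {
        my ($n, $out_fh) = @_;
        print {$out_fh} "$n\n";         # the work the producer already does
        $tally{$n}++;                   # the extra, amortized bookkeeping
        store \%tally, $tally_file      # persist occasionally, not on every write
            unless ++$since_flush % 100_000;
    }

The point is that the cost gets spread across the whole generation run instead of being paid all at once at query time.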

And of course we don't even know if a sudden lag IS a problem. Perhaps the user runs this once, walks away to grab a coffee, and comes back to it later, never to run it again. Or perhaps it is automated as a cron job for 4:27am. Maybe it's like my Windows computer, where the world has already accepted that it takes 4 minutes to boot, and where shutting down may take even longer if some MS-pushed upgrade is waiting to finalize. Quite probably none of this matters to the person with the original question, and that being the case, not only is my meditation premature micro-optimization, but any attempt to find a better alternative is still unwarranted.

Nevertheless, I found it a fun exercise. If I ventured down the wrong tangent, I'll live and learn. In the end, the hour or so spent investigating was educational enough that I can justify it. And I'm thankful for the input, reminders, and gentle slap in the back of the head from my friends here.


Dave