http://qs321.pair.com?node_id=938564


in reply to "Just use a hash": An overworked mantra?

But, if I can quote Donald Knuth: "Premature optimization is the root of all evil."

You need to profile the real code to find out where the bottleneck is. As BrowserUK points out, the time to read the file swamps any improvement you might make.

Using a hash as the default option is still probably the right choice: it's simple, understandable, and works when your data is strings, or dates, or anything else. An array on the other hand can only be used when your data is an integer, and you know that the range of the data is small. So in this very specific case an array will be better, but not in the general case.
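To make the comparison concrete, here is a minimal sketch of the same tally done both ways. The sample data and variable names are made up for illustration; the only assumption is that the values are small non-negative integers, as in the original question.

```perl
use strict;
use warnings;

# Hypothetical sample data: small non-negative integers.
my @data = (3, 1, 3, 0, 2, 3, 1);

# Hash version: works for any kind of key (strings, dates, ...).
my %count_by_hash;
$count_by_hash{$_}++ for @data;

# Array version: only valid because every value is a small
# non-negative integer that can serve directly as an index.
my @count_by_array;
$count_by_array[$_]++ for @data;

# Both structures hold the same tallies.
for my $n (sort { $a <=> $b } keys %count_by_hash) {
    printf "%d: hash=%d array=%d\n", $n, $count_by_hash{$n}, $count_by_array[$n];
}
```

The two loops are nearly identical, which is part of the point: switching to the array is a one-character change, but it only stays correct while the data keeps fitting that narrow description.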

Thanks for the interesting post, but to answer your question -- NO it's not! :)

Replies are listed 'Best First'.
Re^2: "Just use a hash": An overworked mantra?
by blakew (Monk) on Nov 17, 2011 at 18:00 UTC
    "An array on the other hand can only be used when your data is an integer"

    I think you meant "maps 1:1 with integers."

      In this case, "data" is a bunch of integers. In moving from a hash to an array, the "keys" do have to be integers. If you're just counting, nothing else matters, but if it is about key-value pairs, the move is still valid as long as the key is a (non-negative) integer. The value(s) in that pair do not have to be.

      Another thing not yet mentioned is that with datasets this large, it is not only the data itself that eats into the available memory; the overhead of perl's internal structures adds to that. Just today I checked how much memory a 1 MB .csv file took when represented as an array(ref) of array(ref)s: it grew to 10 MB! A hash takes slightly more overhead than an array (most overhead goes into converting a single number into a refcounted SV), so when on the verge of swapping, an array might actually be much faster than a hash.


      Enjoy, Have FUN! H.Merijn
        Your data can be characters, in which case use ord to map them to integers for the key. The point is that your data just needs to be mappable to integers; it is not necessarily integers itself.
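A small sketch of the ord mapping described above, counting characters with an array instead of a hash. The input string and variable names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical input: count how often each character occurs.
my $text = "abracadabra";

# ord() turns each character into its code point, which is used
# as the array index.
my @count;
$count[ ord($_) ]++ for split //, $text;

# chr() maps the index back to the character when reporting;
# slots that were never touched are undef and are skipped.
for my $code (grep { defined $count[$_] } 0 .. $#count) {
    printf "%s: %d\n", chr($code), $count[$code];
}
```

Note that the array grows to the largest code point seen, so this stays compact for ASCII-like data but becomes sparse if the characters span a wide range, at which point a hash is the better fit again.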