Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?

comment on

( #3333=superdoc: print w/replies, xml ) Need Help??

use Benchmark 'cmpthese'; use List::Util 'sum'; our %hash = ( one => 1, two => 2, three => 3, four => 4, five => 5, six => 6, seven => 7, eight => 8, nine => 9, ten => 10, ); our @array = ( 1 .. 10 ); use constant { ONE => 0, TWO => 1, THREE => 2, FOUR => 3, FIVE => 4, SIX => 5, SEVEN => 6, EIGHT => 7, NINE => 8, TEN => 9, }; cmpthese ( -5, { hash => q/ my $n = sum( @hash{ 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten' } ); /, array => q/ my $n = sum( @array[ ONE, TWO, THREE, FOUR, FIVE, SIX, SEVEN, EIGHT, NINE, TEN ] ); /, } );


Rate hash array hash 1673833/s -- -46% array 3103809/s 85% --

I've got the 2nd and 6th edition of Learning Perl, and haven't been able to find anywhere in either edition where the book states that the efficiency advantages of arrays versus hashes manifest themselves when using shift/unshift/push//pop. Nor have I found anywhere in the book that states that a hash lookup in Perl costs the same as an array lookup. If you know of somewhere in that book I should be reading, please provide the exact quote, along with which edition I might find it in.

I think you're probably referring to this paragraph:

Some implementations of hashes (such as in the original awk language, where Larry borrowed the idea from) slow down as the hashes get larger and larger. This is not the case in Perl—it has a good, efficient, scalable algorithm. So, if a hash has only three key-value pairs, it’s very quick to “reach into the barrel” and pull out any one of those. If the hash has three million key-value pairs, it should be just about as quick to pull out any one of those. A big hash is nothing to fear.

That paragraph makes an assertion that the computational complexity of hash inserts or lookups is approximately constant, as the size of the hash grows toward infinity. And the same could be said of arrays. But order of growth in computational complexity says nothing about the amount of computational complexity there is that stays constant as the data set grows.

To understand this concept, look at the following two simple scenarios:

  • Find the numeric value 100 in an unsorted array of unique integers: We know that as the array grows, the growth in computational complexity is linear.
  • Find the string "David" in an unsorted array of strings.: We know that as the array grows, the growth in computational complexity is linear.

So those are both O(n) searches. However, to find numeric 100, we do a series of numeric comparisons while iterating over the elements of the array. To find "David", we must go to the first element and see if the first character in the first element is a "D". If it is, then we compare the second character and see if it is an "a". Each letter of the string results in another char comparison... essentially an integer comparison for each character of the string (short-circuiting as soon as there's a mismatch)... repeated for each element in the data set.

So it's pretty easy to see in this example that the string comparisons have an overhead per element that doesn't exist for the numeric comparisons.

Now look at what goes on in hash lookups versus array lookups. To look up a single element in an array, internally, Perl must first start with the base address of the array. Then it must add an integral offset to the array's base address. This offset will be the element number times the size of the element's structure. This is a simple mathematical calculation; base + offset * size. Once that calculation is performed, Perl can dereference a pointer contained in the element, and fetch the element's scalar entity.

Now for a Perl hash. First Perl must examine the string of some arbitrary length, and using a hashing algorithm to compute the bucket. That bucket may contain more than one element in the form of a linked list. The linked list would then be walked to arrive at the specific element being searched for. Fortunately the linked list chains are always kept pretty short, so we don't worry about their growth. But now instead of a simple base+offset*size equation, we feed a (usually) multi-byte string to a hashing algorithm, that gives us an offset, that leads us to a short chain of elements to scan through.

So the array lookup has some constant overhead, and the hash lookup has some constant overhead (overhead that doesn't change as the size of the structure grows). But the amount of constant overhead for the array is smaller than the amount of constant overhead for the hash.

We mostly don't worry about that. If we were looking for bleeding edge performance we would be using C or C++ to begin with. But the point to my mention of efficiency in this case was this: Sometimes function tables are used inside of tight loops. And infrequently, those tight loops become a performance problem. In such cases, switching from a hash based dispatch table to an array one will bring some performance improvement.

The more important point is that if one needs to assure true "switch-like" behavior, where the order of comparisons is predictable, an array is more appropriate; a hash based dispatch table doesn't guarantee a predictable order in which cases will be tested for a match.


In reply to Re^3: Given When Syntax by davido
in thread Given When Syntax by Deep_Plaid

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or or How to display code and escape characters are good places to start.
Log In?

What's my password?
Create A New User
Domain Nodelet?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others musing on the Monastery: (6)
As of 2022-08-10 10:42 GMT
Find Nodes?
    Voting Booth?

    No recent polls found