Re^5: RFC: Is there a solution to the flaw in my hash mechanism? (And are there any others?)

by RichardK (Parson)
on May 30, 2015 at 14:24 UTC


in reply to Re^4: RFC: Is there a solution to the flaw in my hash mechanism? (And are there any others?)
in thread RFC: Is there a solution to the flaw in my hash mechanism? (And are there any others?)

What about this instead?

    for( my $i = 0; $i < 17; ++$i ) {
        my $j = $i;
        printf "%2u: %s\n", $i,
            join ' ', map { sprintf "%2u", $j = ( $j + 13 ) % 17 } 0 .. 16;
    }

     0: 13  9  5  1 14 10  6  2 15 11  7  3 16 12  8  4  0
     1: 14 10  6  2 15 11  7  3 16 12  8  4  0 13  9  5  1
     2: 15 11  7  3 16 12  8  4  0 13  9  5  1 14 10  6  2
     3: 16 12  8  4  0 13  9  5  1 14 10  6  2 15 11  7  3
     4:  0 13  9  5  1 14 10  6  2 15 11  7  3 16 12  8  4
     5:  1 14 10  6  2 15 11  7  3 16 12  8  4  0 13  9  5
     6:  2 15 11  7  3 16 12  8  4  0 13  9  5  1 14 10  6
     7:  3 16 12  8  4  0 13  9  5  1 14 10  6  2 15 11  7
     8:  4  0 13  9  5  1 14 10  6  2 15 11  7  3 16 12  8
     9:  5  1 14 10  6  2 15 11  7  3 16 12  8  4  0 13  9
    10:  6  2 15 11  7  3 16 12  8  4  0 13  9  5  1 14 10
    11:  7  3 16 12  8  4  0 13  9  5  1 14 10  6  2 15 11
    12:  8  4  0 13  9  5  1 14 10  6  2 15 11  7  3 16 12
    13:  9  5  1 14 10  6  2 15 11  7  3 16 12  8  4  0 13
    14: 10  6  2 15 11  7  3 16 12  8  4  0 13  9  5  1 14
    15: 11  7  3 16 12  8  4  0 13  9  5  1 14 10  6  2 15
    16: 12  8  4  0 13  9  5  1 14 10  6  2 15 11  7  3 16

Replies are listed 'Best First'.
Re^6: RFC: Is there a solution to the flaw in my hash mechanism? (And are there any others?)
by BrowserUk (Patriarch) on May 30, 2015 at 15:10 UTC

    The problem with that is you lose the "de-clustering" effect. (Note the regularity of the columns in your table, and the apparent "randomness" of mine.)

    That is, because the step size has become a constant -- albeit with a different starting offset for each input value $i -- consecutive inputs tend to cluster rather than being evenly distributed.

    And research shows that the downside of clustering is an increase in retries. Not only do they start earlier (when the fill ratio is lower); the clusters also tend to mean more retries before you find an empty slot.
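    The clustering effect can be seen in a small simulation. The sketch below is mine, not BrowserUk's code: it inserts 13 keys that all hash to slot 0 of a 17-slot open-addressed table, once with a constant probe step and once with a key-dependent step (a stand-in second hash, `1 + key % (size-1)`), and counts the retries each scheme needs.

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    # Rough illustration only: 13 keys, all colliding at slot 0 of a
    # 17-slot table, inserted with two different probe-step schemes.
    my $SIZE = 17;
    my @keys = map { $_ * $SIZE } 0 .. 12;    # all hash to slot 0

    sub insert_all {
        my ($step_for) = @_;                  # coderef: key -> probe step
        my @table   = (undef) x $SIZE;
        my $retries = 0;
        for my $key (@keys) {
            my $slot = $key % $SIZE;
            my $step = $step_for->($key);
            while ( defined $table[$slot] ) {
                $slot = ( $slot + $step ) % $SIZE;
                ++$retries;
            }
            $table[$slot] = $key;
        }
        return $retries;
    }

    # Constant step: every colliding key walks the same probe chain,
    # so the i-th insertion retries i times: 0+1+...+12 = 78 retries.
    my $const = insert_all( sub { 13 } );

    # Key-dependent step (never 0): colliding keys take different
    # chains and scatter quickly: 12 retries here.
    my $keyed = insert_all( sub { 1 + $_[0] % ( $SIZE - 1 ) } );

    print "constant step retries: $const\n";
    print "key-dependent retries: $keyed\n";
    ```

    The constant-step scheme pays quadratically in retries once chains merge, which is the "clustering" cost described above.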

    Of course, that only affects applications that tend to store consecutive inputs; and in the normal way of things, a good hashing function can be used to negate it.

    But for my application, as the keys are themselves numbers -- and the priority is lookup performance -- it makes sense to avoid the cost of a hashing function and use the numbers (% table size) directly.

    For my application, the likelihood of consecutive inputs is pretty much indeterminable: it is a function of the statistical distribution of the DNA being processed, and of its length -- quite literally a "how long is a (piece of) string" problem. But the possibility of large numbers of consecutive numbers being stored -- although they may be generated out of sequence, the effect is the same -- is sufficiently high that if there is an alternative that retains the de-clustering effect, I'd rather use it.

    I'm currently considering a special case & different code path for when the first probe calculation yields 0. For example, use a simple linear probe (+1) for that case only; or maybe +prime/2, or something similar.
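    That special case might look something like the following sketch. This is a hypothetical illustration of the idea, not the actual implementation; the step derivation (`$key % $PRIME`) and the table size are stand-in assumptions.

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hypothetical sketch of the special case discussed above: derive
    # the probe step from the key, but fall back to a fixed linear
    # step when the derived step would be 0 (a zero step would spin
    # on the same slot forever).
    my $PRIME = 17;                       # stand-in table size (prime)

    sub probe_step {
        my ($key) = @_;
        my $step = $key % $PRIME;         # assumed step derivation
        return $step == 0
             ? 1                          # special case: linear probe
             : $step;                     # normal: key-dependent step
    }

    sub next_slot {
        my ( $slot, $key ) = @_;
        return ( $slot + probe_step($key) ) % $PRIME;
    }

    printf "step for key 34: %u\n", probe_step(34);   # 34 mod 17 is 0, so 1
    printf "step for key 40: %u\n", probe_step(40);   # 40 mod 17 is 6
    ```

    The branch is exactly the conditional test mentioned below; whether its cost is paid back is what the simulations would have to show.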

    It does add a conditional test at the heart of both the insertion and lookup code; so I'd have to run some large-scale simulations to see whether the cost of the test is offset by the de-clustering effect.

    I've only read that the latter is beneficial, so I might be chasing a red herring here. There seem to be several "good practices" regarding hashes that you can trace back to one basic source on the web, but for which you can't find any supporting evidence.

    E.g. if you search for the phrase "Item (3) has, allegedly, been shown to yield especially good results in practice.", you'll find many reiterations of the same information -- it's hard to determine which was the original -- but nowhere can I find who alleged it; when; where; or based upon what evidence.

    But that's the 'net in a nutshell. Do a search for "recipe"; pick one at random; pick out a fairly distinctive phrase from that recipe and search for that, and you'll often find a couple of hundred or more people claiming the exact same recipe as their own.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
    In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

      Sure. I think this is always a problem with hashing algorithms: picking the right hash is somewhere between hard and impossible. I totally agree that the lack of evidence for many claims is appalling. It seems that many people are incapable of, or unwilling to perform, even the simplest statistical analysis of their results.

      But as I said in an earlier thread, I'd use a b-tree for this. You have very specific requirements, so you will have to implement custom code to do what you need. The code for general-purpose b-trees can be quite complex, but you can eliminate everything you don't need. Of course, I don't know the exact details of the problem, so it's only a suggestion of something else to think about ;)

        But as I said in an earlier thread, I'd use a b-tree for this ... so it's only a suggestion of something else to think about ;)

        Despite my earlier somewhat flippant remark about my history with B-trees, I did give them some thought at that time. This is the logic I used to exclude them for this.

        The main advantages B-trees have over binary trees are:

        1. Reduction in the number of pointers needed.

          But they still require quite a lot of pointers; and at 8 bytes each, that is a significant memory cost.

        2. The "blocking" of similar values.

          With on-disk DBs, this leads to a significant saving in expensive disk reads; but that is not applicable here.

          The increased locality of reference would play to the strengths of CPU caching; but it would require very careful design and tuning to get the most out of that. (Think Judy-array complexity.)

        But in the final analysis, a B-tree is O(log N) for lookups; and any advantages a B-tree might have over other trees are only titivating at the edges: some reduction in memory requirement; some potential for reduced cache misses.

        My (crude & flawed) test implementation of this hash has demonstrated O(1.021) average lookup probes for 175 million keys in a 200,000,033-slot array (87% fill); and both the insert & lookup code is trivial, and very fast.

        With a b-tree, the average lookup across 175 million keys requires 8.25 probes; and the structure would require three times the space.
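        As a back-of-the-envelope check (mine, not part of the original post): a b-ary tree over N keys needs roughly log_b(N) levels, so the quoted 8.25 probes for 175 million keys corresponds to a branching factor of roughly 10.

        ```perl
        #!/usr/bin/perl
        use strict;
        use warnings;

        # Rough check: levels needed by a b-ary search tree over N keys
        # is about log_b(N) = ln(N) / ln(b).
        my $n = 175_000_000;
        for my $b ( 2, 10, 100 ) {
            printf "branching %3u: %5.2f levels\n", $b, log($n) / log($b);
        }
        # branching   2: 27.38 levels
        # branching  10:  8.24 levels
        # branching 100:  4.12 levels
        ```

        A plain binary tree would need ~27 probes for the same data, which is why the blocking of a B-tree helps -- but even so it cannot approach the ~1 probe of the open-addressed hash.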


