Beefy Boxes and Bandwidth Generously Provided by pair Networks
go ahead... be a heretic
 
PerlMonks  

Re^2: [OT] The statistics of hashing. (birthday)

by BrowserUk (Patriarch)
on Apr 01, 2012 at 08:13 UTC ( [id://962862]=note: print w/replies, xml ) Need Help??


in reply to Re: [OT] The statistics of hashing. (birthday)
in thread [OT] The statistics of hashing.

Odds Inserts ---- ------- 1% ~28e6 10% ~45e6 50% ~65e6 90% ~83e6 99% ~96e6

By those values, the odds against not having seen a duplicate by the time you reached 100 million inserts are so low as to be a pretty damn good definition of 'impossible'.

And yet, empirically, none had been seen by the time I reached 779,967,210.

And after 1.5 billion inserts, that calculation suggests that the odds of finding a value that doesn't match would be minuscule, and the "possible dups" count should be growing at almost the same rate as the new inserts are being tested.

The reality is that I've only had 1323 collisions after 1.5 billion inserts. These are the last 8:

... 1585072919 1585138596 1585355107 1585418593 1585422956 1585468018 1585558107 1585577984

They are coming much more regularly now, but they are actually still within the bounds of usability.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

Replies are listed 'Best First'.
Re^3: [OT] The statistics of hashing. (4<10)
by tye (Sage) on Apr 01, 2012 at 19:00 UTC

    You appear to have not fully understood several parts of what I wrote. (Which is not an insult.)

    And after 1.5 billion inserts, that calculation suggests that the odds of finding a value that doesn't match would be minuscule, and the "possible dups" count should be growing at almost the same rate as the new inserts are being tested.

    That actually is not a valid conclusion from anything I wrote.

    The odds of the next insert being a collision after about 1.5e9 inserts is about (1.5e9/2**32)**4 or under 1.5%, and 98.5% is not "miniscule" (and about 1/100th as fast is not "almost the same rate"). But that is also ignoring that there won't actually be 1.5e9 bits set. But you make no mention of how many bits are set which is how you'd get an accurate calculations so I'm curious if you perhaps didn't fully appreciate that point (as opposed to some other reason like it perhaps not being easy to query the number of bits set in your current run, etc.).

    Probably much more important is that in some replies you hint that you are using 10 hashes not 4 hashes now.

    I suspect that is the case in the numbers you show it your reply above, because you don't show collisions happening on the order of every 1e2 inserts. With 10 hashes, the above odds go to around 2.7e-5. And that is in the ballpark of what you report seeing.

    If you need some help redoing the calculations for 10 hashes instead of 4, then please speak up. I was not exhaustive in my explanations nor expansions of formulae, so I would expect more clarification could be required. I can also understand you not taking the time to seriously study what I wrote yet given how the numbers seem so far off from your observations.

    - tye        

      You appear to have not fully understood several parts of what I wrote.

      You are right. In as much as, even re-reading your post in its entirety, I fail to understand what this table of odds represents, if not your calculation for the use of 4x32-bit hashes:

      Odds Inserts ---- ------- 1% ~28e6 10% ~45e6 50% ~65e6 90% ~83e6 99% ~96e6
      Probably much more important is that in some replies you hint that you are using 10 hashes not 4 hashes now.

      The (still) running program is using 10 hashes. But I am not yet convinced that the extra 6 hashes actually buys anything. See below.

      The program has currently covered just under half of the maximum 2**32 trials and has found only 8359 collisions. The last 10 of which were at positions:

      1933560856 1933561794 1933600874 1933619074 1933633177 1933665039 1933675171 1933675769 1933690902 1933708679

      I am reluctant to interrupt it before it reaches at least half way, and -- depending on the rate at which the rate of collisions increases, may let it run on to 3 billion. Hence, I cannot consider adding code to count the current bit set per vector as you described at this stage.

      10 hashes

      As described in my OP, I have (artificially) expanded the number of subhashes, by permuting the 4 natural 32-bit sub-divisions of the MD5 and XORing them to produce the other 6.

      But, as moritz suggests here, it is not at all clear that this process actually decreases the odds of collisions, because it doesn't increase the amount of information (he calls it entropy, but I'm uncomfortable with that term in this context), available.

      It is my intention to verify that during the second pass (to isolate how many of the collisions are actual dups and how many are false positives). At the same time, I will also perform the collision check using just the 4 natural hashes and that will tell me whether the permuted XORs buy me anything or not. But that will take another week or so.

      If you need some help redoing the calculations

      If you could distill your write-up to a formula, that would be helpful.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

      The start of some sanity?

        If you could distill your write-up to a formula, that would be helpful.

        I already did that for the most useful and most accurate calculation (as well as for other parts):

        It is just $prod += $odds-$pro­d*$odds (and that last term doesn't matter for a very long time). Just to be clear, $odds is just $b1*$b2*$b3*$b4/2**(32*4) where $b1 is the number of bits set in the first vector.

        Looking at what is not explicit there, the only thing I find moderately difficult to determine is that you can start with $prod initialized to 0. The value being calculated is the probability (a value between 0 and 1) that there has been at least one 4-hash collision (due to random chance) so far.

        The upper bound on this value that I pre-calculated by discounting the impact of single-bit collisions is achieved by simply assuming $b1 through $b4 are all equal to the number of insertions done so far.

        This is substantially accurate for quite a while. Making it more accurate is seriously difficult other than by computing run-specific odds as is done by the two formulae I quoted above.

        I find it difficult to predict how tightly this upper bound is likely to fit against some run-specific odds when the number of inserts is a significant fraction of the number of bits per hash, as you are trying to do.

        Your 10 hash-values via XOR is not something that I can predict the results of, other than that I expect it to lie somewhere between the two very remote extremes of "no better than 4 independent hash values" and "just as good as 10 fully independent hash values". Your results so far seem to be somewhat close to what I would expect if you have 10 fully independent hash values per string.

        So my calculations completely ignored the "10 hashes" addition to portion of your description as I see no way to accurately address it.

        If I wanted to gain additional buckets, then I would compute the additional hash values using a (moderately) cryptographically sound process. For example, you could compute md5($string) and md5('BrowserUk'.$string) to get 8 hashes that should be substantially independent (based on prior difficult analysis of the MD5 algorithm). Though, your experiment may prove that your strategy is sufficient.

        - tye        

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://962862]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others goofing around in the Monastery: (5)
As of 2024-03-28 16:25 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found