http://qs321.pair.com?node_id=962919


in reply to Re^3: [OT] The statistics of hashing. (4<10)
in thread [OT] The statistics of hashing.

You appear to have not fully understood several parts of what I wrote.

You are right. In as much as, even re-reading your post in its entirety, I fail to understand what this table of odds represents, if not your calculation for the use of 4x32-bit hashes:

Odds Inserts ---- ------- 1% ~28e6 10% ~45e6 50% ~65e6 90% ~83e6 99% ~96e6
Probably much more important is that in some replies you hint that you are using 10 hashes not 4 hashes now.

The (still) running program is using 10 hashes. But I am not yet convinced that the extra 6 hashes actually buys anything. See below.

The program has currently covered just under half of the maximum 2**32 trials and has found only 8359 collisions. The last 10 of which were at positions:

1933560856 1933561794 1933600874 1933619074 1933633177 1933665039 1933675171 1933675769 1933690902 1933708679

I am reluctant to interrupt it before it reaches at least half way, and -- depending on the rate at which the rate of collisions increases, may let it run on to 3 billion. Hence, I cannot consider adding code to count the current bit set per vector as you described at this stage.

10 hashes

As described in my OP, I have (artificially) expanded the number of subhashes, by permuting the 4 natural 32-bit sub-divisions of the MD5 and XORing them to produce the other 6.

But, as moritz suggests here, it is not at all clear that this process actually decreases the odds of collisions, because it doesn't increase the amount of information (he calls it entropy, but I'm uncomfortable with that term in this context), available.

It is my intention to verify that during the second pass (to isolate how many of the collisions are actual dups and how many are false positives). At the same time, I will also perform the collision check using just the 4 natural hashes and that will tell me whether the permuted XORs buy me anything or not. But that will take another week or so.

If you need some help redoing the calculations

If you could distill your write-up to a formula, that would be helpful.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

Replies are listed 'Best First'.
Re^5: [OT] The statistics of hashing. (formula)
by tye (Sage) on Apr 01, 2012 at 21:32 UTC
    If you could distill your write-up to a formula, that would be helpful.

    I already did that for the most useful and most accurate calculation (as well as for other parts):

    It is just $prod += $odds-$pro­d*$odds (and that last term doesn't matter for a very long time). Just to be clear, $odds is just $b1*$b2*$b3*$b4/2**(32*4) where $b1 is the number of bits set in the first vector.

    Looking at what is not explicit there, the only thing I find moderately difficult to determine is that you can start with $prod initialized to 0. The value being calculated is the probability (a value between 0 and 1) that there has been at least one 4-hash collision (due to random chance) so far.

    The upper bound on this value that I pre-calculated by discounting the impact of single-bit collisions is achieved by simply assuming $b1 through $b4 are all equal to the number of insertions done so far.

    This is substantially accurate for quite a while. Making it more accurate is seriously difficult other than by computing run-specific odds as is done by the two formulae I quoted above.

    I find it difficult to predict how tightly this upper bound is likely to fit against some run-specific odds when the number of inserts is a significant fraction of the number of bits per hash, as you are trying to do.

    Your 10 hash-values via XOR is not something that I can predict the results of, other than that I expect it to lie somewhere between the two very remote extremes of "no better than 4 independent hash values" and "just as good as 10 fully independent hash values". Your results so far seem to be somewhat close to what I would expect if you have 10 fully independent hash values per string.

    So my calculations completely ignored the "10 hashes" addition to portion of your description as I see no way to accurately address it.

    If I wanted to gain additional buckets, then I would compute the additional hash values using a (moderately) cryptographically sound process. For example, you could compute md5($string) and md5('BrowserUk'.$string) to get 8 hashes that should be substantially independent (based on prior difficult analysis of the MD5 algorithm). Though, your experiment may prove that your strategy is sufficient.

    - tye        

      If you could distill your write-up to a formula, that would be helpful. -- I already did that

      I guess that your idea of a formula and mine are different.

      where $b1 is the number of bits set in the first vector.

      The "number of bits set" in each vector at any given insert is entirely dependent upon not just how many, but what values have already been inserted.

      Which means that to use your description to calculate the probabilities, I would need to iterate the entire thing and count the number of bits set in each vector at each iteration. And then those "calculated probabilities" would only be applicable to that particular sequence of inserts.

      At which point, I might just as well just record the actual numbers of false positives, rather than calculate them.

      You'll no doubt come back and tell me that I've completely misunderstood you (again), and that everything I need to know is right there if I would only read you correctly.

      Thank you for your attempts to assist me.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

      The start of some sanity?

        You'll no doubt come back and tell me that I've completely misunderstood you (again)

        I never said that you completely misunderstood me. But I'll take your apparent lack of interest and leave it at that.

        - tye