PerlMonks

### Re: [OT] The statistics of hashing. (birthday)

by tye (Sage)
on Apr 01, 2012 at 04:03 UTC ( #962829=note )

in reply to [OT] The statistics of hashing.

Although this sounds similar to the birthday problem, it is mostly much simpler (and partly much harder).

The odds of a collision at each step are quite simply $setBits/2**32. So the odds of having 4 collisions in a row are prod(@setBits)/2**(4*32). So, if after inserting 100 items, your 2**32-bit vectors have, respectively, 99, 100, 100, and 98 bits set, then the odds of a 4-fold collision are 99*100*100*98/2**(32*4), or about 1 in 1e31.
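That per-step arithmetic is easy to sketch. A Python version (the thread's code is Perl; the helper name `four_fold_odds` is mine, purely illustrative):

```python
# Per-step false-positive odds for four independent 2**32-bit vectors,
# following the arithmetic above: prod(bits set in each vector) / 2**(4*32).

def four_fold_odds(set_bits):
    """set_bits: number of bits currently set in each of the four vectors."""
    p = 1.0
    for b in set_bits:
        p *= b / 2**32
    return p

odds = four_fold_odds([99, 100, 100, 98])
print(odds)   # roughly 2.9e-31, i.e. about 1 in 1e31
```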

Now, the odds of there being 100 bits set after setting 100 of the bits: that is the same as the birthday problem. So, if you wanted to predict the likely odds of a collision after inserting, say, 1e4 strings, you would first compute the odds of the full 1e4 different bits being set. The birthday problem tells us that the odds are about 98.84% that all 1e4 bits would be set in a given vector; that is, there is a 1.16% chance of at least one collision (in a given vector).

```
Bits  Collision odds
set   (2**32 bits)
----  --------------
1e3     0.01%
1e4     1.16%
1e5    68.8%
1e6   100% (to about 50 digits of accuracy)
```

As calculated using Math::BigApprox:

```
use Math::BigApprox 'c';
my $bits = 2**32;
my $inserts = 1e5;
print 1-(c($bits-$inserts)^c($bits-1))/c($bits)**$inserts, $/;
```
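The same table can be cross-checked in plain Python (a sketch, not the Math::BigApprox code): P(no collision after n inserts into B bits) is prod(1 - i/B) for i from 0 to n-1, and summing logs keeps the product from underflowing.

```python
import math

# Birthday-problem collision odds for a B-bit vector after n inserts.
# P(no collision) = prod_{i=0}^{n-1} (1 - i/B); sum logs for stability.

def collision_odds(n, B=2**32):
    log_p_none = sum(math.log1p(-i / B) for i in range(n))
    return 1.0 - math.exp(log_p_none)

for n in (10**3, 10**4, 10**5):
    print(n, collision_odds(n))
# 1e3 -> ~0.00012, 1e4 -> ~0.0116 (1.16%), 1e5 -> ~0.688 (68.8%)
```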

So, when you insert 1e6 strings, you are not going to end up with 1e6 bits set. But getting a precise value for the "expected number of bits set" is rather complex, and I wouldn't bother: I'd just count the number of bits that are actually set as you go along.

Then, when you go to insert the next string, just calculate $b1*$b2*$b3*$b4/2**(32*4) to get the odds of a false positive on that insertion.

But inserting about 1e9 strings into 2**32-bit vectors means that about 23% of the bits are going to be set. So your 1e9th insert is going to have about a 0.28% chance of being a false positive.

If you want to calculate the odds of there having been at least one false positive across the entire sequence of 1e6 inserts, then you'd compute 1-prod(@odds), where @odds contains the odds of no false positive at each step. So $odds[$i] would be 1 - $b1*$b2*$b3*$b4/2**(32*4), and you can calculate it that way, by just keeping a running product of that value.

Note, however, that prod(@odds) will stay exactly 1.0 when using floating point until you get to the point that $b1*$b2*$b3*$b4/2**(32*4) creeps above about 1/2**$mantissaBits (which won't happen until about half a million inserts for a typical 'double'). I'm not sure how much of an impact that will have on the accuracy by the late stages, where the only interesting odds are. But it seems it might be quite significant.

But you can compensate for this by instead tracking 1-prod(@odds). That starts out simply as 1*1*1*1/2**(32*4) after the first insert. Computing 1-prod(@odds[0..$n]) given 1-prod(@odds[0..$n-1]) and $odds[$n] isn't hard: it is just $prod += $odds - $prod*$odds (and that last term doesn't matter for a very long time). Just to be clear, $odds is just $b1*$b2*$b3*$b4/2**(32*4), where $b1 is the number of bits set in the first vector.
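A Python sketch of that running update, under the idealised assumption that no single-bit collisions occur (so each vector has exactly n bits set after n inserts):

```python
# Track p = 1 - prod(1 - odds_i), the odds of at least one false positive
# so far, via the incremental update p += odds - p*odds described above.
# Idealised: vector i has exactly n bits set after n inserts.

B = 2**32
p = 0.0
for n in range(1, 10**5 + 1):
    odds = (n / B) ** 4        # $b1*$b2*$b3*$b4 / 2**(32*4) with all b = n
    p += odds - p * odds
print(p)   # ~6e-15 after 1e5 inserts
```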

So, doing that calculation as you go will give you the odds of having had at least one false positive so far. After 1e5 inserts it will only be about 6e-15, depending on how many bit collisions you run into by then.

You can also get an upper bound on it for however many strings you have by just running through the 1e9+ calculations assuming no single-bit collisions ever happen. I won't wait for that to finish running here.

But it turns out that I don't have to:

```
Odds  Inserts
----  -------
 1%    ~28e6
10%    ~45e6
50%    ~65e6
90%    ~83e6
99%    ~96e6
```
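Those numbers can be reproduced without running the 1e9-step loop. Assuming no single-bit collisions, the cumulative no-false-positive product is approximately exp(-sum of (n/2**32)**4), and the sum is close to N**5/(5*2**128). A hedged Python sketch (the closed form is my approximation, not from the post):

```python
import math

# P(at least one false positive after N inserts), 4 hashes of 2**32 bits,
# using 1 - exp(-N**(k+1) / ((k+1) * B**k)) as a closed-form approximation
# of the running product described earlier.

def cumulative_fp(N, B=2**32, k=4):
    return 1.0 - math.exp(-N**(k + 1) / ((k + 1) * B**k))

for N in (28e6, 45e6, 65e6, 83e6, 96e6):
    print(f"{N:.0e}  {cumulative_fp(N):.0%}")
# about 1%, 10%, 49%, 90%, 99% -- matching the table to its precision
```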

- tye

Replies are listed 'Best First'.
Re^2: [OT] The statistics of hashing. (birthday)
by BrowserUk (Pope) on Apr 01, 2012 at 08:13 UTC
```
Odds  Inserts
----  -------
 1%    ~28e6
10%    ~45e6
50%    ~65e6
90%    ~83e6
99%    ~96e6
```

By those values, the odds of not having seen a duplicate by the time you reached 100 million inserts are so low as to be a pretty damn good definition of 'impossible'.

And yet, empirically, none had been seen by the time I reached 779,967,210.

And after 1.5 billion inserts, that calculation suggests that the odds of finding a value that doesn't match would be minuscule, and the "possible dups" count should be growing at almost the same rate as the new inserts are being tested.

The reality is that I've only had 1323 collisions after 1.5 billion inserts. These are the last 8:

```
...
1585072919
1585138596
1585355107
1585418593
1585422956
1585468018
1585558107
1585577984
```

They are coming much more regularly now, but they are actually still within the bounds of usability.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.


You appear to have not fully understood several parts of what I wrote. (Which is not an insult.)

And after 1.5 billion inserts, that calculation suggests that the odds of finding a value that doesn't match would be minuscule, and the "possible dups" count should be growing at almost the same rate as the new inserts are being tested.

That actually is not a valid conclusion from anything I wrote.

The odds of the next insert being a collision after about 1.5e9 inserts are about (1.5e9/2**32)**4, or just under 1.5%, and 98.5% is not "minuscule" (and about 1/100th as fast is not "almost the same rate"). But that also ignores that there won't actually be 1.5e9 bits set. You make no mention of how many bits are set, which is how you'd get an accurate calculation, so I'm curious whether you perhaps didn't fully appreciate that point (as opposed to some other reason, like it perhaps not being easy to query the number of bits set in your current run, etc.).

Probably much more important is that in some replies you hint that you are using 10 hashes not 4 hashes now.

I suspect that is the case in the numbers you show in your reply above, because you don't show collisions happening on the order of every 1e2 inserts. With 10 hashes, the above odds drop to around 2.7e-5, and that is in the ballpark of what you report seeing.
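The 4-hash vs 10-hash comparison is a one-liner. A Python sketch, again idealised so that bits set is roughly equal to inserts:

```python
# Per-insert collision odds after n inserts with k hashes into 2**32-bit
# vectors, idealised (bits set per vector ~= n): (n/2**32)**k.

def per_insert_odds(n, k, B=2**32):
    return (n / B) ** k

print(per_insert_odds(1.5e9, 4))    # ~0.0149, i.e. just under 1.5%
print(per_insert_odds(1.5e9, 10))   # ~2.7e-5
```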

If you need some help redoing the calculations for 10 hashes instead of 4, then please speak up. I was not exhaustive in my explanations nor expansions of formulae, so I would expect more clarification could be required. I can also understand you not taking the time to seriously study what I wrote yet given how the numbers seem so far off from your observations.

- tye

You appear to have not fully understood several parts of what I wrote.

You are right. In as much as, even re-reading your post in its entirety, I fail to understand what this table of odds represents, if not your calculation for the use of 4x32-bit hashes:

```
Odds  Inserts
----  -------
 1%    ~28e6
10%    ~45e6
50%    ~65e6
90%    ~83e6
99%    ~96e6
```
Probably much more important is that in some replies you hint that you are using 10 hashes not 4 hashes now.

The (still) running program is using 10 hashes. But I am not yet convinced that the extra 6 hashes actually buys anything. See below.

The program has currently covered just under half of the maximum 2**32 trials and has found only 8359 collisions. The last 10 of which were at positions:

```
1933560856
1933561794
1933600874
1933619074
1933633177
1933665039
1933675171
1933675769
1933690902
1933708679
```

I am reluctant to interrupt it before it reaches at least half way, and -- depending on the rate at which the collision rate increases -- may let it run on to 3 billion. Hence, I cannot consider adding code to count the bits currently set per vector, as you described, at this stage.

#### 10 hashes

As described in my OP, I have (artificially) expanded the number of subhashes by permuting the 4 natural 32-bit subdivisions of the MD5 and XORing them to produce the other 6.

But, as moritz suggests here, it is not at all clear that this process actually decreases the odds of collisions, because it doesn't increase the amount of information available (he calls it entropy, but I'm uncomfortable with that term in this context).

It is my intention to verify that during the second pass (to isolate how many of the collisions are actual dups and how many are false positives). At the same time, I will also perform the collision check using just the 4 natural hashes and that will tell me whether the permuted XORs buy me anything or not. But that will take another week or so.

If you need some help redoing the calculations

If you could distill your write-up to a formula, that would be helpful.


