http://qs321.pair.com?node_id=962861


in reply to Re: The statistics of hashing.
in thread [OT] The statistics of hashing.

Wikipedia is correct (well, agrees quite closely with some of my calculations). But you have extrapolated from that quite incorrectly.

0.75**4 is the odds of, after inserting 1.1e5 strings, that there has been at least one collision in each of the four hashes. It is not even close to the odds of there being a single string that produced a simultaneous collision in all 4 hashes.

Those odds would be more like 0.75*(0.75/1.1e5)**3 or about 2e-16. My calculations came up with an upper bound of 9.5e-15.

I believe the 9.5e-15 number is much, much more accurate. It only discounts the possibility of there being fewer than 1.1e5 bits set. I suspect the ratio of "expected bits set"/1.1e5 to be very close to 1.

While the 2e-16 number is based on there being a 75% chance of exactly one collision. But there is a 25% chance of 0 collisions and a 75% chance of 1 or more collisions. And the possibility of 2 collisions approximately doubles the odds for some subset of the possible outcomes and so has a much bigger impact on the final total than does the possibility of there being 1.1e5-1 bits set instead of 1.1e5.

For example, consider tossing a coin twice. There is a 75% chance of getting at least one "head". Do 4 such trials. For each head, assign it a random position in 1.1e5 slots (but only one head per slot per trial). The odds of the 4 trials producing 4 heads in one of those slots (one from each trial) one might estimate as:

.75* (.75/x) * (.75/x) * (.75/x)

Where x==1.1e5. Or, 2.3e-16. But to find the accurate odds you have to realize that there is a 50% chance of one "head" (per trial) and a 25% change two "heads" (per trial). So the actual odds are found via:

.5 * (.5/x) * (.5/x) * (.5/x) + .5 * (.25*2/x) * (.5/x) * (.5/x) + .5 * (.5/x) * (.25*2/x) * (.5/x) + .5 * (.25*2/x) * (.25*2/x) * (.5/x) + .5 * (.5/x) * (.5/x) * (.25*2/x) + .5 * (.25*2/x) * (.5/x) * (.25*2/x) + .5 * (.5/x) * (.25*2/x) * (.25*2/x) + .5 * (.25*2/x) * (.25*2/x) * (.25*2/x) + .25* (.5/x1) * (.5/x1) * (.5/x1) + .25* (.25*x12) * (.5/x1) * (.5/x1) + .25* (.5/x1) * (.25*x12) * (.5/x1) + .25* (.25*x12) * (.25*x12) * (.5/x1) + .25* (.5/x1) * (.5/x1) * (.25*x12) + .25* (.25*x12) * (.5/x1) * (.25*x12) + .25* (.5/x1) * (.25*x12) * (.25*x12) + .25* (.25*x12) * (.25*x12) * (.25*x12)

Where x1==1.1e5-1 and x12==1/(1.1e5-1)+1/(1.1e5-2). That raises the odds by a factor of 2.37 to 5.6e-16.

That type of magnification of the odds will be much more pronounced in the case originally discussed and I bet would result in nearly the 48-fold increase required to get to the odds of my upper bound calculation.

- tye