http://qs321.pair.com?node_id=962866


in reply to Re^2: The statistics of hashing.
in thread [OT] The statistics of hashing.

Unfortunately, a good deal of what you read on wikipedia is less than reliable.

Indeed, but the errors on Wikipedia are not evenly distributed. On a subject such as the birthday attack I'd expect Wikipedia's article to be on par with any other authority (short-lived events of vandalism notwithstanding).

But the question remains, how to calculate the probabilities of the mechanism.

The exact calculation involves some big numbers. But assuming that c($x, $y) is the "pick $y from $x" function used in combinatorics, then the probability of a collision for $n strings and an evenly distributed 32-bit hash function should be:

$p = 1 - ( factorial($n) * c(2**32, $n) / (2**32)**$n )

Big numbers. Horrible to calculate. Can be approximated though...

sub e () { 2.718281828 }
my $t = ($n**2) / (2**33);
$p = 1 - ( e ** -$t );

Calculating $p is still horrible, but calculating $t is easier. If $t is above 20 then $p is 1.00000 when rounded to 6 significant figures.
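For anyone who wants to check the approximation against the exact figure without big-number arithmetic, here is a rough sketch. It relies on the fact that factorial($n) * c($N, $n) / $N**$n is the same as the product of (1 - $k/$N) for $k from 0 to $n-1, so the product can be accumulated as a sum of logs. The sub names collision_exact and collision_approx are mine, and plain floating point is assumed to be accurate enough for this purpose.

```perl
use strict;
use warnings;

# Exact birthday probability for $n items in 2**$bits slots, computed as
# 1 - prod_{k=1}^{n-1} (1 - k/N). The product is accumulated as a sum of
# logs so that no huge factorials or binomials are ever formed.
sub collision_exact {
    my ($n, $bits) = @_;
    my $slots = 2 ** $bits;
    return 1 if $n > $slots;    # pigeonhole: a collision is guaranteed
    my $log_no_collision = 0;
    $log_no_collision += log(1 - $_ / $slots) for 1 .. $n - 1;
    return 1 - exp($log_no_collision);
}

# The approximation from the post: p = 1 - e**(-n**2 / 2**(bits+1)).
sub collision_approx {
    my ($n, $bits) = @_;
    my $t = $n ** 2 / 2 ** ($bits + 1);
    return 1 - exp(-$t);
}

for my $n (10_000, 50_000, 100_000, 414_000) {
    printf "%7d  exact %.6f  approx %.6f\n",
        $n, collision_exact($n, 32), collision_approx($n, 32);
}
```

The loop is linear in $n, so even half a million strings takes well under a second; the two columns should agree closely across this range.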

Thus you can effectively be sure to have a collision with a 32-bit hash function once $t is above 20. You can figure out an $n which triggers $t to be 20 using:

$n = sqrt(20 * (2 ** 33));

It's about 414,000. So with 414,000 strings, you are effectively certain to get a collision on a 32-bit hash function.
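The same rearrangement works for any target, not just $t = 20. A small sketch, assuming the approximation above holds; strings_for_t and strings_for_probability are names I made up for illustration:

```perl
use strict;
use warnings;

# Invert t = n**2 / 2**(bits+1) to get the n that produces a given t.
sub strings_for_t {
    my ($t, $bits) = @_;
    return sqrt($t * 2 ** ($bits + 1));
}

# Invert p = 1 - e**-t first (t = -ln(1 - p)), then solve for n.
# Only meaningful for 0 <= p < 1.
sub strings_for_probability {
    my ($p, $bits) = @_;
    return strings_for_t(-log(1 - $p), $bits);
}

printf "t = 20   : about %.0f strings\n", strings_for_t(20, 32);
printf "p = 0.5  : about %.0f strings\n", strings_for_probability(0.5, 32);
printf "p = 0.99 : about %.0f strings\n", strings_for_probability(0.99, 32);
```

Changing the second argument lets you see how quickly the threshold grows with wider hashes, since $n scales with the square root of the number of slots.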

Where I think my reasoning and tye's differ (and tye is almost certainly correct here - blame it on me answering late at night) is that I was then looking at the probabilities that you will have had collisions in all four (or ten) hash functions at the end of the entire run. With even half a million strings, that is a given.

What you're actually doing is looking at events where a single string triggers a simultaneous collision in all the hash functions. I defer to tye's calculations for that.

perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'