Re^6: [OT] The statistics of hashing.

Let's look at the probability of getting "at least one dup" (instead of "exactly one dup").
Let's also initially deal with the case where we're selecting (at random) only one number (instead of 4 or 10) each time.

Let P(0) be the probability that the very first selection did not produce a duplicate:
P(0) = (4294967295/4294967296)**0 # == 1, obviously

Let P(1) be the probability that the second selection did not produce a duplicate:
P(1) = (4294967295/4294967296)**1

Let P(2) be the probability that the third selection did not produce a duplicate:
P(2) = (4294967295/4294967296)**2

and so on:
Let P(1e9) be the probability that the 1000000001st selection did not produce a duplicate:
P(1e9) = (4294967295/4294967296)**1e9

(In general terms, P(x-1) is simply the probability that none of the x-1 selections already made match the xth selection.)

Then the probability that we can make 1000000001 random selections in the range (1 .. 4294967296) and get zero duplicates is
P(0)*P(1)*P(2)*P(3)*...*P(1e9).
That equates to (4294967295/4294967296)**Z, where
Z = 0+1+2+3+...+1e9 = 1e9*(1e9+1)/2.

So, the probability D that we can make 1000000001 selections and have at least 1 duplicate is
D = 1 - ((4294967295/4294967296)**Z)
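For anyone who wants to plug numbers in, here is a minimal Perl sketch of the formula above. The sub name and its arguments are my own invention for illustration, not part of the derivation. It works through logarithms so that the size of the exponent stays visible even when the no-duplicate probability is too small for a double.

```perl
use strict;
use warnings;

# Hypothetical helper (the name is illustrative only): probability of at
# least one duplicate among $n selections from $N values, using the model
# above: D = 1 - ((N-1)/N)**Z with Z = 0+1+...+(n-1).
sub at_least_one_dup {
    my ($N, $n) = @_;
    my $Z        = ($n - 1) * $n / 2;           # closed form of 0+1+...+(n-1)
    my $log_none = $Z * log(($N - 1) / $N);     # ln of P(0)*P(1)*...*P(n-1)
    return (1 - exp($log_none), $Z, $log_none);
}

my ($D, $Z, $log_none) = at_least_one_dup(4294967296, 1e9 + 1);
printf "Z           = %.6e\n",  $Z;
printf "ln(no dups) = %.6e\n",  $log_none;
printf "D           = %.15g\n", $D;
```

With these inputs ln(no dups) is on the order of -1e8, so the no-duplicate probability underflows to 0 and D comes out as 1 to the limits of floating point.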

If we're doing that 4-at-a-time, then we need to calculate D**4; doing it 10-at-a-time we calculate D**10.

Is that sane? Does it produce sane results? (I think it should, but I don't have time to check.)
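One cheap way to check is to shrink the problem until brute force is feasible and compare. The sketch below does that for a range of 1000 values and 30 selections. Those sizes, the trial count and the seed are arbitrary choices of mine, and it only tests the one-at-a-time case, not the D**4 or D**10 idea.

```perl
use strict;
use warnings;

# Small-scale sanity check (all sizes are arbitrary choices):
# compare the formula with a brute-force simulation.
my ($N, $n, $trials) = (1000, 30, 20_000);
srand(42);                                  # fixed seed for repeatable runs

# Formula from the post: D = 1 - ((N-1)/N)**Z, Z = 0+1+...+(n-1)
my $Z       = ($n - 1) * $n / 2;
my $formula = 1 - (($N - 1) / $N) ** $Z;

# Simulation: count the trials in which some value is drawn twice
my $hits = 0;
for my $trial (1 .. $trials) {
    my %seen;
    for my $draw (1 .. $n) {
        my $pick = int(rand($N));
        if ($seen{$pick}++) { $hits++; last }   # first duplicate ends the trial
    }
}
my $simulated = $hits / $trials;

printf "formula   D = %.4f\n", $formula;
printf "simulated D = %.4f\n", $simulated;
```

If the reasoning holds, the two figures should be close, though I wouldn't expect them to match exactly given sampling noise.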

10-MINUTES LATER AFTERTHOUGHT: I don't think the "D**4" and "D**10" calculations actually tell us what we want ... gotta think about it a bit more ...

Cheers,
Rob
