### Re^3: [OT] The statistics of hashing.

by BrowserUk (Pope)
 on Apr 01, 2012 at 09:04 UTC (#962865)

in reply to Re^2: [OT] The statistics of hashing.
in thread [OT] The statistics of hashing.

Thanks syphilis. Your calculations make sense to me. But I'm not sure that it gels with the actual data?

Assuming I've coded your formula correctly (maybe not!), then using 10 hashes & vectors, I get the odds of having seen a dup after 1e9 inserts as (1 - ((4294967295/4294967296)**1e9) ) **10 := 0.00000014949378123.

By that point I had actually seen 13 collisions:

And looking at the figure for 4e9 := 0.00667569553892502, by which time the 10 vectors will be almost fully populated, it looks way too low to me?

I would have expected that calculation (for N=4e9) to have yielded odds of almost 1?
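For anyone wanting to reproduce those two figures, the expression can be evaluated directly. This is only a sketch of the formula as quoted above; the sub name `dup_odds` is made up for illustration, and nothing here settles whether the formula answers the right question.

```perl
use strict;
use warnings;

# Evaluates (1 - ((4294967295/4294967296)**$n)) ** $k exactly as quoted above.
# dup_odds is a hypothetical name, not something from the thread.
sub dup_odds {
    my ($n, $k) = @_;                               # inserts so far, number of vectors
    my $p_clear = (4294967295 / 4294967296) ** $n;  # chance a given bit is still clear
    return (1 - $p_clear) ** $k;                    # chance all $k bits are already set
}

printf "%g inserts, 10 vectors: %.17f\n", $_, dup_odds($_, 10) for 1e9, 4e9;
```

The two printed values should agree with the 1e9 and 4e9 figures given above to within floating-point rounding.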

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

Replies are listed 'Best First'.
Re^4: [OT] The statistics of hashing.
by syphilis (Bishop) on Apr 01, 2012 at 19:33 UTC
I get the odds of having seen a dup after 1e9 inserts as (1 - ((4294967295/4294967296)**1e9) ) **10 := 0.00000014949378123

That's not the probability of "having seen a dup", but the probability that the 1000000001st random selection of 10 numbers would be reported as a dup (ie the probability that each of the relevant bits in all 10 bit vectors was already set for that 1000000001st random selection of the 10 numbers).
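A scaled-down simulation may help illustrate that reading of the formula. It is a sketch under assumptions that are not in the thread: vectors of only 65536 bits, 3 of them, bit positions drawn from rand() rather than derived from MD5, and a hypothetical sub name `fill_and_compare`. After n inserts, the chance that a fresh random item finds all of its bits already set is the product of the vectors' fill fractions, which should land near (1 - (1 - 1/m)**n)**k.

```perl
use strict;
use warnings;

# Fill $k bit vectors of $m bits each with $n random inserts, then compare
# the product of the fill fractions with the formula's prediction.
# Assumes $m is a multiple of 8. fill_and_compare is a made-up name.
sub fill_and_compare {
    my ($m, $k, $n) = @_;
    my @vectors = ("\0" x ($m / 8)) x $k;        # $k zeroed vectors of $m bits

    for my $insert (1 .. $n) {
        for my $v (0 .. $k - 1) {
            vec($vectors[$v], int(rand $m), 1) = 1;   # one random bit per vector
        }
    }

    # Probability that a fresh random item hits a set bit in every vector.
    my $observed = 1;
    $observed *= unpack('%32b*', $_) / $m for @vectors;   # %32b* counts set bits

    my $predicted = (1 - (1 - 1 / $m) ** $n) ** $k;
    return ($observed, $predicted);
}

srand(42);   # fixed seed so repeated runs on the same perl give the same numbers
my ($observed, $predicted) = fill_and_compare(65536, 3, 32768);
printf "observed %.5f, predicted %.5f\n", $observed, $predicted;
```

Note that this only measures the chance that the next item is falsely reported as a dup; it says nothing yet about the chance of having seen at least one such report along the way.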

If I get a chance I'll try to work out the probability of "having seen a dup" in the first 1e9 iterations. (But, judging by some of the figures being bandied about, it probably has little bearing on this actual case where we're looking at MD5 hashes instead of random selections.)

Cheers,
Rob
If I get a chance I'll try to work out the probability of "having seen a dup" in the first 1e9 iterations.

Thank you if you do find that time at some point. If you could also show your workings, I might be able to wrap my muddled brain around it and stand a chance of re-applying your derivation.

it probably has little bearing on this actual case where we're looking at MD5 hashes instead of random selections.

Whilst the MD5 hash is known to be imperfect, it has been well analysed and has been demonstrated to produce a close to perfect random distribution of bits from the input data by several practical measures.

Eg. If you take any single input text, and produce its MD5; and then vary a single bit in the input and produce a new MD5, then -- on average -- half of the bits in the new MD5 will have changed relative to the original.

And if you repeat that process -- varying a single bit in the input and then comparing the original and new MD5s -- the average proportion of output bits changed will tend towards 1/2. That is about as good a measure of randomness as you can hope for from a deterministic process.
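That single-bit test is easy to try. The following is a minimal sketch using the core Digest::MD5 module; the sub name `mean_bits_changed` and the sample text are arbitrary choices for illustration. It flips each input bit in turn and averages how many of the 128 output bits differ from the original digest.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

# Average number of MD5 output bits (out of 128) that change when each
# single bit of $text is flipped in turn. Hypothetical helper name.
sub mean_bits_changed {
    my ($text) = @_;
    my $base  = md5($text);                 # 16-byte binary digest
    my $nbits = 8 * length $text;
    my $total = 0;
    for my $i (0 .. $nbits - 1) {
        my $copy = $text;
        vec($copy, $i, 1) = 1 - vec($copy, $i, 1);      # flip one input bit
        my $diff = $base ^ md5($copy);                  # string XOR of the digests
        $total += unpack('%32b*', $diff);               # count differing bits
    }
    return $total / $nbits;
}

printf "mean bits changed: %.2f of 128\n", mean_bits_changed('The quick brown fox');
```

A mean close to 64 is what the "half of the bits" claim predicts; a single short input is of course only an illustration, not a proof.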

I am aware of the limitations on the distribution of the hashes when derived from a non-full spectrum of inputs; but given that 2**32 (the maximum capacity of the vectors) represents such a minuscule proportion of the 1e44 possible inputs, I'd have to be extremely unlucky in my random selection from the total inputs for the hashing bias to actually have a measurable effect upon the probabilities of false positives.

Let's look at the probability of getting "at least one dup" (instead of "exactly one dup").
Let's also initially deal with the case where we're selecting (at random) only one number (instead of 4 or 10) each time.

Let P(0) be the probability that the very first selection did not produce a duplicate:
P(0) = (4294967295/4294967296)**0 # == 1, obviously

Let P(1) be the probability that the second selection did not produce a duplicate:
P(1) = (4294967295/4294967296)**1

Let P(2) be the probability that the third selection did not produce a duplicate:
P(2) = (4294967295/4294967296)**2

and so on:
Let P(1e9) be the probability that the 1000000001st selection did not produce a duplicate:
P(1e9) = (4294967295/4294967296)**1e9

(In general terms, P(x-1) is simply the probability that none of the x-1 selections already made match the xth selection.)

Then the probability that we can make 1000000001 random selections in the range (1 .. 4294967296) and get zero duplicates is
P(0)*P(1)*P(2)*P(3)*...*P(1e9).
That equates to (4294967295/4294967296)**Z, where
Z = 0+1+2+3+...+1e9.

So, the probability D that we can make 1000000001 selections and have at least 1 duplicate is
D = 1 - ((4294967295/4294967296)**Z)

If we're doing that 4-at-a-time, then we need to calculate D**4; doing it 10-at-a-time we calculate D**10.

Is that sane ? Does it produce sane results ? (I think it should, but I don't have time to check.)
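One quick way to check D for the single-number case is to work in logarithms, since Z is far too large to write out as a sum. This is a sketch of the formula exactly as defined above (it takes the P(x) model on trust); the sub name `at_least_one_dup` and the sample sizes are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# D = 1 - (4294967295/4294967296)**Z, with Z = 0 + 1 + 2 + ... + $last,
# where $last is the index of the final P() term (1e9 in the text above).
# at_least_one_dup is a hypothetical name.
sub at_least_one_dup {
    my ($last) = @_;
    my $z        = $last * ($last + 1) / 2;                # closed form of the sum
    my $log_none = $z * log(4294967295 / 4294967296);      # log of "no dups at all"
    return 1 - exp($log_none);
}

printf "last index %g: D = %.6f\n", $_, at_least_one_dup($_)
    for 1e3, 1e4, 77163, 1e6, 1e9;
```

Under this model D passes 1/2 at around 77,000 selections and is indistinguishable from 1 long before 1e9, which is the familiar birthday-problem behaviour for a range of 2**32 values.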

10-MINUTES LATER AFTERTHOUGHT: I don't think the "D**4" and "D**10" calculations actually tell us what we want ... gotta think about it a bit more ...

Cheers,
Rob
