in reply to Re: [OT] The statistics of hashing. in thread [OT] The statistics of hashing.
IIUC it's just (N/4294967296)**4
I believe the figure I should have provided is:
(1  ((4294967295/4294967296) ** N)) ** 4
And if that's not right, I give up.
Update: The logic is simple. (Danger, Will Robinson ;) If you're picking numbers at random in the range (1 .. 4294967296), then the probability that the N+1th pick has already come up is:
1 minus the probability that it has *not* already come up and the probability that it has *not* already come up is (4294967295/4294967296) ** N
So that gives us the
1  ((4294967295/4294967296) ** N)
But because we're looking for the case where all *4* numbers have already come up, that probability needs to be raised to the 4th power ... which yields the final figure.
Cheers, Rob
Re^3: [OT] The statistics of hashing.
by BrowserUk (Pope) on Apr 01, 2012 at 09:04 UTC

Thanks syphilis. Your calculations make sense to me. But I'm not sure that it gels with the actual data?
Assuming I've coded your formula correctly (maybe not!), then using 10 hashes & vectors, I get the odds of having seen a dup after 1e9 inserts as (1  ((4294967295/4294967296)**1e9) ) **10 := 0.00000014949378123.
By that point I had actually seen 13 collisions:
printf "After %5.5s inserts the odds of no dups: %.17f\n",
$_, ( 1  ( 0.99999999976716935634613037109375 ** $_ ) ) )**10
for map "${_}e8", 1..40;;
After 1e8 inserts the odds of no dups: 0.00000000000000004
After 2e8 inserts the odds of no dups: 0.00000000000003802
After 3e8 inserts the odds of no dups: 0.00000000000195354
After 4e8 inserts the odds of no dups: 0.00000000003092706
After 5e8 inserts the odds of no dups: 0.00000000025689926
After 6e8 inserts the odds of no dups: 0.00000000141936970
After 7e8 inserts the odds of no dups: 0.00000000591942653
After 8e8 inserts the odds of no dups: 0.00000002009607516
After 9e8 inserts the odds of no dups: 0.00000005831019113
After 10e8 inserts the odds of no dups: 0.00000014949378123 **
After 11e8 inserts the odds of no dups: 0.00000034677629037
After 12e8 inserts the odds of no dups: 0.00000074067890139
After 13e8 inserts the odds of no dups: 0.00000147618719432
After 14e8 inserts the odds of no dups: 0.00000277379244752
After 15e8 inserts the odds of no dups: 0.00000495435410008
After 16e8 inserts the odds of no dups: 0.00000846740103378
After 17e8 inserts the odds of no dups: 0.00001392227490012
After 18e8 inserts the odds of no dups: 0.00002212133818548
After 19e8 inserts the odds of no dups: 0.00003409433217130
After 20e8 inserts the odds of no dups: 0.00005113288027133
After 21e8 inserts the odds of no dups: 0.00007482409152212
After 22e8 inserts the odds of no dups: 0.00010708222525783
After 23e8 inserts the odds of no dups: 0.00015017742682200
After 24e8 inserts the odds of no dups: 0.00020676062948530
After 25e8 inserts the odds of no dups: 0.00027988383247111
After 26e8 inserts the odds of no dups: 0.00037301510162471
After 27e8 inserts the odds of no dups: 0.00049004779031911
After 28e8 inserts the odds of no dups: 0.00063530363659898
After 29e8 inserts the odds of no dups: 0.00081352955192231
After 30e8 inserts the odds of no dups: 0.00102988807161069
After 31e8 inserts the odds of no dups: 0.00128994158265037
After 32e8 inserts the odds of no dups: 0.00159963057715543
After 33e8 inserts the odds of no dups: 0.00196524629692540
After 34e8 inserts the odds of no dups: 0.00239339823431121
After 35e8 inserts the odds of no dups: 0.00289097703606607
After 36e8 inserts the odds of no dups: 0.00346511341973626
After 37e8 inserts the odds of no dups: 0.00412313375677405
After 38e8 inserts the odds of no dups: 0.00487251300375936
After 39e8 inserts the odds of no dups: 0.00572082567410912
After 40e8 inserts the odds of no dups: 0.00667569553892502
And looking at the figure for 4e9 := 0.00667569553892502, by which time the 10 vectors will be almost fully populated, it looks way too low to me?
I would have expected that calculation (for N=4e9) to have yielded odds of almost 1?
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks  Silence betokens consent  Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
 [reply] [d/l] 

I get the odds of having seen a dup after 1e9 inserts as (1  ((4294967295/4294967296)**1e9) ) **10 := 0.00000014949378123
That's not the probability of "having seen a dup", but the probability that the 1000000001st random selection of 10 numbers would be reported as a dup (ie the probability that each of the relevant bits in all 10 bit vectors was already set for that 1000000001st random selection of the 10 numbers).
If I get a chance I'll try to work out the probability of "having seen a dup" in the first 1e9 iterations. (But, judging by some of the figures being bandied about, it probably has little bearing on this actual case where we're looking at MD5 hashes instead of random selections.)
Cheers, Rob
 [reply] 

If I get a chance I'll try to work out the probability of "having seen a dup" in the first 1e9 iterations.
Thank you if you do find that time at some point. If you could also show your workings, I might be able to wrap my muddled brain around it and stand a chance of reapplying your derivation.
it probably has little bearing on this actual case where we're looking at MD5 hashes instead of random selections.
Whilst the MD5 hash is known to be imperfect, it has been well analysed and has been demonstrated to produce a close to perfect random distribution of bits from the input data by several practical measures.
Eg. If you take any single input text, and produce its MD5; and then vary a single bit in the input and produce a new MD5, then  on average  half of the bits in the new MD5 will have changed relative to the original.
And if you repeat that process  varying a single bit in the input and then compare the original and new MD5s  the average number of bits changed in the outputs will tend towards 1/2. That is about as good a measure of randomness as you can hope for from a deterministic process.
I am aware of the limitations on the distribution of the hashes when derived from a nonfull spectrum of inputs; but given that 2**32 (the maximum capacity of the vectors), represent such a minuscule proportion of the 1e44 possible inputs, I'd have to be extremely unlucky in my random selection from the total inputs for the hashing bias to actually have a measurable affect upon the probabilities of false positives.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks  Silence betokens consent  Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
 [reply] 


