Beefy Boxes and Bandwidth Generously Provided by pair Networks
Syntactic Confectionery Delight
 
PerlMonks  

Re: [OT] The statistics of hashing.

by syphilis (Bishop)
on Mar 31, 2012 at 23:13 UTC ( #962806=note: print w/replies, xml ) Need Help??


in reply to [OT] The statistics of hashing.

How to calculate those odds for each new hash, as the number of hashes already set into the vectors increases?

Update: Following figure is not right. (Corrected figure provided in following post.)

IIUC it's just (N/4294967296)**4, where N is the number of MD5 hashes that have already been entered into the bit vector.
But that's assuming that MD5 hashes distribute evenly, and I don't know if that has been established (or disproved). If they don't distribute evenly, then the odds of hitting a false positive will increase.

Not sure what affect your "Bloom Filters" variation would have.

Cheers,
Rob

Replies are listed 'Best First'.
Re^2: [OT] The statistics of hashing.
by syphilis (Bishop) on Mar 31, 2012 at 23:55 UTC
    IIUC it's just (N/4294967296)**4

    I believe the figure I should have provided is:
    (1 - ((4294967295/4294967296) ** N)) ** 4

    And if that's not right, I give up.

    Update: The logic is simple. (Danger, Will Robinson ;-)
    If you're picking numbers at random in the range (1 .. 4294967296), then the probability that the N+1th pick has already come up is:
    1 minus the probability that it has *not* already come up
    and the probability that it has *not* already come up is (4294967295/4294967296) ** N

    So that gives us the
    1 - ((4294967295/4294967296) ** N)

    But because we're looking for the case where all *4* numbers have already come up, that probability needs to be raised to the 4th power ... which yields the final figure.

    Cheers,
    Rob

      Thanks syphilis. Your calculations make sense to me. But I'm not sure that it gels with the actual data?

      Assuming I've coded your formula correctly (maybe not!), then using 10 hashes & vectors, I get the odds of having seen a dup after 1e9 inserts as (1 - ((4294967295/4294967296)**1e9) ) **10 := 0.00000014949378123.

      By that point I had actually seen 13 collisions:

      printf "After %5.5s inserts the odds of no dups: %.17f\n", $_, ( 1 - ( 0.99999999976716935634613037109375 ** $_ ) ) )**10 for map "${_}e8", 1..40;; After 1e8 inserts the odds of no dups: 0.00000000000000004 After 2e8 inserts the odds of no dups: 0.00000000000003802 After 3e8 inserts the odds of no dups: 0.00000000000195354 After 4e8 inserts the odds of no dups: 0.00000000003092706 After 5e8 inserts the odds of no dups: 0.00000000025689926 After 6e8 inserts the odds of no dups: 0.00000000141936970 After 7e8 inserts the odds of no dups: 0.00000000591942653 After 8e8 inserts the odds of no dups: 0.00000002009607516 After 9e8 inserts the odds of no dups: 0.00000005831019113 After 10e8 inserts the odds of no dups: 0.00000014949378123 ** After 11e8 inserts the odds of no dups: 0.00000034677629037 After 12e8 inserts the odds of no dups: 0.00000074067890139 After 13e8 inserts the odds of no dups: 0.00000147618719432 After 14e8 inserts the odds of no dups: 0.00000277379244752 After 15e8 inserts the odds of no dups: 0.00000495435410008 After 16e8 inserts the odds of no dups: 0.00000846740103378 After 17e8 inserts the odds of no dups: 0.00001392227490012 After 18e8 inserts the odds of no dups: 0.00002212133818548 After 19e8 inserts the odds of no dups: 0.00003409433217130 After 20e8 inserts the odds of no dups: 0.00005113288027133 After 21e8 inserts the odds of no dups: 0.00007482409152212 After 22e8 inserts the odds of no dups: 0.00010708222525783 After 23e8 inserts the odds of no dups: 0.00015017742682200 After 24e8 inserts the odds of no dups: 0.00020676062948530 After 25e8 inserts the odds of no dups: 0.00027988383247111 After 26e8 inserts the odds of no dups: 0.00037301510162471 After 27e8 inserts the odds of no dups: 0.00049004779031911 After 28e8 inserts the odds of no dups: 0.00063530363659898 After 29e8 inserts the odds of no dups: 0.00081352955192231 After 30e8 inserts the odds of no dups: 0.00102988807161069 After 31e8 inserts the odds of no dups: 0.00128994158265037 After 32e8 inserts the odds of no dups: 0.00159963057715543 After 33e8 inserts the odds of no dups: 0.00196524629692540 After 34e8 inserts the odds of no dups: 0.00239339823431121 After 35e8 inserts the odds of no dups: 0.00289097703606607 After 36e8 inserts the odds of no dups: 0.00346511341973626 After 37e8 inserts the odds of no dups: 0.00412313375677405 After 38e8 inserts the odds of no dups: 0.00487251300375936 After 39e8 inserts the odds of no dups: 0.00572082567410912 After 40e8 inserts the odds of no dups: 0.00667569553892502

      And looking at the figure for 4e9 := 0.00667569553892502, by which time the 10 vectors will be almost fully populated, it looks way too low to me?

      I would have expected that calculation (for N=4e9) to have yielded odds of almost 1?


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

      The start of some sanity?

        I get the odds of having seen a dup after 1e9 inserts as (1 - ((4294967295/4294967296)**1e9) ) **10 := 0.00000014949378123

        That's not the probability of "having seen a dup", but the probability that the 1000000001st random selection of 10 numbers would be reported as a dup (ie the probability that each of the relevant bits in all 10 bit vectors was already set for that 1000000001st random selection of the 10 numbers).

        If I get a chance I'll try to work out the probability of "having seen a dup" in the first 1e9 iterations. (But, judging by some of the figures being bandied about, it probably has little bearing on this actual case where we're looking at MD5 hashes instead of random selections.)

        Cheers,
        Rob

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://962806]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others making s'mores by the fire in the courtyard of the Monastery: (5)
As of 2020-09-25 17:11 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    If at first I donít succeed, I Ö










    Results (138 votes). Check out past polls.

    Notices?