Beefy Boxes and Bandwidth Generously Provided by pair Networks
There's more than one way to do things
 
PerlMonks  

comment on

( #3333=superdoc: print w/replies, xml ) Need Help??

Wikipedia is correct (well, agrees quite closely with some of my calculations). But you have extrapolated from that quite incorrectly.

0.75**4 is the odds of, after inserting 1.1e5 strings, that there has been at least one collision in each of the four hashes. It is not even close to the odds of there being a single string that produced a simultaneous collision in all 4 hashes.

Those odds would be more like 0.75*(0.75/1.1e5)**3 or about 2e-16. My calculations came up with an upper bound of 9.5e-15.

I believe the 9.5e-15 number is much, much more accurate. It only discounts the possibility of there being fewer than 1.1e5 bits set. I suspect the ratio of "expected bits set"/1.1e5 to be very close to 1.

While the 2e-16 number is based on there being a 75% chance of exactly one collision. But there is a 25% chance of 0 collisions and a 75% chance of 1 or more collisions. And the possibility of 2 collisions approximately doubles the odds for some subset of the possible outcomes and so has a much bigger impact on the final total than does the possibility of there being 1.1e5-1 bits set instead of 1.1e5.

For example, consider tossing a coin twice. There is a 75% chance of getting at least one "head". Do 4 such trials. For each head, assign it a random position in 1.1e5 slots (but only one head per slot per trial). The odds of the 4 trials producing 4 heads in one of those slots (one from each trial) one might estimate as:

.75* (.75/x) * (.75/x) * (.75/x)

Where x==1.1e5. Or, 2.3e-16. But to find the accurate odds you have to realize that there is a 50% chance of one "head" (per trial) and a 25% change two "heads" (per trial). So the actual odds are found via:

.5 * (.5/x) * (.5/x) * (.5/x) + .5 * (.25*2/x) * (.5/x) * (.5/x) + .5 * (.5/x) * (.25*2/x) * (.5/x) + .5 * (.25*2/x) * (.25*2/x) * (.5/x) + .5 * (.5/x) * (.5/x) * (.25*2/x) + .5 * (.25*2/x) * (.5/x) * (.25*2/x) + .5 * (.5/x) * (.25*2/x) * (.25*2/x) + .5 * (.25*2/x) * (.25*2/x) * (.25*2/x) + .25* (.5/x1) * (.5/x1) * (.5/x1) + .25* (.25*x12) * (.5/x1) * (.5/x1) + .25* (.5/x1) * (.25*x12) * (.5/x1) + .25* (.25*x12) * (.25*x12) * (.5/x1) + .25* (.5/x1) * (.5/x1) * (.25*x12) + .25* (.25*x12) * (.5/x1) * (.25*x12) + .25* (.5/x1) * (.25*x12) * (.25*x12) + .25* (.25*x12) * (.25*x12) * (.25*x12)

Where x1==1.1e5-1 and x12==1/(1.1e5-1)+1/(1.1e5-2). That raises the odds by a factor of 2.37 to 5.6e-16.

That type of magnification of the odds will be much more pronounced in the case originally discussed and I bet would result in nearly the 48-fold increase required to get to the odds of my upper bound calculation.

- tye        


In reply to Re^2: The statistics of hashing. (**4) by tye
in thread [OT] The statistics of hashing. by BrowserUk

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others examining the Monastery: (2)
As of 2021-10-25 01:43 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    My first memorable Perl project was:







    Results (89 votes). Check out past polls.

    Notices?