Beefy Boxes and Bandwidth Generously Provided by pair Networks
more useful options

comment on

( #3333=superdoc: print w/replies, xml ) Need Help??

Although this sounds similar to the birthday problem, it is mostly much simpler (and partly much harder).

The odds of a collision at each step are quite simply $setBits/2**32. So the odds of having 4 collisions in a row are prod(@setBits)/2**(4*32). So, if after inserting 100 items, your 2**32-bit vectors have, respectively, 99, 100, 100, and 98 bits set, then the odds of a 4-fold collision is 99*100*100*98/2**(32*4) or about 1 in 1e31.

Now, the odds of there being 100 bits set after setting 100 of the bits, that is the same as the birthday problem. So, if you wanted to predict the likely odds of a collision after inserting, say, 1e4 strings, then you would first compute the odds of the full 1e4 different bits being set. The birthday problem tells us that the odds are about 98.84% that all 1e4 bits would be set in a given vector. Or that there is a 1.16% chance of at least one collision (in a given vector).

Bits Collision set odds (2**32 bits) --- --------- 1e3 0.01% 1e4 1.16% 1e5 68.8% 1e6 100% (to about 50 digits of accuracy)

As calculated using Math::BigApprox:

use Math::BigApprox 'c'; my $bits = 2**32; my $inserts = 1e5; print 1-(c($bits-$inserts)^c($bits-1))/c($bits)**$inserts, $/;

So, when you insert 1e6 strings, you are not going to end up with 1e6 bits set. But getting a precise value for "expected number of bits set" is rather complex. But I wouldn't bother. I'd just count the number of bits that are actually set as you go along.

Then, when one goes to insert the next string, just calculate $b1*$b2*$b3*$b4/2**(32*4) in order to get the odds of a false positive on that insertion.

But inserting about 1e6 1e9 strings into 2**32-bit vectors means that about 23% of the bits are going to be set. So, your 1e6th 1e9th insert is going to have about a 0.28% chance of a false positive.

If you want to calculate the odds of there having been at least one false positive across the entire sequence of 1e6 inserts, then you'd compute 1-prod(@odds) where @odds contains the odds of no false positive at each step. So $odds[$i] would be 1 - $b1*$b2*$b3*$b4/2**(32*4). So you can calculate such that way, by just keeping a running product of that value.

Note, however, that prod(@odds) will stay exactly 1.0 when using floating point until you get to the point that $b1*$b2*$b3*$b4/2**(32*4) creeps above about 1/2**$mantissaBits, (which won't happen until about 1/2 million inserts for typical 'double'). I'm not sure how much of an impact that will have on the accuracy when getting to the only interesting odds in the late stages. But it seems it might be quite significant.

But you can compensate for this by instead tracking 1-prod(@odds). That starts out simply as 1*1*1*1/2**(32*4) after the first insert. To effectively calculate 1-prod(@odds[0..$n]) when given 1-prod(@odds[0..$n-1]) and $odds[$n] isn't hard. It is just $prod += $odds-$prod*$odds (and that last term doesn't matter for a very long time). Just to be clear, $odds is just $b1*$b2*$b3*$b4/2**(32*4) where $b1 is the number of bits set in the first vector.

So, doing that calculation as you go will give you the odds of having had at least one false positive so far. After 1e5 inserts it will only be about 6e-15, depending on how many bit collisions you run into by then.

You can also get an upper bound on it for however many strings you have by just running through the 1e9+ calculations assuming no single-bit collisions ever happen. I won't wait for that to finish running here.

But it turns out that I don't have to:

Odds Inserts ---- ------- 1% ~28e6 10% ~45e6 50% ~65e6 90% ~83e6 99% ~96e6

- tye        

In reply to Re: [OT] The statistics of hashing. (birthday) by tye
in thread [OT] The statistics of hashing. by BrowserUk

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or or How to display code and escape characters are good places to start.
Log In?

What's my password?
Create A New User
Domain Nodelet?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others musing on the Monastery: (7)
As of 2022-01-19 13:03 GMT
Find Nodes?
    Voting Booth?
    In 2022, my preferred method to securely store passwords is:

    Results (55 votes). Check out past polls.