Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl: the Markov chain saw

comment on

( #3333=superdoc: print w/replies, xml ) Need Help??

Although this sounds similar to the birthday problem, it is mostly much simpler (and partly much harder).

The odds of a collision at each step are quite simply $setBits/2**32. So the odds of having 4 collisions in a row are prod(@setBits)/2**(4*32). So, if after inserting 100 items, your 2**32-bit vectors have, respectively, 99, 100, 100, and 98 bits set, then the odds of a 4-fold collision is 99*100*100*98/2**(32*4) or about 1 in 1e31.

Now, the odds of there being 100 bits set after setting 100 of the bits, that is the same as the birthday problem. So, if you wanted to predict the likely odds of a collision after inserting, say, 1e4 strings, then you would first compute the odds of the full 1e4 different bits being set. The birthday problem tells us that the odds are about 98.84% that all 1e4 bits would be set in a given vector. Or that there is a 1.16% chance of at least one collision (in a given vector).

Bits Collision set odds (2**32 bits) --- --------- 1e3 0.01% 1e4 1.16% 1e5 68.8% 1e6 100% (to about 50 digits of accuracy)

As calculated using Math::BigApprox:

use Math::BigApprox 'c'; my $bits = 2**32; my $inserts = 1e5; print 1-(c($bits-$inserts)^c($bits-1))/c($bits)**$inserts, $/;

So, when you insert 1e6 strings, you are not going to end up with 1e6 bits set. But getting a precise value for "expected number of bits set" is rather complex. But I wouldn't bother. I'd just count the number of bits that are actually set as you go along.

Then, when one goes to insert the next string, just calculate $b1*$b2*$b3*$b4/2**(32*4) in order to get the odds of a false positive on that insertion.

But inserting about 1e6 1e9 strings into 2**32-bit vectors means that about 23% of the bits are going to be set. So, your 1e6th 1e9th insert is going to have about a 0.28% chance of a false positive.

If you want to calculate the odds of there having been at least one false positive across the entire sequence of 1e6 inserts, then you'd compute 1-prod(@odds) where @odds contains the odds of no false positive at each step. So $odds[$i] would be 1 - $b1*$b2*$b3*$b4/2**(32*4). So you can calculate such that way, by just keeping a running product of that value.

Note, however, that prod(@odds) will stay exactly 1.0 when using floating point until you get to the point that $b1*$b2*$b3*$b4/2**(32*4) creeps above about 1/2**$mantissaBits, (which won't happen until about 1/2 million inserts for typical 'double'). I'm not sure how much of an impact that will have on the accuracy when getting to the only interesting odds in the late stages. But it seems it might be quite significant.

But you can compensate for this by instead tracking 1-prod(@odds). That starts out simply as 1*1*1*1/2**(32*4) after the first insert. To effectively calculate 1-prod(@odds[0..$n]) when given 1-prod(@odds[0..$n-1]) and $odds[$n] isn't hard. It is just $prod += $odds-$prod*$odds (and that last term doesn't matter for a very long time). Just to be clear, $odds is just $b1*$b2*$b3*$b4/2**(32*4) where $b1 is the number of bits set in the first vector.

So, doing that calculation as you go will give you the odds of having had at least one false positive so far. After 1e5 inserts it will only be about 6e-15, depending on how many bit collisions you run into by then.

You can also get an upper bound on it for however many strings you have by just running through the 1e9+ calculations assuming no single-bit collisions ever happen. I won't wait for that to finish running here.

But it turns out that I don't have to:

Odds Inserts ---- ------- 1% ~28e6 10% ~45e6 50% ~65e6 90% ~83e6 99% ~96e6

- tye        

In reply to Re: [OT] The statistics of hashing. (birthday) by tye
in thread [OT] The statistics of hashing. by BrowserUk

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?

    What's my password?
    Create A New User
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others musing on the Monastery: (3)
    As of 2020-12-03 06:48 GMT
    Find Nodes?
      Voting Booth?
      How often do you use taint mode?

      Results (51 votes). Check out past polls.