in reply to [OT] The statistics of hashing.

Although this sounds similar to the birthday problem, it is mostly much simpler (and partly much harder).

The odds of a collision at each step are quite simply $setBits/2**32. So the odds of having 4 collisions in a row are prod(@setBits)/2**(4*32). So, if after inserting 100 items, your 2**32-bit vectors have, respectively, 99, 100, 100, and 98 bits set, then the odds of a 4-fold collision is 99*100*100*98/2**(32*4) or about 1 in 1e31.

Now, the odds of there being 100 bits set after setting 100 of the bits, that is the same as the birthday problem. So, if you wanted to predict the likely odds of a collision after inserting, say, 1e4 strings, then you would first compute the odds of the full 1e4 different bits being set. The birthday problem tells us that the odds are about 98.84% that all 1e4 bits would be set in a given vector. Or that there is a 1.16% chance of at least one collision (in a given vector).

Bits Collision set odds (2**32 bits) --- --------- 1e3 0.01% 1e4 1.16% 1e5 68.8% 1e6 100% (to about 50 digits of accuracy)

As calculated using Math::BigApprox:

use Math::BigApprox 'c'; my $bits = 2**32; my $inserts = 1e5; print 1-(c($bits-$inserts)^c($bits-1))/c($bits)**$inserts, $/;

So, when you insert 1e6 strings, you are not going to end up with 1e6 bits set. But getting a precise value for "expected number of bits set" is rather complex. But I wouldn't bother. I'd just count the number of bits that are actually set as you go along.

Then, when one goes to insert the next string, just calculate $b1*$b2*$b3*$b4/2**(32*4) in order to get the odds of a false positive on that insertion.

But inserting about 1e6 1e9 strings into 2**32-bit vectors means that about 23% of the bits are going to be set. So, your 1e6th 1e9th insert is going to have about a 0.28% chance of a false positive.

If you want to calculate the odds of there having been at least one false positive across the entire sequence of 1e6 inserts, then you'd compute 1-prod(@odds) where @odds contains the odds of no false positive at each step. So $odds[$i] would be 1 - $b1*$b2*$b3*$b4/2**(32*4). So you can calculate such that way, by just keeping a running product of that value.

Note, however, that prod(@odds) will stay exactly 1.0 when using floating point until you get to the point that $b1*$b2*$b3*$b4/2**(32*4) creeps above about 1/2**$mantissaBits, (which won't happen until about 1/2 million inserts for typical 'double'). I'm not sure how much of an impact that will have on the accuracy when getting to the only interesting odds in the late stages. But it seems it might be quite significant.

But you can compensate for this by instead tracking 1-prod(@odds). That starts out simply as 1*1*1*1/2**(32*4) after the first insert. To effectively calculate 1-prod(@odds[0..$n]) when given 1-prod(@odds[0..$n-1]) and $odds[$n] isn't hard. It is just $prod += $odds-$prod*$odds (and that last term doesn't matter for a very long time). Just to be clear, $odds is just $b1*$b2*$b3*$b4/2**(32*4) where $b1 is the number of bits set in the first vector.

So, doing that calculation as you go will give you the odds of having had at least one false positive so far. After 1e5 inserts it will only be about 6e-15, depending on how many bit collisions you run into by then.

You can also get an upper bound on it for however many strings you have by just running through the 1e9+ calculations assuming no single-bit collisions ever happen. I won't wait for that to finish running here.

But it turns out that I don't have to:

Odds Inserts ---- ------- 1% ~28e6 10% ~45e6 50% ~65e6 90% ~83e6 99% ~96e6

- tye