BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:
I'm comparing several billion long strings for uniqueness. A brute force cross comparison is out of the question. Building an in-memory hash or suffix tree is impractical. And disk-based solutions too slow.
The mechanism chosen for the task is as follows:
- Calculate the 128-bit MD5 hash of each string.
- Break that into 4 x 32-bit integers.
- Use those four integers to set bits in each of 4 corresponding 2**32-bit vectors.
The question is: What is the probability of getting false positives?
That is, the odds of finding all 4 bits (1 in each vector) set, when the MD5 hash being tested has not previously be set into the vectors.
Ignoring the fact that any hashing mechanism that maps a large range (10e44 in this case) to a smaller smaller range (2**128 / 3e38) will produce duplicate mappings, and assuming that the MD5 hashing algorithm, and its subdivision into 4x 32-bit hashes provide perfect distribution.
- When setting the first value into the vectors, the odds of a false collision are obviously 0.
- When setting the second value in, the odds of all 4 sub hashes being the same, whilst the MD5 is different are again zero.
- When setting the third value in, the odds that each of the 4 sub hashes will match 1 of the two sub hashes previously set into that vector are still very small, but tangible.
- And as each new quad of sub hashes is added to the vectors, the odds of a false positive increase over the previous addition.
How to calculate those odds for each new hash, as the number of hashes already set into the vectors increases?
Further to that, if the number of sub hashes and vectors is increased to 10, by permuting each of the four sub hashes against the other 3 and XORing, does that substantially decrease the odds of false matches? (Or increase them? Or leave them the same?)
The mechanism is a variation on the theme of Bloom filters. The statistics of bloom filters is documented, but goes over my head.
I believe that using separate vectors for the sub hashes should improve (decrease) the odds of false positives, but my math is insufficient to verify that belief.
Can these probabilities be calculated in a way that a stats layman (me) will understand?