|There's more than one way to do things|
I'm comparing several billion long strings for uniqueness. A brute force cross comparison is out of the question. Building an in-memory hash or suffix tree is impractical. And disk-based solutions too slow.
The mechanism chosen for the task is as follows:
The question is: What is the probability of getting false positives?
That is, the odds of finding all 4 bits (1 in each vector) set, when the MD5 hash being tested has not previously be set into the vectors.
Ignoring the fact that any hashing mechanism that maps a large range (10e44 in this case) to a smaller smaller range (2**128 / 3e38) will produce duplicate mappings, and assuming that the MD5 hashing algorithm, and its subdivision into 4x 32-bit hashes provide perfect distribution.
How to calculate those odds for each new hash, as the number of hashes already set into the vectors increases?
Further to that, if the number of sub hashes and vectors is increased to 10, by permuting each of the four sub hashes against the other 3 and XORing, does that substantially decrease the odds of false matches? (Or increase them? Or leave them the same?)
The mechanism is a variation on the theme of Bloom filters. The statistics of bloom filters is documented, but goes over my head.
I believe that using separate vectors for the sub hashes should improve (decrease) the odds of false positives, but my math is insufficient to verify that belief.
Can these probabilities be calculated in a way that a stats layman (me) will understand?
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.