PerlMonks
I'm comparing several billion long strings for uniqueness. A brute-force cross-comparison is out of the question, building an in-memory hash or suffix tree is impractical, and disk-based solutions are too slow. The mechanism chosen for the task is as follows: take the MD5 hash of each string, split the 128-bit digest into 4x 32-bit subhashes, and use each subhash as a bit index into its own 2**32-bit vector. If all 4 bits (1 in each vector) are already set, the string is flagged as a (probable) duplicate; otherwise the bits are set.
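A scaled-down sketch of that mechanism in Python (not the author's code; the real vectors would be 2**32 bits each, here shrunk to 2**20 so the example runs quickly):

```python
import hashlib

VECTOR_BITS = 2 ** 20   # scaled down from 2**32 for illustration
NUM_VECTORS = 4

# One bit-vector per 32-bit subhash, stored as a bytearray.
vectors = [bytearray(VECTOR_BITS // 8) for _ in range(NUM_VECTORS)]

def subhashes(s):
    """Split the 128-bit MD5 of s into four 32-bit subhashes,
    each reduced to a bit index within its vector."""
    digest = hashlib.md5(s.encode()).digest()
    return [int.from_bytes(digest[i * 4:(i + 1) * 4], 'big') % VECTOR_BITS
            for i in range(NUM_VECTORS)]

def seen_before(s):
    """Return True if s was (probably) seen before; set its bits either way."""
    hit = True
    for vec, h in zip(vectors, subhashes(s)):
        byte, bit = divmod(h, 8)
        if not (vec[byte] >> bit) & 1:
            hit = False
        vec[byte] |= 1 << bit
    return hit
```

The first call on a string returns False and sets its 4 bits; a repeat of the same string finds all 4 bits set and returns True.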
The question is: what is the probability of getting false positives? That is, the odds of finding all 4 bits (1 in each vector) already set when the MD5 hash being tested has not previously been set into the vectors. Ignore the fact that any hashing mechanism mapping a large range (10e44 in this case) to a smaller range (2**128, roughly 3.4e38) will produce duplicate mappings, and assume that the MD5 hashing algorithm, and its subdivision into 4x 32-bit subhashes, provides a perfect distribution.
How do I calculate those odds for each new hash, as the number of hashes already set into the vectors increases?

Further to that: if the number of subhashes and vectors is increased to 10, by permuting each of the four subhashes against the other 3 and XORing, does that substantially decrease the odds of false matches? (Or increase them? Or leave them the same?)

The mechanism is a variation on the theme of Bloom filters. The statistics of Bloom filters are documented, but go over my head. I believe that using separate vectors for the subhashes should improve (decrease) the odds of false positives, but my math is insufficient to verify that belief. Can these probabilities be calculated in a way that a stats layman (me) will understand?

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
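For what it's worth, the standard Bloom-filter estimate applies almost directly here. Assuming perfectly uniform subhashes: after n insertions, any given bit of one m-bit vector is set with probability 1 - (1 - 1/m)^n, and a false positive requires a set bit in all k independent vectors, so p ≈ (1 - (1 - 1/m)^n)^k. A sketch of that closed-form estimate (function name and parameters are mine, not from the post):

```python
import math

def false_positive_rate(n, m=2 ** 32, k=4):
    """Estimated false-positive probability after n insertions,
    using k independent m-bit vectors (one subhash per vector).
    Each bit is set with probability 1 - (1 - 1/m)^n; a false
    positive needs a set bit in all k vectors simultaneously."""
    p_bit_set = 1.0 - math.exp(n * math.log1p(-1.0 / m))
    return p_bit_set ** k

# Odds after a billion insertions into 2**32-bit vectors:
p4 = false_positive_rate(1_000_000_000, k=4)    # roughly 0.002
p10 = false_positive_rate(1_000_000_000, k=10)  # far smaller still
```

With n = 1e9 and m = 2**32, each vector is about 21% full, so k=4 gives a false-positive rate on the order of 0.2%, and raising k to 10 drives the estimate down by several more orders of magnitude. One caveat: the 10 XOR-permuted subhashes are derived from the original 4 and are not statistically independent, so the k=10 figure is an optimistic bound rather than an exact answer.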
Examine what is said, not who speaks  Silence betokens consent  Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
In reply to [OT] The statistics of hashing. by BrowserUk

