perlquestion BrowserUk <p>I'm comparing several billion long strings for uniqueness. A brute-force cross comparison is out of the question. Building an in-memory hash or suffix tree is impractical. And disk-based solutions are too slow. <p>The mechanism chosen for the task is as follows: <ol><li>Calculate the 128-bit MD5 hash of each string. </li><li>Break that into 4 x 32-bit integers. </li><li>Use those four integers to set bits in each of 4 corresponding 2**32-bit vectors. </li></ol> <p>The question is: what is the probability of getting false positives? <p>That is, the odds of finding all 4 bits (1 in each vector) set, when the MD5 hash being tested has not previously been set into the vectors. <p>Ignore the fact that any hashing mechanism that maps a large range (10e44 in this case) to a smaller range (2**128, ~3.4e38) will produce duplicate mappings, and assume that the MD5 hashing algorithm, and its subdivision into 4 x 32-bit hashes, provide a perfect distribution. <ul><li>When setting the first value into the vectors, the odds of a false collision are obviously 0. </li><li>When setting the second value in, the odds of all 4 sub hashes being the same, whilst the MD5 is different, are again zero. </li><li>When setting the third value in, the odds that each of the 4 sub hashes will match 1 of the two sub hashes previously set into that vector are still very small, but tangible. </li><li>And as each new quad of sub hashes is added to the vectors, the odds of a false positive increase over the previous addition. </li></ul> <h4>How do I calculate those odds for each new hash, as the number of hashes already set into the vectors increases?</h4> <p>Further to that, if the number of sub hashes and vectors is increased to 10, by permuting each of the four sub hashes against the other 3 and XORing, does that substantially decrease the odds of false matches? (Or increase them? Or leave them the same?)
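<p>For what it's worth, under the perfect-distribution assumption above, the per-lookup odds can be sketched numerically. With n hashes already inserted, the chance that any given bit in one 2**32-bit vector is set is 1 - (1 - 2**-32)**n, and a false positive needs the looked-up bit to be set in all 4 independent vectors. A minimal sketch (Python here just as a calculator; the closed forms are the standard Bloom-filter approximations, not something verified against the real data):

```python
# Sketch: false positive odds for 4 independent 2**32-bit vectors,
# assuming the 4 sub hashes are uniform and independent (the
# perfect-distribution assumption from the question).
import math

M = 2**32          # bits per vector
K = 4              # number of vectors / sub hashes

def p_bit_set(n):
    """Probability that a given bit in one vector is set after n insertions."""
    # 1 - (1 - 1/M)**n, computed stably via log1p/expm1
    return -math.expm1(n * math.log1p(-1.0 / M))

def p_false_positive(n):
    """Odds that a brand-new hash finds all K of its bits already set."""
    return p_bit_set(n) ** K

for n in (10**6, 10**8, 10**9, 4 * 10**9):
    print(f"n = {n:>13,}  p_fp ~ {p_false_positive(n):.3e}")
```

<p>For small n this collapses to roughly (n / 2**32)**4 per lookup, which is vanishingly small; but as n approaches 2**32 (~4.3 billion, i.e. the scale in the question) each vector is about 63% full and the per-lookup odds climb into the ~0.1 range.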
<p>The mechanism is a variation on the theme of [http://en.wikipedia.org/wiki/Bloom_filters|Bloom filters]. The statistics of Bloom filters are documented, but go over my head. <p>I believe that using separate vectors for the sub hashes should improve (decrease) the odds of false positives, but my math is insufficient to verify that belief. <p>Can these probabilities be calculated in a way that a stats layman (me) will understand? <div class="pmsig"><div class="pmsig-171588"> <hr /> <font size=1 > <div>With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'</div> <div>Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.</div> <div>"Science is about questioning the status quo. Questioning authority". </div> <div>In the absence of evidence, opinion is indistinguishable from prejudice. <p align=right>[http://www.theregister.co.uk/2011/11/29/sas_versus_world_programming/|The start of some sanity?]</p></div> </font> </div></div>
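<p>On the separate-vectors belief: the standard Bloom-filter algebra can at least be compared numerically. A classic single-array filter with the same total memory (one 2**34-bit array probed 4 times per key) has per-lookup odds of about (1 - (1 - 1/m)**(k*n))**k, while the 4-separate-vectors scheme gives (1 - (1 - 1/(m/k))**n)**k. The partitioned variant is known to be very slightly worse, not better, though the difference is negligible at these sizes. A sketch, again assuming uniform independent sub hashes:

```python
# Sketch: 4 separate 2**32-bit vectors ("partitioned") vs. one shared
# 2**34-bit Bloom filter with k = 4 probes -- same total memory.
# Assumes uniform, independent sub hashes throughout.
import math

K = 4
M_TOTAL = 2**34                 # total bits in both schemes

def fp_partitioned(n):
    """All K probes hit, each in its own (M_TOTAL / K)-bit vector."""
    m = M_TOTAL // K
    return (-math.expm1(n * math.log1p(-1.0 / m))) ** K

def fp_single(n):
    """All K probes hit in one shared M_TOTAL-bit array."""
    return (-math.expm1(K * n * math.log1p(-1.0 / M_TOTAL))) ** K

n = 4 * 10**9
print(f"partitioned: {fp_partitioned(n):.6e}")
print(f"single:      {fp_single(n):.6e}")
```

<p>Intuitively: with one shared array, a probe can land anywhere in all 2**34 bits, so the bits fill slightly more evenly than when each sub hash is confined to its own quarter of the memory.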