There's more than one way to do things PerlMonks

### comment on

 Need Help??

I'm comparing several billion long strings for uniqueness. A brute force cross comparison is out of the question. Building an in-memory hash or suffix tree is impractical. And disk-based solutions too slow.

The mechanism chosen for the task is as follows:

1. Calculate the 128-bit MD5 hash of each string.
2. Break that into 4 x 32-bit integers.
3. Use those four integers to set bits in each of 4 corresponding 2**32-bit vectors.

The question is: What is the probability of getting false positives?

That is, the odds of finding all 4 bits (1 in each vector) set, when the MD5 hash being tested has not previously be set into the vectors.

Ignoring the fact that any hashing mechanism that maps a large range (10e44 in this case) to a smaller smaller range (2**128 / 3e38) will produce duplicate mappings, and assuming that the MD5 hashing algorithm, and its subdivision into 4x 32-bit hashes provide perfect distribution.

• When setting the first value into the vectors, the odds of a false collision are obviously 0.
• When setting the second value in, the odds of all 4 sub hashes being the same, whilst the MD5 is different are again zero.
• When setting the third value in, the odds that each of the 4 sub hashes will match 1 of the two sub hashes previously set into that vector are still very small, but tangible.
• And as each new quad of sub hashes is added to the vectors, the odds of a false positive increase over the previous addition.

#### How to calculate those odds for each new hash, as the number of hashes already set into the vectors increases?

Further to that, if the number of sub hashes and vectors is increased to 10, by permuting each of the four sub hashes against the other 3 and XORing, does that substantially decrease the odds of false matches? (Or increase them? Or leave them the same?)

The mechanism is a variation on the theme of Bloom filters. The statistics of bloom filters is documented, but goes over my head.

I believe that using separate vectors for the sub hashes should improve (decrease) the odds of false positives, but my math is insufficient to verify that belief.

Can these probabilities be calculated in a way that a stats layman (me) will understand?

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

In reply to [OT] The statistics of hashing. by BrowserUk

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

• Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
• Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
• Read Where should I post X? if you're not absolutely sure you're posting in the right place.
• Posts may use any of the Perl Monks Approved HTML tags:
a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
• You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
 For: Use: & & < < > > [ [ ] ]
• Link using PerlMonks shortcuts! What shortcuts can I use for linking?

Create A New User
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others meditating upon the Monastery: (4)
As of 2020-08-14 14:35 GMT
Sections?
Information?
Find Nodes?
Leftovers?
Voting Booth?
Which rocket would you take to Mars?

Results (75 votes). Check out past polls.

Notices?