comment on

I'm comparing several billion long strings for uniqueness. A brute force cross comparison is out of the question. Building an in-memory hash or suffix tree is impractical. And disk-based solutions too slow.

The mechanism chosen for the task is as follows:

Calculate the 128-bit MD5 hash of each string.
Break that into 4 x 32-bit integers.
Use those four integers to set bits in each of 4 corresponding 2**32-bit vectors.

The question is: What is the probability of getting false positives?

That is, the odds of finding all 4 bits (1 in each vector) set, when the MD5 hash being tested has not previously be set into the vectors.

Ignoring the fact that any hashing mechanism that maps a large range (10e44 in this case) to a smaller smaller range (2**128 / 3e38) will produce duplicate mappings, and assuming that the MD5 hashing algorithm, and its subdivision into 4x 32-bit hashes provide perfect distribution.

When setting the first value into the vectors, the odds of a false collision are obviously 0.
When setting the second value in, the odds of all 4 sub hashes being the same, whilst the MD5 is different are again zero.
When setting the third value in, the odds that each of the 4 sub hashes will match 1 of the two sub hashes previously set into that vector are still very small, but tangible.
And as each new quad of sub hashes is added to the vectors, the odds of a false positive increase over the previous addition.

How to calculate those odds for each new hash, as the number of hashes already set into the vectors increases?

Further to that, if the number of sub hashes and vectors is increased to 10, by permuting each of the four sub hashes against the other 3 and XORing, does that substantially decrease the odds of false matches? (Or increase them? Or leave them the same?)

The mechanism is a variation on the theme of Bloom filters. The statistics of bloom filters is documented, but goes over my head.

I believe that using separate vectors for the sub hashes should improve (decrease) the odds of false positives, but my math is insufficient to verify that belief.

Can these probabilities be calculated in a way that a stats layman (me) will understand?

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

In reply to [OT] The statistics of hashing. by BrowserUk

Are you posting in the right place? Check out Where do I post X? to know for sure.
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
Want more info? How to link or How to display code and escape characters are good places to start.


Your skill will accomplish what the force of many cannot
	PerlMonks