http://qs321.pair.com?node_id=963049


in reply to Re^2: [OT] The statistics of hashing.
in thread [OT] The statistics of hashing.

Obviously, you assumed that I was just making “my usual suggestion” when I was not.   Your algorithm is quite plainly at least the beginnings of the only feasible solution in this case ... especially given that the actual number of duplicates is known to be negligible.   (0.0022% ...)   If the data volumes are in the 2-billion range and you are certain that you could only be committing a Type-2 error with no possibility of Type-1, then you have solved the problem right now.   You have produced a highly sensitive first-pass filter.   99.98% of the records have been excluded.   That’s as solved as you can get ... and what you already know about the expected distribution (that is, it is overwhelmingly likely that the records are unique) makes it so.

In effect, your algorithm is to calculate a digital digest of the entire record and then to look for possible collisions among those digests.   The discriminating ability of the algorithm will be that of MD5, nothing more or less.   Taking the MD5 result and using the left/right-most n bits would be statistically stronger than applying a separate reducing calculation to the entire result.   MD5 and its brethren scatter entropy fairly uniformly across all of the bits at once, as they are designed to do.   The distribution of values ought to be very flat, and any subset of the total bits in the digest, if not combined with others, will be equally flat and therefore equally well suited to your hashing scheme.
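For concreteness, here is a minimal Perl sketch of what “using the left-most n bits” of the digest as an index might look like.   The record value and the 36-bit width are purely illustrative, it assumes a 64-bit perl for the 'Q>' unpack template, and whether this is actually statistically stronger than any other choice of bits is taken up in the reply below.

    use strict;
    use warnings;
    use Digest::MD5 qw( md5 );

    # Illustrative only: index straight off the top bits of the full
    # 128-bit digest rather than applying a second, reducing hash.
    # Requires a 64-bit perl for the 'Q>' (big-endian quad) template.
    my $record = join ',', 'some', 'example', 'record', 'fields';
    my $n      = 36;                             # e.g. for a 2**36-bit vector
    my( $hi )  = unpack 'Q>', md5( $record );    # left-most 64 bits of the digest
    my $index  = $hi >> ( 64 - $n );             # left-most 36 bits as the index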

Re^4: [OT] The statistics of hashing.
by BrowserUk (Patriarch) on Apr 02, 2012 at 16:46 UTC
    you assumed that I was just making “my usual suggestion”

    I didn't. First I explained the derivation of my “a disk-based solution isn't feasible” conclusion.

    Whilst you didn't question it directly, you certainly contrived to minimise it by implication, when you said:

    but a fair amount of that is going to happen regardless.

    And that was the only assertion in your whole post that was sufficiently concrete that it could be responded to directly. So I did.

    There are still a few things bothering me, though ... if a single hash-table has a discriminating ability of x%, chaining multiple hash-tables together, say, is not going to be more-discriminating....

    Whilst that is a legitimate concern -- moritz already expressed a similar reservation in Re: [OT] The statistics of hashing. -- I'd like to know the basis upon which you reached the following strong, definitive conclusion:

    using the left/right-most n bits would be statistically stronger ...

    Because, without some testable evidence -- or, at the very least, clearly expressed logical reasoning that can be reviewed -- it is nothing more than a guess.

    • There is no way to utilise all 2**128 bits of the hash in a single vector.

      It would require tens of trillions of yottabytes of RAM, and cost a lot of money.

    • The biggest single vector I could use is 8GB, which would hold just 2**36 bits.

      Thus I would be discarding 92 of the 128 bits of information.

    However, by using 4 x 2**32-bit vectors I can utilise all 128 bits. Whilst the extra discriminatory power is diluted relative to using a single 2**128-bit vector, it is vastly increased over using just 2**36 bits. This is manifest.
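    A minimal sketch of that scheme, for reference: four 2**32-bit vectors (512MB each; ~2GB in total), one per 32-bit sub-hash of the record's MD5. The subroutine name and the pre-extension of the vectors are illustrative rather than the actual code, and it assumes a 64-bit perl.

        use strict;
        use warnings;
        use Digest::MD5 qw( md5 );

        # Four 2**32-bit vectors, 512MB each; pre-extend them once so
        # vec() doesn't have to grow the strings piecemeal.
        my @vecs = ( '' ) x 4;
        vec( $_, 2**32 - 1, 1 ) = 0 for @vecs;

        # True only if *all four* sub-hash bits were already set;
        # i.e. the record is a candidate duplicate for the second pass.
        sub possibleDup {
            my( $record ) = @_;
            my @subs = unpack 'N4', md5( $record );    # 4 x 32-bit sub-hashes
            my $seen = 1;
            for my $i ( 0 .. 3 ) {
                $seen &&= vec( $vecs[ $i ], $subs[ $i ], 1 );
                vec( $vecs[ $i ], $subs[ $i ], 1 ) = 1;
            }
            return $seen;
        }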

    I also alluded to my insecurities on a related subject in the OP -- that of 'stretching the entropy', by deriving 6 more hashes from the original 4 32-bit sub-hashes -- when I said: "does that substantially decrease the odds of false matches? (Or increase them? Or leave them the same?)"

    But in all my testing -- which appears to be confirmed by that of roboticus -- even these artificially generated 'additional bits' have a very significant additional discriminatory effect.

    On the basis of very preliminary data, the additional 6 hashes reduce the occurrence of false positives at the 1e9-insertion point from ~3,000,000 with 4 hashes to fewer than 500 with 10.
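    For reference, here is a sketch of one plausible way to derive the 6 extra 32-bit values from the original 4 sub-hashes: XOR every distinct pair of them ( C(4,2) == 6 ), giving 10 indices in total. The derivation shown is an assumption -- only the counts (4 original, 6 derived) come from the discussion above.

        use strict;
        use warnings;
        use Digest::MD5 qw( md5 );

        # Assumed derivation: 'stretch' the 4 MD5 sub-hashes into 6 more
        # 32-bit values by XORing every distinct pair, giving 10 indices
        # in total -- one per 2**32-bit vector.
        my $record = join ',', 'some', 'example', 'record', 'fields';
        my @h = unpack 'N4', md5( $record );
        my @extra;
        for my $i ( 0 .. 2 ) {
            for my $j ( $i + 1 .. 3 ) {
                push @extra, $h[ $i ] ^ $h[ $j ];
            }
        }
        my @indices = ( @h, @extra );    # 10 x 32-bit indices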

    The upshot: the only two verifiable assertions in your combined posts are not just wrong, but 180° opposed to the truth. And the rest -- much of it couched in your particular brand of flowery prose: "However, as the algorithm begins to become saturated, the ink-blots start to run together and you're left with a black piece of paper and not a picture." -- is just a reiteration of known facts that are not in question.

    Update: Actually, I did you a disservice. There is a third verifiable assertion:

    "This will force you into a possibly large number of passes, that can be processed in parallel say on a cluster box."

    But two passes are all that will ever be needed (a sketch of the second pass follows below). And those runs occupy only one of my four processors, which I carefully partition off -- via processor affinity -- from the rest of my activity on the machine, so nothing else I do interferes with them.
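    An assumed sketch of that second pass (not the actual code): only the handful of records flagged by the bit-vector filter in the first pass need exact confirmation, so an ordinary Perl hash of just those records suffices. The file name is hypothetical.

        use strict;
        use warnings;

        # Assumed second pass: 'candidates.dat' is a hypothetical file
        # holding only the records flagged by the first-pass filter.
        # With so few candidates, an exact in-memory check is trivial.
        my %seen;
        open my $fh, '<', 'candidates.dat' or die "candidates.dat: $!";
        while ( my $rec = <$fh> ) {
            chomp $rec;
            print "true duplicate: $rec\n" if $seen{ $rec }++;
        }
        close $fh;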

    So that, too, is completely wrong.

    Do you ever do any research? Or do you just post whatever nonsense pops into your head?


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?