Re^4: [OT] The statistics of hashing.

you assumed that I was just making “my usual suggestion”

I didn't. First I explained the derivation of of my “a disk based solution isn't feasible,” conclusion.

Whilst you didn't question it directly, you certainly contrived to minimise it by implication, when you said:

but a fair amount of that is going to happen regardless.

And that was the only assertion in your whole post that was sufficiently concrete that it could be responded to directly. So I did.

There are still a few things bothering me, though ... if a single hash-table has a discriminating ability of x%, chaining multiple hash-tables together, say, is not going to be more-discriminating....

Whilst that is a legitimate concern -- moritz already expressed a similar reservation in Re: [OT] The statistics of hashing.; -- I'd like to know the basis upon which you found the following strong, definitive conclusion?:

using the left/right-most n bits would be statistically stronger ...

Because, without some testable evidence -- or at the very least -- clearly expressed, logical reasoning that can be reviewed, it is nothing more than a guess.

There is no way to utilise all 2**128 bits of the hash in a single vector.
It would require a quadrillion Yotabytes of ram, and cost a lot of money.
The biggest single vector I could use is 8GB which would hold just 2**36 bits.
Thus I would be discarding 92-bits of information.

However, by using 4 x 2**32-bit vectors I can utilise all 128-bits. Whilst the extra discriminatory power is diluted relative to using a 2**128-bit vector, it is vastly increased over using just 2**36-bits. This is manifest.

I also alluded to my insecurities on a related subject in the OP -- that of 'stretching the entropy', by deriving 6 more hashes from the original 4 32-bit sub-hashes -- when I said: "does that substantially decrease the odds of false matches? (Or increase them? Or leave them the same?)"

But in all my testing -- which appears to be confirmed by those of roboticus -- even these artificially generated 'additional bits' also have a very significant additional discriminatory affect.

On the basis of very preliminary data, the additional 6 hashes result in a reduction of the occurrence of false positives at the 1e9 insertions point of, ~3,000,000 with 4 hashes, to less than 500 using 10.

The upshot: The only two verifiable assertions in both your combined posts, are not just wrong, but 180° obverse to the truth. And the rest -- much of couched in your particular brand of flowery prose: "However, as the algorithm begins to become saturated, the ink-blots start to run together and you're left with a black piece of paper and not a picture." -- is just a reiteration of known facts that are not in question.

Update: Actually, I did you a disservice. There is a third verifiable assertion:

"This will force you into a possibly large number of passes, that can be processed in parallel say on a cluster box."

But, two passes are all that will ever be needed. And those runs only occupy 1 of my four processors, which I carefully affinity off from the rest of my activity on the machine so nothing else I do interferes with it.

So also, completely wrong.

Do you ever do any research? Or just post whatever nonsense pops into your head.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

Comment on Re^4: [OT] The statistics of hashing.


Keep It Simple, Stupid
	PerlMonks