
Re^2: [OT] The statistics of hashing.

by BrowserUk (Pope)
 on Apr 02, 2012 at 15:02 UTC ( #963033=note )

in reply to Re: [OT] The statistics of hashing.
in thread [OT] The statistics of hashing.

You say that “a disk based solution isn’t feasible,”

Using my hashing/vectors algorithm, my script processes 10,000 strings/second. To process 2**32 strings (the absolute limit of the hashes) would require 2**32 / 10,000 / 3600 ≈ 119 hours. I.e. 0.0001 s/string.

Using any sort of disk-based mechanism -- disk-based hash, b-tree, etc. -- would require at least 1 read and 1 write per string. Using the fastest (conventional) disks available, that means adding at least 0.004 seconds for the read and 0.005 seconds for the write to the processing time of every string.

So, that's 0.0001 + 0.004 + 0.005 = 0.0091 s/string; 0.0091 * 2**32 / 3600 ≈ 10,857 hrs, or 452 days. A little under 1 1/4 years.
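To make the arithmetic easy to check, here is the same back-of-envelope calculation as a Python sketch (Python rather than the original Perl, purely for illustration); the latency figures are the ones quoted above:

```python
# Back-of-envelope check of the in-memory vs. disk-based estimates above.
# Figures as quoted in the post: 10,000 strings/s in memory, plus
# ~4 ms per read and ~5 ms per write for a conventional disk.

N = 2**32                             # total strings (the limit of the hashes)

mem_per_string = 1 / 10_000           # 0.0001 s/string
mem_hours = N * mem_per_string / 3600
print(f"in-memory:  {mem_hours:,.0f} hours")

disk_per_string = mem_per_string + 0.004 + 0.005   # + 1 read + 1 write
disk_hours = N * disk_per_string / 3600
print(f"disk-based: {disk_hours:,.0f} hours = {disk_hours / 24:,.0f} days")
```

This reproduces the ~119 hours for the in-memory run and the four-hundred-odd days for the naive disk-based one.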

And the reality is that it will likely require at least 10 reads & 10 writes for every record. I'll leave that math to you.

but a fair amount of that is going to happen regardless.

When I stopped the process this morning, it had been running just under 62.5 hours.

In that time it had processed 2,375,798,283 strings and detected just 52,286 possible dups.

The cost of writing those 52,000 strings/positions to disk during the 62.5 hours of runtime is negligible; unmeasurable. Lost in the noise of processor variability due to variations in mains voltage. Zilch.

Running a second pass to verify those 52,000 possible dups as false or real will require less time as a simple hash lookup can be used, so there is no need to maintain the vectors.
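As a sketch of that second pass (in Python rather than the original Perl; the names are mine, not BrowserUk's): stream the strings again and count only the ~52,000 candidates in an ordinary hash -- no vectors needed:

```python
def verify(all_strings, candidate_strings):
    """Second pass: stream every string again, counting occurrences of
    only the candidate strings flagged by the first pass.  Any candidate
    seen more than once is a real duplicate; the rest were false positives."""
    counts = {s: 0 for s in candidate_strings}   # ~52,000 entries: tiny
    for s in all_strings:
        if s in counts:
            counts[s] += 1
    return [s for s, n in counts.items() if n > 1]

# Toy illustration: "a" and "b" are real dups, "d" was a false positive.
print(verify(["a", "b", "a", "c", "b", "d"], ["a", "b", "d"]))   # ['a', 'b']
```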

In any case, I'll trade 125 hours of runtime, for 1 1/4 years, every day of the week.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

Replies are listed 'Best First'.
Re^3: [OT] The statistics of hashing.
by sundialsvc4 (Abbot) on Apr 02, 2012 at 15:40 UTC

Obviously, you assumed that I was just making “my usual suggestion” when I am not. Your algorithm is quite plainly at least the beginnings of the only feasible solution in this case ... especially given that the actual number of duplicates is known to be negligible. (0.0022% ...) If the data volumes are in the 2-billion range and you are certain that you could only be committing a Type-2 error with no possibility of Type-1, then you have solved the problem right now. You produced a highly sensitive first-pass filter. 99.98% of the records have been excluded. That's as solved as you can get ... and what you already know about the expected distribution (that is, it is overwhelmingly likely that the records are unique) makes it so.

In effect your algorithm is to calculate a digital digest of the entire record and then to look for possible collisions among those digests. The discriminating ability of the algorithm will be that of MD5, nothing more or less. Taking the MD5 result and using the left/right-most n bits would be statistically stronger than applying a separate reducing calculation to the entire result. MD5 and its brethren scatter entropy fairly uniformly across all of the bits at once, as they are designed to do. The distribution of values ought to be very flat, and any subset of the total bits in the vector, if not combined with others, will be equally flattened and therefore equally conducive to your hash.

you assumed that I was just making “my usual suggestion”

I didn't. First, I explained the derivation of my “a disk based solution isn't feasible” conclusion.

Whilst you didn't question it directly, you certainly contrived to minimise it by implication, when you said:

but a fair amount of that is going to happen regardless.

And that was the only assertion in your whole post that was sufficiently concrete that it could be responded to directly. So I did.

There are still a few things bothering me, though ... if a single hash-table has a discriminating ability of x%, chaining multiple hash-tables together, say, is not going to be more-discriminating....

Whilst that is a legitimate concern -- moritz already expressed a similar reservation in Re: [OT] The statistics of hashing. -- I'd like to know the basis upon which you reached the following strong, definitive conclusion:

using the left/right-most n bits would be statistically stronger ...

Because, without some testable evidence -- or, at the very least, clearly expressed logical reasoning that can be reviewed -- it is nothing more than a guess.

• There is no way to utilise all 128 bits of the hash in a single vector.

A 2**128-bit vector would require tens of trillions of yottabytes of RAM, and cost a lot of money.

• The biggest single vector I could use is 8GB which would hold just 2**36 bits.

Thus I would be discarding 92 bits of information.

However, by using 4 x 2**32-bit vectors I can utilise all 128 bits. Whilst the extra discriminatory power is diluted relative to using a 2**128-bit vector, it is vastly increased over using just 2**36 bits. This is manifest.

I also alluded to my insecurities on a related subject in the OP -- that of 'stretching the entropy', by deriving 6 more hashes from the original 4 32-bit sub-hashes -- when I said: "does that substantially decrease the odds of false matches? (Or increase them? Or leave them the same?)"

But in all my testing -- which appears to be confirmed by that of roboticus -- even these artificially generated 'additional bits' have a very significant additional discriminatory effect.

On the basis of very preliminary data, the additional 6 hashes reduce the occurrence of false positives at the 1e9-insertion point from ~3,000,000 with 4 hashes to fewer than 500 with 10.
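For what it's worth, an idealised estimate -- assuming the k bit-tests were truly independent, which the six derived hashes are not -- lands in the same ballpark as both of those observed figures:

```python
import math

# After n insertions into an m-bit vector, the chance a given bit is set
# is roughly p = 1 - exp(-n/m).  A new string is flagged as a possible
# dup only if its bit is already set in all k vectors: roughly p**k,
# giving ~n * p**k false positives as a crude upper bound.

n = 1e9            # insertions
m = 2**32          # bits per vector
p = 1 - math.exp(-n / m)

for k in (4, 10):
    print(f"k={k:2d}: ~{n * p ** k:,.0f} expected false positives")
```

With these numbers it comes out on the order of 2 million for k = 4 and a couple of hundred for k = 10, i.e. the same orders of magnitude as the observed ~3,000,000 and <500.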

The upshot: the only two verifiable assertions in your combined posts are not just wrong, but 180° opposed to the truth. And the rest -- much of it couched in your particular brand of flowery prose: "However, as the algorithm begins to become saturated, the ink-blots start to run together and you're left with a black piece of paper and not a picture." -- is just a reiteration of known facts that are not in question.

Update: Actually, I did you a disservice. There is a third verifiable assertion:

"This will force you into a possibly large number of passes, that can be processed in parallel say on a cluster box."

But, two passes are all that will ever be needed. And those runs only occupy 1 of my four processors, which I carefully affinity off from the rest of my activity on the machine so nothing else I do interferes with it.

So also, completely wrong.

Do you ever do any research? Or do you just post whatever nonsense pops into your head?

