in reply to Re^3: [OT] The statistics of hashing.
in thread [OT] The statistics of hashing.
In the iterative solution, we're accumulating f(x)=(1-e^(-x/N))^h over the range of x=0 .. NumSamples. That's a rough form of computing the definite integral of the expression. Integrating over three variables (x, N, h) would be a pain, so I treated N and h as constants.
So first, we multiply out our f(x) expression to remove the exponent. So using h as 2, we get:
f(x) = 1 - 2e^(-x/N) + e^(-2x/N) integ(f(x)) = integ( 1 - 2e^(-x/N) + e^(-2x/N) ) = integ( 1 ) - 2*(integ(e^(-x/N))) + integ(e^(-2x/N)) since: integ(1) = x + C integ(e^(bx)) = (1/b)e^(bx) + C integ(f(x)) = [ x+C ] + [ (-2N)e^(-x/N) + C ] + [ (-N/2)e^(-2x/N) + C +]
Computing a definite integral over a range is simply integ(f(x)) evaluated at the upper limit less the value evaluated at the lower limit. This causes the C terms to cancel.
Pascal's triangle comes out because we've got (a+b)^n, and when we multiply it out, we get the binomial expansion which is where the coefficients come into play.
One point I should mention: You don't have to use 0 as the lower bound. If you wanted the number of collisions you'd experience from sample A to sample B, just evaluate integ(f(B))-integ(f(A)). By using A=0 we compute the number of collisions for the entire run.
...roboticus
When your only tool is a hammer, all problems look like your thumb.
|
---|
Replies are listed 'Best First'. | |
---|---|
Re^5: [OT] The statistics of hashing.
by BrowserUk (Patriarch) on Apr 03, 2012 at 15:01 UTC | |
Thanks. As may be (becoming) obvious, much of that is over my head for now :) However, Now I know how to derive the constants, I have this which I can substitute for your ex_4() & ex_10() by supplying the power as the first argument. Its output matches those two exactly for all the argument combinations I've tried:
That allowed me to investigate the affect of using fewer or more hashes without having to hand code a new function for each. And the results are quite interesting. This shows that each new (pair?) of hashes added does increase the discrimination substantially, though the gains obviously fall off fairly rapidly. But the really interesting part are the numbers for odd numbers of hashes:
I suspect it is a coding error on my behalf, but I guess it could be a quirk of the numbers? Here's a set using 1 .. 16 2**16 bit hashes:
Sorry for the wrapping. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] [select] |
by tye (Sage) on Apr 03, 2012 at 15:43 UTC | |
I suspect you need to start $toggle reversed for odd values (or for even values, though you note that you verified against the two even cases). Though, I would have expected that particular mistake to lead to negative values (but you don't show most of the code so I didn't go much further in looking). The numbers are supposed to be f(...)**$P where f() is positive (and monotonic), so, yes, odd values of $P should give you results between the results for $P-1 and $P+1, not like the results you posted. Sorry for the wrapping. You might want to put the big tables into READMORE tags so that they don't impact the rendering of the whole thread (though that should only happen for people who have code wrapping badly configured). - tye | [reply] |
by BrowserUk (Patriarch) on Apr 03, 2012 at 16:07 UTC | |
I suspect you need to start $toggle reversed for odd values Indeed. Using my $toggle = $P & 1 ? 1 : -1;, the output makes much more sense: <Reveal this spoiler or all in this thread>
you don't show most of the code so I didn't go much further in looking The entire code looks like this:
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] [select] |