Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

Re^4: [OT] The statistics of hashing.

by roboticus (Chancellor)
on Apr 03, 2012 at 13:32 UTC ( [id://963244]=note: print w/replies, xml ) Need Help??


in reply to Re^3: [OT] The statistics of hashing.
in thread [OT] The statistics of hashing.

BrowserUk:

In the iterative solution, we're accumulating f(x)=(1-e^(-x/N))^h over the range of x=0 .. NumSamples. That's a rough form of computing the definite integral of the expression. Integrating over three variables (x, N, h) would be a pain, so I treated N and h as constants.

So first, we multiply out our f(x) expression to remove the exponent. So using h as 2, we get:

f(x) = 1 - 2e^(-x/N) + e^(-2x/N) integ(f(x)) = integ( 1 - 2e^(-x/N) + e^(-2x/N) ) = integ( 1 ) - 2*(integ(e^(-x/N))) + integ(e^(-2x/N)) since: integ(1) = x + C integ(e^(bx)) = (1/b)e^(bx) + C integ(f(x)) = [ x+C ] + [ (-2N)e^(-x/N) + C ] + [ (-N/2)e^(-2x/N) + C +]

Computing a definite integral over a range is simply integ(f(x)) evaluated at the upper limit less the value evaluated at the lower limit. This causes the C terms to cancel.

Pascal's triangle comes out because we've got (a+b)^n, and when we multiply it out, we get the binomial expansion which is where the coefficients come into play.

One point I should mention: You don't have to use 0 as the lower bound. If you wanted the number of collisions you'd experience from sample A to sample B, just evaluate integ(f(B))-integ(f(A)). By using A=0 we compute the number of collisions for the entire run.

...roboticus

When your only tool is a hammer, all problems look like your thumb.

Replies are listed 'Best First'.
Re^5: [OT] The statistics of hashing.
by BrowserUk (Patriarch) on Apr 03, 2012 at 15:01 UTC

    Thanks. As may be (becoming) obvious, much of that is over my head for now :)

    However, Now I know how to derive the constants, I have this which I can substitute for your ex_4() & ex_10() by supplying the power as the first argument. Its output matches those two exactly for all the argument combinations I've tried:

    sub PTn { my @row; for( 1 .. shift ) { push @row, 1; $row [$_] += $row [$_ - 1] for reverse 1 .. @row - 2; } return @row; } sub expN { my( $P, $N, $X ) = @_; my $xDIVn = $X / $N; my @coefs = PTn( $P+1 ); my $rv = 0; my $toggle = -1; for my $p ( reverse 1 .. $P ) { $rv += $toggle * $coefs[ $p ] * $N / $p * exp( -$p * $xDIVn ); $toggle *= -1; } return $rv + $X; }

    That allowed me to investigate the affect of using fewer or more hashes without having to hand code a new function for each.

    And the results are quite interesting. This shows that each new (pair?) of hashes added does increase the discrimination substantially, though the gains obviously fall off fairly rapidly. But the really interesting part are the numbers for odd numbers of hashes:

    C:\test>rbtcs-form-verify -H=10 -B=32 Using N x 2**32 vectors: 12 : 24 0 24 0 24 + 0 24 0 24 0 16 : 32 0 31 0 31 + 0 31 0 31 0 24 : 48 0 48 0 48 + 0 48 0 48 0 32 : 64 0 64 0 64 + 0 64 0 64 0 48 : 95 0 95 0 95 + 0 95 0 96 0 64 : 127 0 127 0 127 + 0 128 0 128 0 96 : 191 0 191 0 192 + 0 192 0 192 0 128 : 255 0 256 0 256 + 0 256 0 256 0 192 : 383 0 383 0 383 + 0 384 0 384 0 256 : 511 0 512 0 512 + 0 512 0 512 0 384 : 767 0 768 0 768 + 0 768 0 768 0 512 : 1023 0 1024 0 1024 + 0 1024 0 1024 0 768 : 1535 0 1536 0 1536 + 0 1536 0 1536 0 1024 : 2047 0 2048 0 2048 + 0 2048 0 2048 0 1536 : 3071 0 3072 0 3072 + 0 3072 0 3072 0 2048 : 4095 0 4096 0 4096 + 0 4096 0 4096 0 3072 : 6143 0 6144 0 6144 + 0 6144 0 6144 0 4096 : 8191 0 8192 0 8192 + 0 8192 0 8192 0 6144 : 12287 0 12288 0 12288 + 0 12288 0 12288 0 8192 : 16383 0 16384 0 16383 + 0 16383 0 16384 0 12288 : 24575 0 24576 0 24576 + 0 24576 0 24576 0 16384 : 32767 0 32767 0 32767 + 0 32767 0 32768 0 24576 : 49151 0 49151 0 49152 + 0 49152 0 49152 0 32768 : 65535 0 65536 0 65536 + 0 65536 0 65536 0 49152 : 98303 0 98304 0 98304 + 0 98303 0 98303 0 65536 : 131071 0 131072 0 131071 + 0 131072 0 131072 0 98304 : 196606 0 196608 0 196607 + 0 196608 0 196608 0 131072 : 262142 0 262144 0 262144 + 0 262143 0 262144 0 196608 : 393211 0 393216 0 393216 + 0 393215 0 393216 0 262144 : 524280 0 524288 0 524287 + 0 524287 0 524288 0 393216 : 786414 0 786431 0 786431 + 0 786432 0 786432 0 524288 : 1048544 0 1048575 0 1048576 + 0 1048576 0 1048576 0 786432 : 1572792 0 1572863 0 1572864 + 0 1572864 0 1572864 0 1048576 : 2097024 0 2097151 0 2097151 + 0 2097152 0 2097152 0 1572864 : 3145440 0 3145727 0 3145728 + 0 3145727 0 3145728 0 2097152 : 4193792 0 4194303 0 4194304 + 0 4194304 0 4194304 0 3145728 : 6290304 0 6291455 0 6291455 + 0 6291455 0 6291455 0 4194304 : 8386560 1 8388607 0 8388608 + 0 8388608 0 8388608 0 6291456 : 12578306 4 12582911 0 12582912 + 0 12582912 0 12582912 0 8388608 : 16769029 10 16777215 0 16777216 + 0 16777215 0 16777216 0 12582912 : 25147409 35 25165823 0 25165823 + 0 25165824 0 25165824 0 16777216 : 33521706 85 33554431 0 33554431 + 0 33554431 0 33554432 0 25165824 : 50258063 286 50331646 0 50331647 + 0 50331648 0 50331648 0 33554432 : 66978132 678 67108860 0 67108863 + 0 67108864 0 67108864 0 50331648 : 100369532 2283 100663276 0 100663295 + 0 100663296 0 100663296 0 67108864 : 133696160 5397 134217665 0 134217727 + 0 134217728 0 134217728 0 100663296 : 200156106 18111 201326276 5 201326591 + 0 201326591 0 201326592 0 134217728 : 266359979 42681 268434469 24 268435455 + 0 268435455 0 268435455 0 201326592 : 398007464 142383 402648282 179 402653177 + 0 402653183 0 402653184 0 268435456 : 528654369 333608 536855705 738 536870874 + 1 536870911 0 536870911 0 402653184 : 787008255 1100214 805232175 5328 805305969 + 30 805306365 0 805306367 0 536870912 : 1041542872 2548692 1073515796 21337 1073739728 + 211 1073741802 2 1073741823 0 805306368 : 1539620713 8218836 1609548824 146448 1610591774 + 3082 1610612273 70 1610612725 1 1073741824 : 2023785226 18623993 2144354575 558473 2147380065 + 19731 2147479815 755 2147483497 30 1610612736 : 2953695056 57532294 3207472286 3485537 3220308576 + 247526 3221157361 19018 3221220099 1532 2147483648 : 3837421596 125076314 4257101987 12129165 4290939237 + 1371778 4294491373 167491 4294907677 21417 3221225472 : 5487393872 357203949 6295545197 63685472 6413893311 +13112112 6436324179 2901774 6441061714 670967 4294967296 : 7009904423 721946381 8229596479 188903097 8487725572 +56541428 8558136669 18112153 8579512265 6047518

    I suspect it is a coding error on my behalf, but I guess it could be a quirk of the numbers?

    Here's a set using 1 .. 16 2**16 bit hashes:

    C:\test>rbtcs-form-verify -H=16 -B=16 Using N x 2**16 vectors: 12 : 23 0 23 0 24 0 24 0 + 24 0 23 0 23 0 23 0 16 : 31 0 31 0 31 0 32 0 + 32 0 32 0 32 0 32 0 24 : 47 0 47 0 48 0 47 0 + 48 0 47 0 47 0 47 0 32 : 63 0 63 0 64 0 64 0 + 64 0 64 0 64 0 64 0 48 : 95 0 95 0 95 0 95 0 + 95 0 96 0 96 0 96 0 64 : 127 0 127 0 128 0 128 0 + 128 0 127 0 128 0 127 0 96 : 191 0 191 0 192 0 192 0 + 192 0 192 0 191 0 191 0 128 : 255 0 255 0 256 0 255 0 + 256 0 256 0 256 0 256 0 192 : 383 0 383 0 383 0 384 0 + 384 0 384 0 384 0 384 0 256 : 511 0 511 0 511 0 511 0 + 512 0 511 0 512 0 512 0 384 : 766 0 767 0 767 0 768 0 + 768 0 767 0 768 0 767 0 512 : 1022 0 1023 0 1023 0 1024 0 + 1024 0 1024 0 1024 0 1024 0 768 : 1531 0 1535 0 1535 0 1536 0 + 1536 0 1536 0 1536 0 1536 0 1024 : 2040 0 2047 0 2047 0 2048 0 + 2048 0 2048 0 2048 0 2048 0 1536 : 3054 0 3071 0 3071 0 3071 0 + 3072 0 3072 0 3072 0 3072 0 2048 : 4064 0 4095 0 4095 0 4095 0 + 4095 0 4096 0 4096 0 4096 0 3072 : 6073 2 6143 0 6143 0 6143 0 + 6144 0 6144 0 6144 0 6144 0 4096 : 8066 5 8191 0 8191 0 8191 0 + 8191 0 8191 0 8191 0 8192 0 6144 : 12008 16 12286 0 12287 0 12287 0 +12287 0 12287 0 12288 0 12287 0 8192 : 15892 38 16380 0 16383 0 16383 0 +16383 0 16383 0 16383 0 16383 0 12288 : 23492 125 24559 2 24575 0 24575 0 +24575 0 24575 0 24575 0 24576 0 16384 : 30880 284 32720 8 32766 0 32767 0 +32767 0 32767 0 32767 0 32767 0 24576 : 45069 877 48942 53 49138 3 49150 0 +49151 0 49151 0 49151 0 49151 0 32768 : 58554 1908 64958 185 65474 20 65528 2 +65535 0 65535 0 65535 0 65535 0 49152 : 83730 5450 96062 971 97868 200 98210 44 +98282 10 98299 2 98302 0 98303 0 65536 : 106962 11016 125573 2882 129512 862 130586 276 1 +30912 92 131018 31 131053 11 131065 3

    Sorry for the wrapping.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

      I suspect you need to start $toggle reversed for odd values (or for even values, though you note that you verified against the two even cases). Though, I would have expected that particular mistake to lead to negative values (but you don't show most of the code so I didn't go much further in looking). The numbers are supposed to be f(...)**$P where f() is positive (and monotonic), so, yes, odd values of $P should give you results between the results for $P-1 and $P+1, not like the results you posted.

      Sorry for the wrapping.

      You might want to put the big tables into READMORE tags so that they don't impact the rendering of the whole thread (though that should only happen for people who have code wrapping badly configured).

      - tye        

        I suspect you need to start $toggle reversed for odd values

        Indeed. Using my $toggle = $P & 1 ? 1 : -1;, the output makes much more sense:

        C:\test>rbtcs-form-verify -H=16 -B=16 Using N x 2**16 vectors: 12 : 0 0 0 0 0 0 0 0 + 0 0 0 0 0 0 0 0 16 : 0 0 0 0 0 0 0 0 + 0 0 0 0 0 0 0 0 24 : 0 0 0 0 0 0 0 0 + 0 0 0 0 0 0 0 0 32 : 0 0 0 0 0 0 0 0 + 0 0 0 0 0 0 0 0 48 : 0 0 0 0 0 0 0 0 + 0 0 0 0 0 0 0 0 64 : 0 0 0 0 0 0 0 0 + 0 0 0 0 0 0 0 0 96 : 0 0 0 0 0 0 0 0 + 0 0 0 0 0 0 0 0 128 : 0 0 0 0 0 0 0 0 + 0 0 0 0 0 0 0 0 192 : 0 0 0 0 0 0 0 0 + 0 0 0 0 0 0 0 0 256 : 0 0 0 0 0 0 0 0 + 0 0 0 0 0 0 0 0 384 : 1 0 0 0 0 0 0 0 + 0 0 0 0 0 0 0 0 512 : 1 0 0 0 0 0 0 0 + 0 0 0 0 0 0 0 0 768 : 4 0 0 0 0 0 0 0 + 0 0 0 0 0 0 0 0 1024 : 7 0 0 0 0 0 0 0 + 0 0 0 0 0 0 0 0 1536 : 17 0 0 0 0 0 0 0 + 0 0 0 0 0 0 0 0 2048 : 31 0 0 0 0 0 0 0 + 0 0 0 0 0 0 0 0 3072 : 70 2 0 0 0 0 0 0 + 0 0 0 0 0 0 0 0 4096 : 125 5 0 0 0 0 0 0 + 0 0 0 0 0 0 0 0 6144 : 279 16 1 0 0 0 0 0 + 0 0 0 0 0 0 0 0 8192 : 491 38 3 0 0 0 0 0 + 0 0 0 0 0 0 0 0 12288 : 1083 125 16 2 0 0 0 0 + 0 0 0 0 0 0 0 0 16384 : 1887 284 47 8 1 0 0 0 + 0 0 0 0 0 0 0 0 24576 : 4082 877 209 53 13 3 1 0 + 0 0 0 0 0 0 0 0 32768 : 6981 1908 577 185 61 20 7 2 + 0 0 0 0 0 0 0 0 49152 : 14573 5450 2241 971 435 200 93 44 + 21 10 4 2 1 0 0 0 65536 : 24109 11016 5498 2882 1559 862 485 276 + 159 92 53 31 18 11 6 3
        you don't show most of the code so I didn't go much further in looking

        The entire code looks like this:

        #! perl -slw use strict; sub PTn { my @row; for( 1 .. shift ) { push @row, 1; $row [$_] += $row [$_ - 1] for reverse 1 .. @row - 2; } return @row; } sub expN { my( $P, $N, $X ) = @_; my $xDIVn = $X / $N; my @coefs = PTn( $P+1 ); my $rv = 0; my $toggle = $P & 1 ? 1 : -1; for my $p ( reverse 1 .. $P ) { $rv += $toggle * $coefs[ $p ] * $N / $p * exp( -$p * $xDIVn ); $toggle *= -1; } return $rv + $X; } our $H //= 10; our $B //= 32; print "Using N x 2**$B vectors:"; for my $inserts ( map{ 2**$_*3/4, 2**$_ } 4 .. $B ) { printf "%10d : ", $inserts; for my $h ( 1 .. $H ) { printf "%6d ", expN( $h, 2**$B, $inserts ) - expN( $h, 2**$B, +0 ); } print ''; }

        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        The start of some sanity?

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://963244]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others pondering the Monastery: (4)
As of 2024-03-28 16:41 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found