Beefy Boxes and Bandwidth Generously Provided by pair Networks
There's more than one way to do things
 
PerlMonks  

Re^2: The statistics of hashing.

by BrowserUk (Patriarch)
on Mar 31, 2012 at 23:26 UTC ( [id://962808]=note: print w/replies, xml ) Need Help??


in reply to Re: The statistics of hashing.
in thread [OT] The statistics of hashing.

Seems almost certain to me,

That's a given. As with bloom filters, false positives are a fact of life. The question is how likely, or more to the point, how frequently, they will occur.

According to Wikipedia for a 32-bit hash you only need about 110,000 hashes to get a 75% chance of a collision.

Unfortunately, a good deal of what you read on wikipedia is less than reliable.

The following is empirical evidence from an on-going experiment.

After 3/4 of a billion trials, there were zero (possible) dups. After 1.2 billion, there are only 61.

These are the positions at which those possible dups were detected:

779967210 782382025 830840395 882115420 883031614 897571477 923155269 948108745 954455113 967783959 988381482 992691311 995935158 1011301141 1013742872 1022258193 1031874023 1034026887 1036254774 1037064360 1037193945 1037309710 1037661519 1041973643 1042577179 1045603197 1046414487 1047056233 1048652112 1048960948 1052042118 1057413610 1059736720 1067108427 1068042962 1069550972 1070612878 1075526924 1079515116 1079995733 1080999379 1086256410 1098797420 1110115220 1121074450 1121078210 1121376561 1132692265 1133452843 1138397053 1141581831 1143980294 1148289231 1149508590 1149743067 1151539362 1155227820 1156087258 1156164711 1158795083 1163191879 1165623466 1167648959 1168308302 1169005605 1173266361 1173729947 1174309756 1176812463 1178780066 1181065368 1183536832 1183744519 1183964959 1185195302 1185303999 1186231527 1186435292 1187445584 1189170990 1192726357 1195889596 1198540465 1198563209 1198727961 1201384004 1203571325 1204470932 1205045006 1205186694 1206609472 1209383402 1211297256 1212504859 1214473828 1217800153 1218236491 1219282533 1219801192 1220616524 1222841081 1226558804 1227645351 1229411437 1230159513 1230637415 1232098200 1232312863 1232491027 1233958398 1234452741 1237085887 1237298288 1237708765 1238281678 1238455038 1239452137 1240723226 1240980106 1241466028 1242695443 1242936270 1243921116 1245408660 1245914539 1247643815 1248306653 1249238533 1250144683 1250278843 1253111263 1254066285 1254097515 1254117382 1254371720 1255160872 1255282569 1255309049 1255779636 1257210934 1257614617 1259207077 1259923475 1261856200 1262605109 1262859049

As you can see, having filled 1/4 quarter of the total space available, false positives are just beginning to occur. And as the vector space fills, they are (as expected) becoming more frequent. (Of course, some of those may be actual duplicates. It will take a second run to determine that.)

But the question remains, how to calculate the probabilities of the mechanism.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

Replies are listed 'Best First'.
Re^3: The statistics of hashing.
by tobyink (Canon) on Apr 01, 2012 at 09:13 UTC

    Unfortunately, a good deal of what you read on wikipedia is less than reliable.

    Indeed, but the errors on Wikipedia are not evenly distributed. On a subject such as the birthday attack I'd expect Wikipedia's article to be on par with any other authority (short-lived events of vandalism notwithstanding).

    But the question remains, how to calculate the probabilities of the mechanism.

    The exact calculation involves some big numbers. But assuming that c($x, $y) is the "pick $y from $x" function used in combinatorics, then the probability of a collision for $n strings and an evenly distributed 32-bit hash function should be:

    $p = 1 - ( factorial($n) * c(2**32, $n) / 365**$n )

    Big numbers. Horrible to calculate. Can be approximated though...

    sub e () { 2.718281828 } my $t = ($n**2) / (2**33); $p = 1 - ( e ** -$t );

    Calculating $p is still horrible, but calculating $t is easier. If $t is above 20 then $p is 1.00000 when rounded to 6 significant figures.

    Thus you can effectively be sure to have a collision with a 32-bit hash function once $t is above 20. You can figure out an $n which triggers $t to be 20 using:

    $n = sqrt(20 * (2 ** 33));

    It's about 414,000. So with 414,000 strings, you are effectively certain to get collision on a 32-bit hash function.

    Where I think my reasoning and tye's differ (and tye is almost certainly correct here - blame it on me answering late at night) is that I was then looking at the probabilities that you will have had collisions in all four (or ten) hash functions at the end of the entire run. With even half a million strings, that is a given.

    What you're actually doing is looking at events where a single string triggers a simultaneous collision in all the hash functions. I defer to tye's calculations for that.

    perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'
      Indeed, but the errors on Wikipedia are not evenly distributed. On a subject such as the birthday attack I'd expect Wikipedia's article to be on par with any other authority

      I think the first mistake that you (and Tye) are making here, is assuming that "the birthday paradox" (as applied to the each of the 32-bit hashes individually) is the controlling factor. It isn't.

      The controlling factor is the combinatorial affect of the 4 (or, in my case 10) separate hashes & vectors.

      I think that applying the birthday paradox to the 128-bit hash may prove to be an accurate upper bound. But it is so large relative to the limits of the 32-bit vectors as to never actually come into play.

      I had tried applying various formulae I found on wikipedia and elsewhere, but found they do not match with reality. Hence, this post.

      You might also look at the math described for Bloom filters. It also contradicts these naive applications of the birthday paradox, but is sufficiently well researched and documented (sources outside of wikipedia) to show that naive application of the birthday paradox is not applicable when combining multiple hashes in this way.

      The problem with the Bloom filter math is that it is based upon setting the bits derived from all the different hashes into the same vector. Ie. Each Bloom filter insert sets N bits in a single vector, where N is the number of different hashes being used. This obviously causes the vector to fill up much more quickly than my mechanism which uses a separate vector for each hash.

      My gut feel, is that as my vectors fill up N times more slowly, the probabilities of collisions for my mechanism are 2**N times lower than with a Bloom filter (at the expense of N*the memory requirement). This is the notion I am trying to verify here.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

      The start of some sanity?

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://962808]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others perusing the Monastery: (6)
As of 2024-03-28 19:12 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found