Pathologically Eclectic Rubbish Lister PerlMonks

Re^2: The statistics of hashing.

by BrowserUk (Pope)
 on Mar 31, 2012 at 23:26 UTC ( #962808=note: print w/replies, xml ) Need Help??

in reply to Re: The statistics of hashing.
in thread [OT] The statistics of hashing.

Seems almost certain to me,

That's a given. As with bloom filters, false positives are a fact of life. The question is how likely, or more to the point, how frequently, they will occur.

According to Wikipedia for a 32-bit hash you only need about 110,000 hashes to get a 75% chance of a collision.

Unfortunately, a good deal of what you read on wikipedia is less than reliable.

The following is empirical evidence from an on-going experiment.

After 3/4 of a billion trials, there were zero (possible) dups. After 1.2 billion, there are only 61.

These are the positions at which those possible dups were detected:

```779967210
782382025
830840395
882115420
883031614
897571477
923155269
948108745
954455113
967783959
988381482
992691311
995935158
1011301141
1013742872
1022258193
1031874023
1034026887
1036254774
1037064360
1037193945
1037309710
1037661519
1041973643
1042577179
1045603197
1046414487
1047056233
1048652112
1048960948
1052042118
1057413610
1059736720
1067108427
1068042962
1069550972
1070612878
1075526924
1079515116
1079995733
1080999379
1086256410
1098797420
1110115220
1121074450
1121078210
1121376561
1132692265
1133452843
1138397053
1141581831
1143980294
1148289231
1149508590
1149743067
1151539362
1155227820
1156087258
1156164711
1158795083
1163191879
1165623466
1167648959
1168308302
1169005605
1173266361
1173729947
1174309756
1176812463
1178780066
1181065368
1183536832
1183744519
1183964959
1185195302
1185303999
1186231527
1186435292
1187445584
1189170990
1192726357
1195889596
1198540465
1198563209
1198727961
1201384004
1203571325
1204470932
1205045006
1205186694
1206609472
1209383402
1211297256
1212504859
1214473828
1217800153
1218236491
1219282533
1219801192
1220616524
1222841081
1226558804
1227645351
1229411437
1230159513
1230637415
1232098200
1232312863
1232491027
1233958398
1234452741
1237085887
1237298288
1237708765
1238281678
1238455038
1239452137
1240723226
1240980106
1241466028
1242695443
1242936270
1243921116
1245408660
1245914539
1247643815
1248306653
1249238533
1250144683
1250278843
1253111263
1254066285
1254097515
1254117382
1254371720
1255160872
1255282569
1255309049
1255779636
1257210934
1257614617
1259207077
1259923475
1261856200
1262605109
1262859049

As you can see, having filled 1/4 quarter of the total space available, false positives are just beginning to occur. And as the vector space fills, they are (as expected) becoming more frequent. (Of course, some of those may be actual duplicates. It will take a second run to determine that.)

But the question remains, how to calculate the probabilities of the mechanism.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

Replies are listed 'Best First'.
Re^3: The statistics of hashing.
by tobyink (Canon) on Apr 01, 2012 at 09:13 UTC

Unfortunately, a good deal of what you read on wikipedia is less than reliable.

Indeed, but the errors on Wikipedia are not evenly distributed. On a subject such as the birthday attack I'd expect Wikipedia's article to be on par with any other authority (short-lived events of vandalism notwithstanding).

But the question remains, how to calculate the probabilities of the mechanism.

The exact calculation involves some big numbers. But assuming that c(\$x, \$y) is the "pick \$y from \$x" function used in combinatorics, then the probability of a collision for \$n strings and an evenly distributed 32-bit hash function should be:

```  \$p = 1 - ( factorial(\$n) * c(2**32, \$n) / 365**\$n )

Big numbers. Horrible to calculate. Can be approximated though...

```  sub e () { 2.718281828 }
my \$t = (\$n**2) / (2**33);
\$p = 1 - ( e ** -\$t );

Calculating \$p is still horrible, but calculating \$t is easier. If \$t is above 20 then \$p is 1.00000 when rounded to 6 significant figures.

Thus you can effectively be sure to have a collision with a 32-bit hash function once \$t is above 20. You can figure out an \$n which triggers \$t to be 20 using:

```  \$n = sqrt(20 * (2 ** 33));

It's about 414,000. So with 414,000 strings, you are effectively certain to get collision on a 32-bit hash function.

Where I think my reasoning and tye's differ (and tye is almost certainly correct here - blame it on me answering late at night) is that I was then looking at the probabilities that you will have had collisions in all four (or ten) hash functions at the end of the entire run. With even half a million strings, that is a given.

What you're actually doing is looking at events where a single string triggers a simultaneous collision in all the hash functions. I defer to tye's calculations for that.

perl -E'sub Monkey::do{say\$_,for@_,do{(\$monkey=[caller(0)]->[3])=~s{::}{ }and\$monkey}}"Monkey say"->Monkey::do'
Indeed, but the errors on Wikipedia are not evenly distributed. On a subject such as the birthday attack I'd expect Wikipedia's article to be on par with any other authority

I think the first mistake that you (and Tye) are making here, is assuming that "the birthday paradox" (as applied to the each of the 32-bit hashes individually) is the controlling factor. It isn't.

The controlling factor is the combinatorial affect of the 4 (or, in my case 10) separate hashes & vectors.

I think that applying the birthday paradox to the 128-bit hash may prove to be an accurate upper bound. But it is so large relative to the limits of the 32-bit vectors as to never actually come into play.

I had tried applying various formulae I found on wikipedia and elsewhere, but found they do not match with reality. Hence, this post.

You might also look at the math described for Bloom filters. It also contradicts these naive applications of the birthday paradox, but is sufficiently well researched and documented (sources outside of wikipedia) to show that naive application of the birthday paradox is not applicable when combining multiple hashes in this way.

The problem with the Bloom filter math is that it is based upon setting the bits derived from all the different hashes into the same vector. Ie. Each Bloom filter insert sets N bits in a single vector, where N is the number of different hashes being used. This obviously causes the vector to fill up much more quickly than my mechanism which uses a separate vector for each hash.

My gut feel, is that as my vectors fill up N times more slowly, the probabilities of collisions for my mechanism are 2**N times lower than with a Bloom filter (at the expense of N*the memory requirement). This is the notion I am trying to verify here.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

Create A New User
Node Status?
node history
Node Type: note [id://962808]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others avoiding work at the Monastery: (6)
As of 2020-12-03 07:00 GMT
Sections?
Information?
Find Nodes?
Leftovers?
Voting Booth?
How often do you use taint mode?

Results (51 votes). Check out past polls.

Notices?