 laziness, impatience, and hubris PerlMonks

### [OT] The statistics of hashing.

by BrowserUk (Pope)
on Mar 31, 2012 at 22:03 UTC

BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

I'm comparing several billion long strings for uniqueness. A brute force cross comparison is out of the question. Building an in-memory hash or suffix tree is impractical. And disk-based solutions too slow.

The mechanism chosen for the task is as follows:

1. Calculate the 128-bit MD5 hash of each string.
2. Break that into 4 x 32-bit integers.
3. Use those four integers to set bits in each of 4 corresponding 2**32-bit vectors.
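For concreteness, a minimal sketch of steps 1-3 (the `seen()` helper is hypothetical, and the demo is scaled down to 24-bit sub-hashes so it fits in a few MB; the real scheme uses the full 32-bit sub-hashes and four 2**32-bit -- i.e. 512MB -- vectors):

```perl
use strict;
use warnings;
use Digest::MD5 qw( md5 );

## Scaled down for illustration; the real scheme uses BITS => 32.
use constant BITS => 24;

my @vec = ( '' ) x 4;    ## four bit vectors, grown on demand by vec()

## Returns true if all 4 bits were already set (a possible duplicate).
sub seen {
    my( $string ) = @_;
    ## 4 x 32-bit big-endian ints from the 128-bit MD5, masked for the demo
    my @sub = map { $_ & ( 2**BITS - 1 ) } unpack 'N4', md5( $string );
    my $hits = 0;
    for my $i ( 0 .. 3 ) {
        $hits += vec( $vec[ $i ], $sub[ $i ], 1 );
        vec( $vec[ $i ], $sub[ $i ], 1 ) = 1;
    }
    return $hits == 4;
}
```

The first call with a given string reports it unseen and sets its 4 bits; a second call with the same string finds all 4 bits set.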

The question is: What is the probability of getting false positives?

That is, the odds of finding all 4 bits (1 in each vector) set when the MD5 hash being tested has not previously been set into the vectors.

Ignoring the fact that any hashing mechanism that maps a large range (10e44 in this case) to a smaller range (2**128, ~3.4e38) will produce duplicate mappings, and assuming that the MD5 hashing algorithm and its subdivision into 4x 32-bit hashes provide a perfect distribution.

• When setting the first value into the vectors, the odds of a false collision are obviously 0.
• When setting the second value in, the odds of all 4 sub hashes being the same, whilst the MD5 is different are again zero.
• When setting the third value in, the odds that each of the 4 sub hashes will match 1 of the two sub hashes previously set into that vector are still very small, but tangible.
• And as each new quad of sub hashes is added to the vectors, the odds of a false positive increase over the previous addition.

#### How to calculate those odds for each new hash, as the number of hashes already set into the vectors increases?

Further to that, if the number of sub hashes and vectors is increased to 10, by permuting each of the four sub hashes against the other 3 and XORing, does that substantially decrease the odds of false matches? (Or increase them? Or leave them the same?)

The mechanism is a variation on the theme of Bloom filters. The statistics of Bloom filters are documented, but they go over my head.

I believe that using separate vectors for the sub hashes should improve (decrease) the odds of false positives, but my math is insufficient to verify that belief.

Can these probabilities be calculated in a way that a stats layman (me) will understand?

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

Replies are listed 'Best First'.
Re: [OT] The statistics of hashing.
by graff (Chancellor) on Apr 01, 2012 at 00:49 UTC
Is your ultimate goal to simply estimate how many of the strings are unique (or conversely, how many are duplicates), and possibly provide a confidence measure for the estimate? Or are you hoping to come up with an exact number of actual uniques/duplicates for the whole set? (Based on your stated plan, it seems clear that you aren't trying to identify/enumerate the unique strings.) If you're just looking for an estimate, why not work on a manageable and reasonably sampled subset of the data?

In my linguistic background, "several" usually means "more than 4, less than 10". If that's what it means for you as well, and if syphilis has provided a valid metric, then once you're done processing the first 4 billion strings, each new string will definitely be counted as a duplicate, regardless of whether it's a true or false positive. (Perhaps that's not exactly an absolute certainty, but it seems close enough.)

If you're just looking for an estimate, why not work on a manageable and reasonably sampled subset of the data?

Because, according to (my application of) the formulae I found online, I should be seeing false positives after low 10s of millions of inserts. But my testing showed none after 1e6, 10e6, 100e6, 200e6 ...

I.e., my sampling showed that the calculations were wrong, but provided no information upon which to draw any further conclusions.

The nature of the beast is that I needed to run the experiment for 3/4 of a billion iterations in order to get my first piece of actual information from it.

This started out being about verifying my gut with respect to the big strings. That led to looking for a mechanism capable of verifying uniqueness with the huge numbers involved. That led -- through various documented mechanisms that proved inadequate -- to coming up with the current scheme, which appears to work substantially better than all the documented mechanisms.

This post is about trying to come up with math that fits the experimental evidence.


Re: The statistics of hashing.
by tobyink (Canon) on Mar 31, 2012 at 22:58 UTC

What is the probability of getting false positives?

Seems almost certain to me, but my reasoning may well be faulty. Stats were never my forte.

You're effectively running four different 32-bit hash functions on each string. According to Wikipedia for a 32-bit hash you only need about 110,000 hashes to get a 75% chance of a collision.

Thus with 110,000 strings, getting a collision in all four stands at 75%**4, about 30% chance. That's with 110,000 strings. You have "several billion".

With several billion strings you are going to have a very close to 100% chance of a collision on a 32-bit hash function. With four 32-bit hash functions, that's still close to 100% chance of a collision.

Extending this to having 10 effective hash functions cannot increase the chance of collisions. Worst case scenario, it has no effect. Here I think it does decrease the chance, but not enough to be useful.

perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->)=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'
Seems almost certain to me,

That's a given. As with bloom filters, false positives are a fact of life. The question is how likely, or more to the point, how frequently, they will occur.

According to Wikipedia for a 32-bit hash you only need about 110,000 hashes to get a 75% chance of a collision.

Unfortunately, a good deal of what you read on Wikipedia is less than reliable.

The following is empirical evidence from an on-going experiment.

After 3/4 of a billion trials, there were zero (possible) dups. After 1.2 billion, there are only 61.

These are the positions at which those possible dups were detected:

```779967210
782382025
830840395
882115420
883031614
897571477
923155269
948108745
954455113
967783959
988381482
992691311
995935158
1011301141
1013742872
1022258193
1031874023
1034026887
1036254774
1037064360
1037193945
1037309710
1037661519
1041973643
1042577179
1045603197
1046414487
1047056233
1048652112
1048960948
1052042118
1057413610
1059736720
1067108427
1068042962
1069550972
1070612878
1075526924
1079515116
1079995733
1080999379
1086256410
1098797420
1110115220
1121074450
1121078210
1121376561
1132692265
1133452843
1138397053
1141581831
1143980294
1148289231
1149508590
1149743067
1151539362
1155227820
1156087258
1156164711
1158795083
1163191879
1165623466
1167648959
1168308302
1169005605
1173266361
1173729947
1174309756
1176812463
1178780066
1181065368
1183536832
1183744519
1183964959
1185195302
1185303999
1186231527
1186435292
1187445584
1189170990
1192726357
1195889596
1198540465
1198563209
1198727961
1201384004
1203571325
1204470932
1205045006
1205186694
1206609472
1209383402
1211297256
1212504859
1214473828
1217800153
1218236491
1219282533
1219801192
1220616524
1222841081
1226558804
1227645351
1229411437
1230159513
1230637415
1232098200
1232312863
1232491027
1233958398
1234452741
1237085887
1237298288
1237708765
1238281678
1238455038
1239452137
1240723226
1240980106
1241466028
1242695443
1242936270
1243921116
1245408660
1245914539
1247643815
1248306653
1249238533
1250144683
1250278843
1253111263
1254066285
1254097515
1254117382
1254371720
1255160872
1255282569
1255309049
1255779636
1257210934
1257614617
1259207077
1259923475
1261856200
1262605109
1262859049

As you can see, having filled a quarter of the total space available, false positives are just beginning to occur. And as the vector space fills, they are (as expected) becoming more frequent. (Of course, some of those may be actual duplicates. It will take a second run to determine that.)

But the question remains, how to calculate the probabilities of the mechanism.


Unfortunately, a good deal of what you read on wikipedia is less than reliable.

Indeed, but the errors on Wikipedia are not evenly distributed. On a subject such as the birthday attack I'd expect Wikipedia's article to be on par with any other authority (short-lived events of vandalism notwithstanding).

But the question remains, how to calculate the probabilities of the mechanism.

The exact calculation involves some big numbers. But assuming that c($x, $y) is the "pick $y from $x" function used in combinatorics, then the probability of a collision for $n strings and an evenly distributed 32-bit hash function should be:

```  $p = 1 - ( factorial($n) * c(2**32, $n) / (2**32)**$n )

Big numbers. Horrible to calculate. Can be approximated though...

```  sub e () { 2.718281828 }
my $t = ($n**2) / (2**33);
$p = 1 - ( e ** -$t );

Calculating $p is still horrible, but calculating $t is easier. If $t is above 20 then $p is 1.00000 when rounded to 6 significant figures.

Thus you can effectively be sure to have a collision with a 32-bit hash function once $t is above 20. You can figure out an $n which triggers $t to be 20 using:

```  $n = sqrt(20 * (2 ** 33));

It's about 414,000. So with 414,000 strings, you are effectively certain to get a collision on a 32-bit hash function.
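A quick sanity check of those numbers:

```perl
my $n = sqrt( 20 * ( 2 ** 33 ) );   ## strings needed to push $t to 20
my $t = ( $n ** 2 ) / ( 2 ** 33 );  ## back to 20, by construction
my $p = 1 - exp( -$t );             ## odds of at least one 32-bit collision

printf "n = %.0f, p = %.8f\n", $n, $p;
```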

Where I think my reasoning and tye's differ (and tye is almost certainly correct here - blame it on me answering late at night) is that I was then looking at the probabilities that you will have had collisions in all four (or ten) hash functions at the end of the entire run. With even half a million strings, that is a given.

What you're actually doing is looking at events where a single string triggers a simultaneous collision in all the hash functions. I defer to tye's calculations for that.


Wikipedia is correct (well, agrees quite closely with some of my calculations). But you have extrapolated from that quite incorrectly.

0.75**4 is the odds that, after inserting 1.1e5 strings, there has been at least one collision in each of the four hashes. It is not even close to the odds of there being a single string that produced a simultaneous collision in all 4 hashes.

Those odds would be more like 0.75*(0.75/1.1e5)**3 or about 2e-16. My calculations came up with an upper bound of 9.5e-15.

I believe the 9.5e-15 number is much, much more accurate. It only discounts the possibility of there being fewer than 1.1e5 bits set. I suspect the ratio of "expected bits set"/1.1e5 to be very close to 1.

The 2e-16 number, meanwhile, is based on there being a 75% chance of exactly one collision. But there is a 25% chance of 0 collisions and a 75% chance of 1 or more collisions. And the possibility of 2 collisions approximately doubles the odds for some subset of the possible outcomes, and so has a much bigger impact on the final total than does the possibility of there being 1.1e5-1 bits set instead of 1.1e5.

For example, consider tossing a coin twice. There is a 75% chance of getting at least one "head". Do 4 such trials. For each head, assign it a random position in 1.1e5 slots (but only one head per slot per trial). The odds of the 4 trials producing 4 heads in one of those slots (one from each trial) one might estimate as:

```   .75* (.75/x)   * (.75/x)   * (.75/x)

Where x==1.1e5. Or, 2.3e-16. But to find the accurate odds you have to realize that there is a 50% chance of one "head" (per trial) and a 25% chance of two "heads" (per trial). So the actual odds are found via:

```   .5 * (.5/x)    * (.5/x)    * (.5/x)
+ .5 * (.25*2/x) * (.5/x)    * (.5/x)
+ .5 * (.5/x)    * (.25*2/x) * (.5/x)
+ .5 * (.25*2/x) * (.25*2/x) * (.5/x)
+ .5 * (.5/x)    * (.5/x)    * (.25*2/x)
+ .5 * (.25*2/x) * (.5/x)    * (.25*2/x)
+ .5 * (.5/x)    * (.25*2/x) * (.25*2/x)
+ .5 * (.25*2/x) * (.25*2/x) * (.25*2/x)
+ .25* (.5/x1)   * (.5/x1)   * (.5/x1)
+ .25* (.25*x12) * (.5/x1)   * (.5/x1)
+ .25* (.5/x1)   * (.25*x12) * (.5/x1)
+ .25* (.25*x12) * (.25*x12) * (.5/x1)
+ .25* (.5/x1)   * (.5/x1)   * (.25*x12)
+ .25* (.25*x12) * (.5/x1)   * (.25*x12)
+ .25* (.5/x1)   * (.25*x12) * (.25*x12)
+ .25* (.25*x12) * (.25*x12) * (.25*x12)

Where x1==1.1e5-1 and x12==1/(1.1e5-1)+1/(1.1e5-2). That raises the odds by a factor of 2.37 to 5.6e-16.

That type of magnification of the odds will be much more pronounced in the case originally discussed and I bet would result in nearly the 48-fold increase required to get to the odds of my upper bound calculation.

- tye

Thus with 110,000 strings, getting a collision in all four stands at 75%**4, about 30% chance. That's with 110,000 strings. You have "several billion".
If, after 110,000 strings, you get a 75% chance of a collision, 75%**4 gives you the chance that 4 sets of 110,000 each give you a collision. But that's nowhere near the probability that the same pair gives you a collision in all four. Which is what BrowserUK needs.

It would take me too much time to dig up the math required to do the calculation. But what BrowserUK may try to do: generate a billion strings that are all unique. Run the proposed algorithm. See how many collisions are reported: they're all false positives. Repeat several times (with different strings). You may even want to calculate a confidence interval. Of course, I've no idea how feasible this is. If it takes a week to process a data set this size, you may not want to do so.

Re: [OT] The statistics of hashing.
by syphilis (Bishop) on Mar 31, 2012 at 23:13 UTC
How to calculate those odds for each new hash, as the number of hashes already set into the vectors increases?

Update: Following figure is not right. (Corrected figure provided in following post.)

IIUC it's just (N/4294967296)**4, where N is the number of MD5 hashes that have already been entered into the bit vector.
But that's assuming that MD5 hashes distribute evenly, and I don't know if that has been established (or disproved). If they don't distribute evenly, then the odds of hitting a false positive will increase.

Not sure what effect your "Bloom Filters" variation would have.

Cheers,
Rob
IIUC it's just (N/4294967296)**4

I believe the figure I should have provided is:
(1 - ((4294967295/4294967296) ** N)) ** 4

And if that's not right, I give up.

Update: The logic is simple. (Danger, Will Robinson ;-)
If you're picking numbers at random in the range (1 .. 4294967296), then the probability that the N+1th pick has already come up is:
1 minus the probability that it has *not* already come up
and the probability that it has *not* already come up is (4294967295/4294967296) ** N

So that gives us the
1 - ((4294967295/4294967296) ** N)

But because we're looking for the case where all *4* numbers have already come up, that probability needs to be raised to the 4th power ... which yields the final figure.
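Plugged into Perl (the helper name is hypothetical), with the number of vectors as a parameter:

```perl
## Odds that all sub-hashes of the (N+1)th pick land on already-set bits,
## per the corrected formula above.
sub false_positive_odds {
    my( $N, $vectors ) = @_;
    return ( 1 - ( 4294967295 / 4294967296 ) ** $N ) ** $vectors;
}

printf "4 vectors,  N=1e9: %.6g\n", false_positive_odds( 1e9, 4 );
printf "10 vectors, N=1e9: %.6g\n", false_positive_odds( 1e9, 10 );
```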

Cheers,
Rob

Thanks syphilis. Your calculations make sense to me. But I'm not sure that it gels with the actual data?

Assuming I've coded your formula correctly (maybe not!), then using 10 hashes & vectors, I get the odds of having seen a dup after 1e9 inserts as (1 - ((4294967295/4294967296)**1e9) ) **10 := 0.00000014949378123.

By that point I had actually seen 13 collisions.

And looking at the figure for 4e9 := 0.00667569553892502, by which time the 10 vectors will be almost fully populated, it looks way too low to me?

I would have expected that calculation (for N=4e9) to have yielded odds of almost 1?


Re: [OT] The statistics of hashing. (birthday)
by tye (Sage) on Apr 01, 2012 at 04:03 UTC

Although this sounds similar to the birthday problem, it is mostly much simpler (and partly much harder).

The odds of a collision at each step are quite simply $setBits/2**32. So the odds of having 4 collisions in a row are prod(@setBits)/2**(4*32). So, if after inserting 100 items, your 2**32-bit vectors have, respectively, 99, 100, 100, and 98 bits set, then the odds of a 4-fold collision are 99*100*100*98/2**(32*4) or about 1 in 1e31.
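As a sketch, with those example bit counts:

```perl
## Odds of a 4-fold collision on the next insert, given the number of
## bits currently set in each of the four 2**32-bit vectors.
my @setBits = ( 99, 100, 100, 98 );
my $odds = 1;
$odds *= $_ / 2**32 for @setBits;

printf "%.3g\n", $odds;    ## on the order of 1e-31
```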

Now, the odds of there being 100 bits set after setting 100 of the bits, that is the same as the birthday problem. So, if you wanted to predict the likely odds of a collision after inserting, say, 1e4 strings, then you would first compute the odds of the full 1e4 different bits being set. The birthday problem tells us that the odds are about 98.84% that all 1e4 bits would be set in a given vector. Or that there is a 1.16% chance of at least one collision (in a given vector).

```Bits Collision
set  odds (2**32 bits)
---  ---------
1e3    0.01%
1e4    1.16%
1e5   68.8%
1e6  100% (to about 50 digits of accuracy)

As calculated using Math::BigApprox:

```use Math::BigApprox 'c';
my $bits = 2**32;
my $inserts = 1e5;
print 1-(c($bits-$inserts)^c($bits-1))/c($bits)**$inserts, $/;

So, when you insert 1e6 strings, you are not going to end up with 1e6 bits set. But getting a precise value for "expected number of bits set" is rather complex. But I wouldn't bother. I'd just count the number of bits that are actually set as you go along.

Then, when one goes to insert the next string, just calculate $b1*$b2*$b3*$b4/2**(32*4) in order to get the odds of a false positive on that insertion.

But inserting about 1e9 strings into 2**32-bit vectors means that about 23% of the bits are going to be set. So, your 1e9th insert is going to have about a 0.28% chance of a false positive.

If you want to calculate the odds of there having been at least one false positive across the entire sequence of 1e6 inserts, then you'd compute 1-prod(@odds), where @odds contains the odds of no false positive at each step. So $odds[$i] would be 1 - $b1*$b2*$b3*$b4/2**(32*4). So you can calculate it that way, by just keeping a running product of that value.

Note, however, that prod(@odds) will stay exactly 1.0 when using floating point until you get to the point that $b1*$b2*$b3*$b4/2**(32*4) creeps above about 1/2**$mantissaBits (which won't happen until about 1/2 million inserts for a typical 'double'). I'm not sure how much of an impact that will have on the accuracy when getting to the only interesting odds in the late stages. But it seems it might be quite significant.

But you can compensate for this by instead tracking 1-prod(@odds). That starts out simply as 1*1*1*1/2**(32*4) after the first insert. To effectively calculate 1-prod(@odds[0..$n]) when given 1-prod(@odds[0..$n-1]) and $odds[$n] isn't hard. It is just $prod += $odds-$prod*$odds (and that last term doesn't matter for a very long time). Just to be clear, $odds is just $b1*$b2*$b3*$b4/2**(32*4) where $b1 is the number of bits set in the first vector.
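A sketch of that running calculation, assuming (as an upper bound) that every insert sets a previously unset bit in every vector:

```perl
## Track $prod = 1 - prod(@odds) incrementally: the odds of having had
## at least one false positive so far.  Assumes $b1==$b2==$b3==$b4==$n,
## i.e. no single-bit collisions yet (an upper bound).
my $prod = 0;
for my $n ( 1 .. 100_000 ) {
    my $odds = ( $n / 2**32 ) ** 4;    ## $b1*$b2*$b3*$b4 / 2**(32*4)
    $prod += $odds - $prod * $odds;
}

printf "%.2g\n", $prod;    ## about 6e-15 after 1e5 inserts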

So, doing that calculation as you go will give you the odds of having had at least one false positive so far. After 1e5 inserts it will only be about 6e-15, depending on how many bit collisions you run into by then.

You can also get an upper bound on it for however many strings you have by just running through the 1e9+ calculations assuming no single-bit collisions ever happen. I won't wait for that to finish running here.

But it turns out that I don't have to:

```Odds  Inserts
----  -------
1%  ~28e6
10%  ~45e6
50%  ~65e6
90%  ~83e6
99%  ~96e6

- tye

```Odds  Inserts
----  -------
1%  ~28e6
10%  ~45e6
50%  ~65e6
90%  ~83e6
99%  ~96e6

By those values, the odds of not having seen a duplicate by the time you reached 100 million inserts are so low as to be a pretty damn good definition of 'impossible'.

And yet, empirically, none had been seen by the time I reached 779,967,210.

And after 1.5 billion inserts, that calculation suggests that the odds of finding a value that doesn't match would be minuscule, and the "possible dups" count should be growing at almost the same rate as the new inserts are being tested.

The reality is that I've only had 1323 collisions after 1.5 billion inserts. These are the last 8:

```...
1585072919
1585138596
1585355107
1585418593
1585422956
1585468018
1585558107
1585577984

They are coming much more regularly now, but they are actually still within the bounds of usability.


You appear to have not fully understood several parts of what I wrote. (Which is not an insult.)

And after 1.5 billion inserts, that calculation suggests that the odds of finding a value that doesn't match would be minuscule, and the "possible dups" count should be growing at almost the same rate as the new inserts are being tested.

That actually is not a valid conclusion from anything I wrote.

The odds of the next insert being a collision after about 1.5e9 inserts are about (1.5e9/2**32)**4, or under 1.5%, and 98.5% is not "minuscule" (and about 1/100th as fast is not "almost the same rate"). But that is also ignoring that there won't actually be 1.5e9 bits set. But you make no mention of how many bits are set, which is how you'd get an accurate calculation, so I'm curious if you perhaps didn't fully appreciate that point (as opposed to some other reason, like it perhaps not being easy to query the number of bits set in your current run, etc.).

Probably much more important is that in some replies you hint that you are using 10 hashes not 4 hashes now.

I suspect that is the case in the numbers you show in your reply above, because you don't show collisions happening on the order of every 1e2 inserts. With 10 hashes, the above odds go to around 2.7e-5. And that is in the ballpark of what you report seeing.
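The arithmetic behind those two figures (assuming, as an upper bound, 1.5e9 bits set per vector):

```perl
my $density = 1.5e9 / 2**32;    ## upper bound on fraction of bits set

printf "4 hashes:  %.2g\n", $density ** 4;     ## under 1.5%
printf "10 hashes: %.2g\n", $density ** 10;    ## around 2.7e-5
```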

If you need some help redoing the calculations for 10 hashes instead of 4, then please speak up. I was not exhaustive in my explanations nor expansions of formulae, so I would expect more clarification could be required. I can also understand you not taking the time to seriously study what I wrote yet given how the numbers seem so far off from your observations.

- tye

Re: [OT] The statistics of hashing.
by flexvault (Monsignor) on Apr 01, 2012 at 13:52 UTC

This is an observation from sometime in 2005-2008. My handle on PM is because of working on this project at the time I joined, but I have little to do with the project now. It was a project to backup Terabytes of encrypted files to geographically distant off-site computer centers using 'rsync' in close to 'real-time'. The client's goal was that 'rsync' not work with the real files, but only with encrypted files, which sort of negates the value of using 'rsync'. While we looked at several methods for getting a value for naming the encrypted files, we ran tests on several different sets of 1 million encrypted files, and the following table is an average.

```  Technique          Duplicates

MD5                5,701
crc32             26,323
SimCRC64              17       Simple 8 byte ^ of file (very fast)
I can't say these results would be the same anywhere else, but when I started testing, I was sure that MD5 would be the most unique, and that's why we ran the tests several times with different file sets. I don't know if the encryption algorithm affected the results, and don't know if the unencrypted files would give similar results or not. I just didn't expect the results!

I only bring this up because I think you started out with the same assumption about MD5, but maybe there is a better starting point and maybe not.

Thank you

"Well done is better than well said." - Benjamin Franklin

I only bring this up because I think you started out with the same assumption about MD5, ...

That is a very good point.

The distribution of any (perfect(*)) hashing algorithm will only be fully utilised if the inputs can take on all possible values for any given input length.

For example, in ASCII or ISO text, certain parts of the input range -- many of the control characters <32, and characters >126 -- will not appear in the inputs, so for any given length, the range of inputs to the hashing algorithm is biased. It stands to reason that the outputs will be similarly biased, and the full range of the hash values will not be produced.

It is also the case that for any given input language, there are some character combinations that will never appear in normal text -- eg. ZXZXZXZXZXZXZXZXZXZ -- hence the inputs, and thus the outputs, will be further biased.

For this particular part of the exercise, I'm more concerned with the collision detection mechanism than the vagaries of the hashing.

But thank you for raising the issue. I may well contact you later for the details of your SimCRC64 algorithm.

(*)Note: I am aware that MD5 is probably far from perfect.


BrowserUk,

...ASCII or ISO text, certain parts of the input range -- many of the control characters <32; and characters >126 -- will not appear...
I hadn't thought about it but since the files were encrypted, more character sequences might appear.

As far as the 'SimCRC64' algorithm, it was just a simple CRC routine. I don't have the actual code, but it was something like this.

```
my $crc = crc64(\$output);           ## Add crc check to output data

sub crc64                            ## Add crc check to output data
{   my $data = shift; my $crc = ""; my $len = length($$data); my $pos = 0;
    while ( $pos < $len )
    {   $crc = $crc ^ ( substr($$data,$pos,8) );
        $pos += 8;
    }
    return ( $crc );
}

It was part of the test to see what the worst case would be. As I previously said, I didn't expect the results.

Thank you


Re: [OT] The statistics of hashing.
by moritz (Cardinal) on Apr 01, 2012 at 10:44 UTC

I'm not sure I understood the question correctly, but I hope my answer is still helpful for you.

Further to that, if the number of sub hashes and vectors is increased to 10, by permuting each of the four sub hashes against the other 3 and XORing, does that substantially decrease the odds of false matches? (Or increase them? Or leave them the same?)

Statistically, the only thing that matters for the probability of collisions is the total entropy of the hash. If you create a 128-bit hash over a string, you have 128 bits of entropy. If you create two 128-bit hashes, one over the first part and one over the second part of the string, you have 256 bits of entropy. If you XOR the two hashes and use the result, you have 128 bits again.

If you use one 128-bit hash, the probability of getting no collision on inserting the nth string is

```P(no_collision, n) = (1 - 1/2**128) ** n

I'm pretty sure that the number of collisions follows (at least approximately) a Poisson distribution, though I haven't yet figured out how to calculate the expectation value.

Hm. Attempting to calculate 1 - 1/2**128 using my calculator just returns 1.

So then I tried Math::BigFloat, and still got just 1. What Am I doing wrong?

```use Math::BigFloat;

$moritzFactor = Math::BigFloat->new( 1 )->bsub(
    Math::BigFloat->new( 1 )->bdiv(
        Math::BigFloat->new( 2 )->bpow( 128 )
    )
);

print $moritzFactor;;
1

 Perl> print $moritzFactor->bstr;;
1

 Perl> print $moritzFactor->bsstr;;
1e+0


You'd need a floating-point number with a mantissa of about 128 bits to hold the value of (1-1/2**128)

Luckily there are other ways to evaluate the expression. If we set x = 1/2**128, we can do a Taylor series to evaluate (1-x)**n around x = 0 (and since x is very small, it's a good approximation):

```(1 - x)^n = 1 - n * x + 1 / 2 * n * (n - 1) * x ** 2 + ...

Where higher orders can be neglected. If you choose an n that is high enough, you can get non-1 values as the result.
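For example, keeping only the first-order term, the collision probability 1 - (1-x)**n collapses to roughly n*x, which floating point handles fine:

```perl
## P(collision, n) = 1 - (1-x)**n  ≈  n * x  for small n*x
my $x = 2 ** -128;
my $p = 1e9 * $x;    ## after a billion inserts of a perfect 128-bit hash

printf "%.2g\n", $p;    ## still vanishingly small
```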

Though I'm not sure if that's really helpful. What you'd really need is the expectation value of the number of collisions (the lambda symbol in the Poisson distribution), then you could easily calculate everything you wanted.

Re: [OT] The statistics of hashing.
by roboticus (Chancellor) on Apr 02, 2012 at 13:57 UTC

That problem sure was fun! I've got a handle on it; I didn't do the math rigorously, but it agrees with my simulations. Anyway, here's what I think is going on:

1) The density of 1 bits is:

```  d = 1 - e**(-x/N)

2) The probability of the next item colliding is:

```  p(hit) = (1 - e**(-x/N))**h

If that's true, then I expect that the number of hits is just the integral of p(hit) over x in (0..records). (I haven't actually done the integration, but I don't foresee any hitches.) I found an on-line symbolic integrator that you can use to determine it.
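That integral is easy to check numerically (the helper name here is hypothetical; trapezoid rule over x in (0..records), with N = bits per vector and h = number of vectors):

```perl
## Expected false positives: integral of (1 - exp(-x/$N))**$h
## over x in (0 .. $records), via the trapezoid rule.
sub expected_false_positives {
    my( $N, $h, $records, $steps ) = @_;
    $steps ||= 10_000;
    my $dx  = $records / $steps;
    my $sum = 0;
    for my $i ( 0 .. $steps ) {
        my $w = ( $i == 0 || $i == $steps ) ? 0.5 : 1;
        $sum += $w * ( 1 - exp( -$i * $dx / $N ) ) ** $h;
    }
    return $sum * $dx;
}

## For N=2**14, h=3, 16384 records this lands close to the ~1374.7
## "expected false pos" figure in the simulator output.
printf "%.1f\n", expected_false_positives( 2**14, 3, 16384 );
```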

Anyway, if you want to play with it, here's the simulator code along with the sample output:

...roboticus

When your only tool is a hammer, all problems look like your thumb.

I'm done with the program now. I added a column to show the predicted number of false positives. The code is nearly unchanged. Basically, I moved:

```    my $expected = (1-exp(-$cnt_rex/$VecSize))**$num_vecs;

outside of the reporting section, and then accumulate it:

```    $cum_collisions += $expected;

It's not very fast, so I only ran it a couple of times, but it seems like it gives reasonable agreement:

```$ ./analyze_bloom.pl 14 3 17
NumVecs=3
avg      avg   calc  --------- # collisions ---------            expected
BI  Samples   one bits pct 1s pct 1s   all       1       2       3  p(coll)   false pos
-- ---------- -------- ------ ------ ------- ------- ------- ------- ------------ -------------
14        256:     253   1.55   1.55       7       7       0       0 0.0000037264 0.0002414791
14        512:     504   3.08   3.08      22      22       0       0 0.0000291236 0.0037774699
14       1024:     993   6.06   6.06      89      87       2       0 0.0002224011 0.0581208314
14       2048:    1926  11.76  11.75     336     309      26       1 0.0016223627 0.8630370129
14       4096:    3637  22.20  22.12    1185    1005     170      10 0.0108230772 11.9418752510
14       8192:    6414  39.15  39.35    3986    2802    1022     162 0.0609161842 144.4751474482
14      16384:   10300  62.87  63.21   11216    5615    4166    1435 0.2525804578 1374.7071068207
14      32768:   14151  86.37  86.47   27361    7789   10655    8917 0.6464623148 8946.4018915720
14      65536:   16105  98.30  98.17   60117    8201   15657   36259 0.9460533270 36391.1792023071

$ ./analyze_bloom.pl 16 4 18
NumVecs=4
avg      avg   calc  ------------- # collisions -------------             expected
BI  Samples   one bits pct 1s pct 1s   all       1       2       3    4      p(coll)   false pos
-- ---------- -------- ------ ------ ------- ------- ------- ------- ------- ------------ ----------
16        256:     255   0.39   0.39       3       3       0       0      0 0.0000000002 0.0000000120
16        512:     510   0.78   0.78       5       5       0       0      0 0.0000000037 0.0000003784
16       1024:    1018   1.55   1.55      21      21       0       0      0 0.0000000578 0.0000119226
16       2048:    2021   3.08   3.08     107     106       1       0      0 0.0000008960 0.0003713063
16       4096:    3979   6.07   6.06     444     423      20       1      0 0.0000134746 0.0112771478
16       8192:    7749  11.82  11.75    1615    1466     141       8      0 0.0001906326 0.3256727521
16      16384:   14569  22.23  22.12    5843    4608    1066     158     11 0.0023940562 8.5228328538
16      32768:   25931  39.57  39.35   18293   11116    5475    1529    173 0.0239686508 185.0883611550
16      65536:   41690  63.61  63.21   49093   18714   17246   10354   2779 0.1596613002 2882.5123468408
16     131072:   56843  86.74  86.47  114278   21592   29488   36446  26752 0.5589731543 26626.3779623356

The updated code:

Both runs are still underway; I'll update this when they complete. Done.

Update: Added rest of the tables. As you can see from the two sample runs, there's good agreement between the actual false positive count and the expected count.

...roboticus


OK, I think I'm done with this diversion. I've built a program that will compute the expected number of collisions with no looping. Example output:

```$ ./ana_2.pl 16 4 65536
N=65536, V=4, X=65536 integral(65536)=139415.765849051, integral(0)=136533.333333333
Expected collisions: 2882.43251571807

$ ./ana_2.pl 16 4 16384
N=65536, V=4, X=16384 integral(16384)=136541.854969116, integral(0)=136533.333333333
Expected collisions: 8.52163578287582

$ ./ana_2.pl 14 3 16384
N=16384, V=3, X=16384 integral(16384)=31411.9141476821, integral(0)=30037.3333333333
Expected collisions: 1374.58081434877

$ ./ana_2.pl 16 10 32768
N=65536, V=10, X=32768 integral(32768)=191953.190301726, integral(0)=191952.863492063
Expected collisions: 0.326809662627056

The code is straightforward:

Notes:

1. The only testing I did is limited to the cases above. Be sure to run a few more before relying on it.
2. I entered the ex_1 and ex_2 routines, but didn't test 'em.
3. For the other numbers of vectors, just expand (1 - exp(-X/N))^V, and integrate it. The ex_1..n functions are coded straight from the integration.
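As an illustration of note 3, a generic generator for the ex_* family can be sketched as follows (make_ex is a hypothetical name, not from the thread). Expanding (1 - exp(-x/N))**V with the binomial theorem and integrating term by term gives integral = X + sum over k=1..V of (-1)**(k+1) * C(V,k) * (N/k) * exp(-k*X/N), whose coefficients are exactly the Pascal's Triangle numbers behind the hand-coded ex_10().

```perl
use strict; use warnings;

# Returns a closure computing the indefinite integral of (1 - exp(-x/N))**V
# evaluated at X, via the binomial expansion integrated term by term.
sub make_ex {
    my ($V) = @_;
    my @C = (1);                                       # row V of Pascal's Triangle
    push @C, $C[-1] * ($V - $_ + 1) / $_ for 1 .. $V;  # C(V,0) .. C(V,V)
    return sub {
        my ($N, $X) = @_;
        my $sum = $X;
        $sum += (-1)**($_ + 1) * $C[$_] * $N / $_ * exp( -$_ * $X / $N )
            for 1 .. $V;
        return $sum;
    };
}

# Reproduces the ana_2.pl '16 10 32768' example above:
my $ex10 = make_ex(10);
printf "%.12f\n", $ex10->( 2**16, 32768 ) - $ex10->( 2**16, 0 );
```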

...roboticus


Thank you roboticus.

I ran your algorithm against the results of my 60-hour run, and the correlation is simply stunning!

```#! perl -slw
use strict;

sub ex_10 {
    my( $N, $X ) = @_;
    return       -$N/10*exp(-10*$X/$N)
           +  10*$N/9 *exp( -9*$X/$N)
           -  45*$N/8 *exp( -8*$X/$N)
           + 120*$N/7 *exp( -7*$X/$N)
           -  35*$N   *exp( -6*$X/$N)
           + 252*$N/5 *exp( -5*$X/$N)
           - 105*$N/2 *exp( -4*$X/$N)
           +  40*$N   *exp( -3*$X/$N)
           -  45*$N/2 *exp( -2*$X/$N)
           +  10*$N   *exp( -1*$X/$N)
           + $X;
}

my $exp10_0 = ex_10( 2**32, 0 );

open I, '<', 'L25B32V10I32S1.posns' or die $!;

while( <I> ) {
    my( $inserts, $collisions ) = split;
    my $exp3 = ex_10( 2**32, $inserts ) - $exp10_0;
    printf "Insert  %10d: Predicted  %6d Actual: %6d Delta: %+.3f%%\n",
        $inserts, $exp3, $collisions, ( $collisions - $exp3 ) / $exp3 * 100;
}

__END__
C:\test>rbtcs-form-verify
Insert   779967210: Predicted       1 Actual:      1 Delta: -18.051%
Insert   782382025: Predicted       1 Actual:      2 Delta: +58.813%
Insert   830840395: Predicted       2 Actual:      3 Delta: +29.292%
Insert   882115420: Predicted       4 Actual:      4 Delta: -5.961%
Insert   883031614: Predicted       4 Actual:      5 Delta: +16.325%
Insert   897571477: Predicted       5 Actual:      6 Delta: +18.390%
Insert   923155269: Predicted       6 Actual:      7 Delta: +4.086%
Insert   948108745: Predicted       8 Actual:      8 Delta: -8.996%
Insert   954455113: Predicted       9 Actual:      9 Delta: -4.244%
Insert   967783959: Predicted      10 Actual:     10 Delta: -7.404%
Insert   988381482: Predicted      13 Actual:     11 Delta: -17.487%
Insert   992691311: Predicted      13 Actual:     12 Delta: -13.814%
Insert   995935158: Predicted      14 Actual:     13 Delta: -9.624%
Insert  1011301141: Predicted      16 Actual:     14 Delta: -16.457%
Insert  1013742872: Predicted      17 Actual:     15 Delta: -12.616%
Insert  1022258193: Predicted      18 Actual:     16 Delta: -14.242%
Insert  1031874023: Predicted      20 Actual:     17 Delta: -16.989%
Insert  1034026887: Predicted      20 Actual:     18 Delta: -13.909%
Insert  1036254774: Predicted      21 Actual:     19 Delta: -11.051%
Insert  1037064360: Predicted      21 Actual:     20 Delta: -7.093%
Insert  1037193945: Predicted      21 Actual:     21 Delta: -2.569%
Insert  1037309710: Predicted      21 Actual:     22 Delta: +1.957%
...( ~ 52000 elided )
Insert  2375752842: Predicted   52209 Actual:  52277 Delta: +0.130%
Insert  2375755671: Predicted   52209 Actual:  52278 Delta: +0.131%
Insert  2375756509: Predicted   52209 Actual:  52279 Delta: +0.133%
Insert  2375760656: Predicted   52210 Actual:  52280 Delta: +0.133%
Insert  2375763928: Predicted   52211 Actual:  52281 Delta: +0.134%
Insert  2375785238: Predicted   52215 Actual:  52282 Delta: +0.128%
Insert  2375788721: Predicted   52215 Actual:  52283 Delta: +0.128%
Insert  2375789878: Predicted   52216 Actual:  52284 Delta: +0.130%
Insert  2375790896: Predicted   52216 Actual:  52285 Delta: +0.131%
Insert  2375798283: Predicted   52217 Actual:  52286 Delta: +0.131%

And that still includes the possibility -- though I believe it to be remote -- that there are 1 or more actual dups in there.

My only regret now is that I wish I'd allowed the run to go to the 3/4 point instead of stopping it half way. The number of false positives would still have been easily manageable for the second pass:

```C:\test>rbtcs-bloom-probb 32 10 30e8
N=4294967296, V=10, X=30e8 integral(30e8)=12580199467.653, integral(0)=12579822861.8159
Expected collisions: 376605.837186813

That is an order of magnitude less than my own crude attempt was predicting, which is why I stopped the run.

The only way I can repay you is to assure you that I will do my very best to try and understand the formula -- which I do not currently.

And of course, give you credit, when the world comes knocking at my door for my eminently patentable -- at least if you take Apple as your guide -- algorithm :)

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

With the 'cancelling out' you've performed on the constants in your ex_*() subs, I couldn't see the pattern by which they were derived.

Undoing that, I now see that they come (directly or not) from Pascal's Triangle.

I guess at some point I should review some teaching materials to understand why those constants are used here, but for now it is enough to know that I can now write a generic ex_*() function (generator). Thanks.


Oh, something I forgot to mention: I tried using a constant number of bits but varying the vector size/quantity to see how things scaled. In other words, I compared:

vec size   # vectors
  10,000        1
   5,000        2
   3,333        3
   2,500        4
   2,000        5
   1,000       10

I found that more, smaller vectors work better until the number of samples approaches the number of bits in the smallest vector. Plotting the functions:

```(1-exp(-x/1000))^10
(1-exp(-x/2000))^5
(1-exp(-x/2500))^4
(1-exp(-x/3333))^3
1-exp(-x/10000)

using a graphing calculator shows that's where the curves cross.
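The same crossover can be seen without a graphing calculator by tabulating the five expressions at a few sample points (the x values below are purely illustrative):

```perl
use strict; use warnings;

# p(next item collides) for each (slots-per-vector, vector-count) split of
# the same 10,000 total bits.
my @cfg = ( [1000,10], [2000,5], [2500,4], [3333,3], [10000,1] );

for my $x ( 100, 1000, 5000 ) {
    my @p = map { (1 - exp( -$x / $_->[0] ))**$_->[1] } @cfg;
    printf "x=%5d:  %s\n", $x,
        join '  ', map { sprintf "%d x%-2d -> %.2e", @{$cfg[$_]}, $p[$_] } 0 .. $#cfg;
}
```

Below the crossover the 10-vector split wins by many orders of magnitude; well above it, the single large vector has the lower false-positive odds.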

...roboticus


I should review my transcendental function definitions as power series so I can recognize these correlations in future. I knew how to measure the odds precisely, but had no clue how to determine whether there was any way to collapse the calculation to some transcendental function (the exact measurement is quite impractical for very large numbers of inserts).

But having 1-exp(-$inserts/$slots) shown to me, I was able to play with that approximation to see how accurate it is. For this application, it is actually an impressive fit.

If 1-exp(-1/$slots) starts out very close to 1/$slots (the correct starting value for the odds of a single-bit collision), then the accuracy never decreases as $inserts grows. For small $slots, 1-exp(-1/$slots) is not very accurate, but 1-exp(-$inserts/$slots) slowly becomes more and more accurate as $inserts increases.

For $slots == 2**32, 1-exp(-$inserts/$slots) starts out (at 1==$inserts) so accurate that a 'double' can't even measure the error. So you can consider the calculations based on it "perfect" rather than an approximation. :)
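A quick sketch to check that accuracy claim (rel_err is a hypothetical helper; the bit sizes are arbitrary):

```perl
use strict; use warnings;

# Relative error of the 1-exp(-1/$slots) approximation against the exact
# first-insert collision odds of 1/$slots.
sub rel_err {
    my ($slots) = @_;
    my $exact  = 1 / $slots;
    my $approx = 1 - exp( -1 / $slots );
    return ( $exact - $approx ) / $exact;
}

for my $bits ( 8, 16, 32 ) {
    printf "slots=2**%-2d relative error: %.3e\n", $bits, rel_err( 2**$bits );
}
```

For $slots == 2**32 the printed figure is dominated by double-precision rounding noise: the true relative error (about 1.2e-10) is smaller than the granularity with which 1-exp(-1/$slots) can be computed in doubles, which is exactly the point made above.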

- tye

Re: [OT] The statistics of hashing.
by sundialsvc4 (Abbot) on Apr 02, 2012 at 13:33 UTC

What is the time interval that you have available to come up with a solution given “billions of” records?

If the order of the records cannot be controlled, then a meaningful result can only be obtained by processing the entire dataset, which is not possible. Therefore, I truly see no alternative but that of performing multiple passes through the data, processing a mutually exclusive subset of the data each time.

The results you describe in reply #2 seem to point to a reasonable compromise solution: if 300K records produce 61 maybes, and if (say...) 3 million records at a time produce a few thousand, you have reduced the problem for that pass to a serviceable level. However, as the algorithm begins to become saturated, the ink-blots start to run together and you're left with a black piece of paper and not a picture. This will force you into a possibly large number of passes, which could be processed in parallel, say on a cluster box.

You say that "a disk based solution isn't feasible," but a fair amount of that is going to happen regardless. The information content of any hash key, however derived, is considerably less than that of the record, and the mere fact that you haven't yet seen what appears to be a duplicate of a particular record does not usefully predict that you won't stumble into one with the very next record not yet seen. There will unfortunately be an extremely narrow "sweet spot" in which the discriminating power of the algorithm will be usefully preserved.

You say that "a disk based solution isn't feasible,"

Using my hashing/vectors algorithm, my script processes 10,000 strings/second. To process 2**32 strings (the absolute limit of the hashes) would require 2**32 / 10000 / 3600 = 119 hours. I.e. 0.0001s/string.

Using any sort of disk-based mechanism -- disk-based hash, b-tree, etc. -- would require at least 1 read and 1 write per string. Using the fastest (conventional) disks available, that means adding at least 0.004 seconds for the read and 0.005 seconds for the write to the processing time of every string.

So, that's (0.0001 + 0.004 + 0.005) = 0.0091 seconds/string, and 0.0091 * 2**32 / 3600 = 10857 hrs or 452 days. A little under 1 1/4 years.

And the reality is that it will likely require at least 10 reads & 10 writes for every record. I'll leave that math to you.

but a fair amount of that is going to happen regardless.

When I stopped the process this morning, it had been running just under 62.5 hours.

In that time it had processed 2,375,798,283 strings and detected just 52,286 possible dups.

The cost of writing those 52,000 strings/positions to disk during the 62.5 hours of runtime is negligible; unmeasurable. Lost in the noise of processor variability due to variations in mains voltage. Zilch.

Running a second pass to verify those 52,000 possible dups as false or real will require less time as a simple hash lookup can be used, so there is no need to maintain the vectors.

In any case, I'll trade 125 hours of runtime, for 1 1/4 years, every day of the week.


Obviously, you assumed that I was just making “my usual suggestion” when I am not. Your algorithm is quite plainly at least the beginnings of the only feasible solution in this case ... especially given that the actual number of duplicates is known to be negligible (0.0022% ...). If the data volumes are in the 2-billion range and you are certain that you could only be committing a Type-2 error with no possibility of Type-1, then you have solved the problem right now. You produced a highly sensitive first-pass filter. 99.98% of the records have been excluded. That's as solved as you can get ... and what you already know about the expected distribution (that it is overwhelmingly likely that the records are unique) makes it so.

In effect your algorithm is to calculate a digital digest of the entire record and then to look for possible collisions among those digests. The discriminating ability of the algorithm will be that of MD5 and nothing more or less. Taking the MD5 result and using the left/right-most n bits would be statistically stronger than applying a separate reducing calculation to the entire result. MD5 and its brethren scatter entropy fairly uniformly across all of the bits at once, as they are designed to do. The distribution of values ought to be very flat, and any subset of the total bits in the vector, if not combined with others, will be equally flattened and therefore equally conducive to your hash.

Node Type: perlquestion [id://962802]
Approved by GrandFather
Front-paged by Old_Gray_Bear