in reply to [OT] The statistics of hashing.
What is the probability of getting false positives?
Seems almost certain to me, but my reasoning may well be faulty. Stats were never my forte.
You're effectively running four different 32-bit hash functions on each string. According to Wikipedia for a 32-bit hash you only need about 110,000 hashes to get a 75% chance of a collision.
Thus with 110,000 strings, getting a collision in all four stands at 75%**4, about 30% chance. That's with 110,000 strings. You have "several billion".
With several billion strings you are going to have a very close to 100% chance of a collision on a 32-bit hash function. With four 32-bit hash functions, that's still close to 100% chance of a collision.
Extending this to having 10 effective hash functions, this cannot increase the chance of collisions. Worst case scenario it has no effect. Here I think it does decrease the chance (difficult , but not enough to be useful.
|
---|
Replies are listed 'Best First'. | |
---|---|
Re^2: The statistics of hashing.
by BrowserUk (Patriarch) on Mar 31, 2012 at 23:26 UTC | |
Seems almost certain to me, That's a given. As with bloom filters, false positives are a fact of life. The question is how likely, or more to the point, how frequently, they will occur. According to Wikipedia for a 32-bit hash you only need about 110,000 hashes to get a 75% chance of a collision. Unfortunately, a good deal of what you read on wikipedia is less than reliable. The following is empirical evidence from an on-going experiment. After 3/4 of a billion trials, there were zero (possible) dups. After 1.2 billion, there are only 61. These are the positions at which those possible dups were detected:
As you can see, having filled 1/4 quarter of the total space available, false positives are just beginning to occur. And as the vector space fills, they are (as expected) becoming more frequent. (Of course, some of those may be actual duplicates. It will take a second run to determine that.) But the question remains, how to calculate the probabilities of the mechanism. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] |
by tobyink (Canon) on Apr 01, 2012 at 09:13 UTC | |
Indeed, but the errors on Wikipedia are not evenly distributed. On a subject such as the birthday attack I'd expect Wikipedia's article to be on par with any other authority (short-lived events of vandalism notwithstanding).
The exact calculation involves some big numbers. But assuming that c($x, $y) is the "pick $y from $x" function used in combinatorics, then the probability of a collision for $n strings and an evenly distributed 32-bit hash function should be:
Big numbers. Horrible to calculate. Can be approximated though...
Calculating $p is still horrible, but calculating $t is easier. If $t is above 20 then $p is 1.00000 when rounded to 6 significant figures. Thus you can effectively be sure to have a collision with a 32-bit hash function once $t is above 20. You can figure out an $n which triggers $t to be 20 using:
It's about 414,000. So with 414,000 strings, you are effectively certain to get collision on a 32-bit hash function. Where I think my reasoning and tye's differ (and tye is almost certainly correct here - blame it on me answering late at night) is that I was then looking at the probabilities that you will have had collisions in all four (or ten) hash functions at the end of the entire run. With even half a million strings, that is a given. What you're actually doing is looking at events where a single string triggers a simultaneous collision in all the hash functions. I defer to tye's calculations for that.
perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'
| [reply] [d/l] [select] |
by BrowserUk (Patriarch) on Apr 01, 2012 at 09:34 UTC | |
Indeed, but the errors on Wikipedia are not evenly distributed. On a subject such as the birthday attack I'd expect Wikipedia's article to be on par with any other authority I think the first mistake that you (and Tye) are making here, is assuming that "the birthday paradox" (as applied to the each of the 32-bit hashes individually) is the controlling factor. It isn't. The controlling factor is the combinatorial affect of the 4 (or, in my case 10) separate hashes & vectors. I think that applying the birthday paradox to the 128-bit hash may prove to be an accurate upper bound. But it is so large relative to the limits of the 32-bit vectors as to never actually come into play. I had tried applying various formulae I found on wikipedia and elsewhere, but found they do not match with reality. Hence, this post. You might also look at the math described for Bloom filters. It also contradicts these naive applications of the birthday paradox, but is sufficiently well researched and documented (sources outside of wikipedia) to show that naive application of the birthday paradox is not applicable when combining multiple hashes in this way. The problem with the Bloom filter math is that it is based upon setting the bits derived from all the different hashes into the same vector. Ie. Each Bloom filter insert sets N bits in a single vector, where N is the number of different hashes being used. This obviously causes the vector to fill up much more quickly than my mechanism which uses a separate vector for each hash. My gut feel, is that as my vectors fill up N times more slowly, the probabilities of collisions for my mechanism are 2**N times lower than with a Bloom filter (at the expense of N*the memory requirement). This is the notion I am trying to verify here. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] |
Re^2: The statistics of hashing. (**4)
by tye (Sage) on Apr 01, 2012 at 07:54 UTC | |
Wikipedia is correct (well, agrees quite closely with some of my calculations). But you have extrapolated from that quite incorrectly. 0.75**4 is the odds of, after inserting 1.1e5 strings, that there has been at least one collision in each of the four hashes. It is not even close to the odds of there being a single string that produced a simultaneous collision in all 4 hashes. Those odds would be more like 0.75*(0.75/1.1e5)**3 or about 2e-16. My calculations came up with an upper bound of 9.5e-15. I believe the 9.5e-15 number is much, much more accurate. It only discounts the possibility of there being fewer than 1.1e5 bits set. I suspect the ratio of "expected bits set"/1.1e5 to be very close to 1. While the 2e-16 number is based on there being a 75% chance of exactly one collision. But there is a 25% chance of 0 collisions and a 75% chance of 1 or more collisions. And the possibility of 2 collisions approximately doubles the odds for some subset of the possible outcomes and so has a much bigger impact on the final total than does the possibility of there being 1.1e5-1 bits set instead of 1.1e5. For example, consider tossing a coin twice. There is a 75% chance of getting at least one "head". Do 4 such trials. For each head, assign it a random position in 1.1e5 slots (but only one head per slot per trial). The odds of the 4 trials producing 4 heads in one of those slots (one from each trial) one might estimate as:
Where x==1.1e5. Or, 2.3e-16. But to find the accurate odds you have to realize that there is a 50% chance of one "head" (per trial) and a 25% change two "heads" (per trial). So the actual odds are found via:
Where x1==1.1e5-1 and x12==1/(1.1e5-1)+1/(1.1e5-2). That raises the odds by a factor of 2.37 to 5.6e-16. That type of magnification of the odds will be much more pronounced in the case originally discussed and I bet would result in nearly the 48-fold increase required to get to the odds of my upper bound calculation. - tye | [reply] [d/l] [select] |
Re^2: The statistics of hashing.
by JavaFan (Canon) on Apr 01, 2012 at 11:17 UTC | |
Thus with 110,000 strings, getting a collision in all four stands at 75%**4, about 30% chance. That's with 110,000 strings. You have "several billion".If, after 110,000 strings you get a 75% on a collision, 75%4 gives you the chance 4 sets of 110,000 each give you a collision. But that's nowhere near the probability the same pair gives you a collision. Which is what BrowserUK needs. It would take me too much time to dig up the math required to do the calculation. But what BrowserUK may try to do: generate a billion strings that are all unique. Run the proposed algorithm. See how many collisions are reported: they're all false positives. Repeat several times (with different strings). You may even want to calculate a confidence interval. Of course, I've no idea how feasible this is. If it takes a week to process a data set this size, you may not want to do so. | [reply] |