in reply to Re^4: [OT] The statistics of hashing. (4<10)
in thread [OT] The statistics of hashing.
If you could distill your writeup to a formula, that would be helpful.
I already did that for the most useful and most accurate calculation (as well as for other parts):
It is just $prod += $odds$prod*$odds (and that last term doesn't matter for a very long time). Just to be clear, $odds is just $b1*$b2*$b3*$b4/2**(32*4) where $b1 is the number of bits set in the first vector.
Looking at what is not explicit there, the only thing I find moderately difficult to determine is that you can start with $prod initialized to 0. The value being calculated is the probability (a value between 0 and 1) that there has been at least one 4hash collision (due to random chance) so far.
The upper bound on this value that I precalculated by discounting the impact of singlebit collisions is achieved by simply assuming $b1 through $b4 are all equal to the number of insertions done so far.
This is substantially accurate for quite a while. Making it more accurate is seriously difficult other than by computing runspecific odds as is done by the two formulae I quoted above.
I find it difficult to predict how tightly this upper bound is likely to fit against some runspecific odds when the number of inserts is a significant fraction of the number of bits per hash, as you are trying to do.
Your 10 hashvalues via XOR is not something that I can predict the results of, other than that I expect it to lie somewhere between the two very remote extremes of "no better than 4 independent hash values" and "just as good as 10 fully independent hash values". Your results so far seem to be somewhat close to what I would expect if you have 10 fully independent hash values per string.
So my calculations completely ignored the "10 hashes" addition to portion of your description as I see no way to accurately address it.
If I wanted to gain additional buckets, then I would compute the additional hash values using a (moderately) cryptographically sound process. For example, you could compute md5($string) and md5('BrowserUk'.$string) to get 8 hashes that should be substantially independent (based on prior difficult analysis of the MD5 algorithm). Though, your experiment may prove that your strategy is sufficient.
 tye


Replies are listed 'Best First'.  

Re^6: [OT] The statistics of hashing. (formula)
by BrowserUk (Pope) on Apr 01, 2012 at 21:56 UTC  
by tye (Sage) on Apr 01, 2012 at 22:17 UTC 