The more URLs you hash, the higher the probability for a collision is.
This is true for all hashing algorithm.
I personally would take the (relative small) risk for less than 10^4 hashes, with more hashes I'd rely on something else or implement a scheme that makes hash collisions non-fatal if I had to use hashes.
| [reply] [Watch: Dir/Any] |
It produces an output of fixed length for an input of arbitrary length. So there is an infinit set of possible inputs mapped to a finit set of checksums - so the algorithm can't produce a unique checksum for every url you feed it. I suggest you read the article on wikipedia and then have a look at SHA1 as possibly the better solution - that nonetheless will not be bijective either (It will produce collisions!) - it depends on your problem at hand if this is a hindrance.
regards,
tomte
An intellectual is someone whose mind watches itself. -- Albert Camus
| [reply] [Watch: Dir/Any] [d/l] |
Currently I am using MD5 as digest, but with lots of urls the data structure is growing big.
So I thought about reducing the bits per url and using adler32 instead.
BTW: I am implementing a url-seen structure here and need the hash to check against, while minimizing false positives/negatives.
| [reply] [Watch: Dir/Any] |
Looking up Adler32 brought the following to light:
Weakness
Jonathan Stone discovered in 2001 that Adler-32 has a weakness for very short messages. He wrote "Briefly, the problem is that, for very short packets, Adler32 is guaranteed to give poor coverage of the available bits. Don't take my word for it, ask Mark Adler. :-)" The problem is that sum A does not wrap for short messages. The maximum value of A for a 128-byte message is 32640, which is below the value 65521 used by the modulo operation. An extended explanation can be found in RFC 3309, which mandates the use of CRC32 instead of Adler-32 for SCTP, the Stream Control Transmission Protocol.
As most urls are considerably shorter than 128 bytes, that suggests that Adler32 is not a good choice for your application and that you would better consider the CRC32 algorithm.
Whichever algorithm you use there is going to be some chance of false positives. However, you can reduce these chances by a considerable margin by using two different hashing algorithms. If you were to calculate and store both the CRC32 and the Alder32 for each url, the odds of a simultaneous collision for both hashes for any given pair of urls is vastly reduced.
Of course that means storing twice as much information which is a part of your original problem. However, there is a way of storing both sets of hash data such that it requires minimal memory (10kb or so) whilst giving almost the same lookup performance (15 microsecs/lookup compared to 5 microsecs) as Perl's hashes.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [Watch: Dir/Any] |
How many url's are you hoping to deal with?
How are you storing your checksums for lookup? In a hash?
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [Watch: Dir/Any] |
A final word. Don't use Alder32 (alone) for this purpose!
Running Adler, CRC32 and both on several sets of 1 million randomly generated url-like strings ranging from 16 to 128 characters in length, Adler produced duplicates in ~1% of cases; CRC32 produced ~0.2%; and in several runs the combination of both found just 2 duplicates (circa. 0.002% but not enough samples to be judged representative).
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [Watch: Dir/Any] |