http://qs321.pair.com?node_id=618494

isync has asked for the wisdom of the Perl Monks concerning the following question:

Hi there!

Are there any caveats in hashing http://www.somesite.com/ urls with with the adler-32 checksum algorithm?

Can I expect a unique hash for each unique url - in other words: will adler32 generate checksums that correspond to just this specific url or will it (in few cases) generate the same checksum for two different urls?

Thanks!

Replies are listed 'Best First'.
Re: Hashing urls with Adler32
by moritz (Cardinal) on May 31, 2007 at 14:39 UTC
    The more URLs you hash, the higher the probability for a collision is.

    This is true for all hashing algorithm.

    I personally would take the (relative small) risk for less than 10^4 hashes, with more hashes I'd rely on something else or implement a scheme that makes hash collisions non-fatal if I had to use hashes.

Re: Hashing urls with Adler32
by Tomte (Priest) on May 31, 2007 at 14:40 UTC

    It produces an output of fixed length for an input of arbitrary length. So there is an infinit set of possible inputs mapped to a finit set of checksums - so the algorithm can't produce a unique checksum for every url you feed it. I suggest you read the article on wikipedia and then have a look at SHA1 as possibly the better solution - that nonetheless will not be bijective either (It will produce collisions!) - it depends on your problem at hand if this is a hindrance.

    regards,
    tomte


    An intellectual is someone whose mind watches itself.
    -- Albert Camus

      Currently I am using MD5 as digest, but with lots of urls the data structure is growing big.

      So I thought about reducing the bits per url and using adler32 instead.

      BTW: I am implementing a url-seen structure here and need the hash to check against, while minimizing false positives/negatives.
Re: Hashing urls with Adler32
by BrowserUk (Patriarch) on May 31, 2007 at 18:17 UTC

    Looking up Adler32 brought the following to light:

    Weakness

    Jonathan Stone discovered in 2001 that Adler-32 has a weakness for very short messages. He wrote "Briefly, the problem is that, for very short packets, Adler32 is guaranteed to give poor coverage of the available bits. Don't take my word for it, ask Mark Adler. :-)" The problem is that sum A does not wrap for short messages. The maximum value of A for a 128-byte message is 32640, which is below the value 65521 used by the modulo operation. An extended explanation can be found in RFC 3309, which mandates the use of CRC32 instead of Adler-32 for SCTP, the Stream Control Transmission Protocol.

    As most urls are considerably shorter than 128 bytes, that suggests that Adler32 is not a good choice for your application and that you would better consider the CRC32 algorithm.

    Whichever algorithm you use there is going to be some chance of false positives. However, you can reduce these chances by a considerable margin by using two different hashing algorithms. If you were to calculate and store both the CRC32 and the Alder32 for each url, the odds of a simultaneous collision for both hashes for any given pair of urls is vastly reduced.

    Of course that means storing twice as much information which is a part of your original problem. However, there is a way of storing both sets of hash data such that it requires minimal memory (10kb or so) whilst giving almost the same lookup performance (15 microsecs/lookup compared to 5 microsecs) as Perl's hashes.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Hashing urls with Adler32
by BrowserUk (Patriarch) on May 31, 2007 at 16:51 UTC
Re: Hashing urls with Adler32
by BrowserUk (Patriarch) on Jun 01, 2007 at 08:28 UTC

    A final word. Don't use Alder32 (alone) for this purpose!

    Running Adler, CRC32 and both on several sets of 1 million randomly generated url-like strings ranging from 16 to 128 characters in length, Adler produced duplicates in ~1% of cases; CRC32 produced ~0.2%; and in several runs the combination of both found just 2 duplicates (circa. 0.002% but not enough samples to be judged representative).


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.