Re^3: MD5 - what's the alternative

a finite, large number but not infinite.

md5 hashing can be applied to inputs of any length. "Any length" means an infinite number of possible inputs. Adapting the old schoolboy definition of infinite, however many inputs you have hashed, you can always add one more byte to any one of them to get one more. Infinite.

Every input in this range of inputs will be mapped to one of the 2**128 possible md5s.

Therefore the number of inputs that will map to any given md5 is infinity / 2**128

Which is infinity.

It has often been said that there are collisions in MD5.

Of course there are. It could not be otherwise. Whenever the range of inputs is greater than the range of outputs, there will always be collisions. It simply is not possible for it to be otherwise. (Would that it were, it would make a brilliant compression algorithm:)

Theoretically, if you generated the md5s for all of the (16-byte binary encoded) integers 0 .. 2**128 -1, you might not find any duplicates. That is to say, the input range equals the output range, so it is possible that the md5 algorithm would be a 'perfect hash' for the input data.

And if you double the length of the input data to 32 bytes, then (theoretically) there will be exactly 2 inputs (plaintexts) for every output (md5).

I don't think that this "perfect hash" status has ever been confirmed (or even suggested). It would simply take too long to verify.

The example of duplicate plaintexts posted were 128 (8-bit) bytes each. 2**1024 inputs / 2**128 outputs.

There will be at least 8 duplicate, 128-byte inputs for every md5. That makes it much easier to find a duplicate.

As far as your Linux passwords are concerned, not only would the cracker need to generate a plaintext that gave the same MD5, it would also need to be of a length that the password verification process would accept as a password. Whilst accepting longer passwords generally gives greater security, that is only true up to a certain limit. With md5, that limit is 16 (8-bit) bytes.

If the plaintexts where restricted to 32 x 8-bit bytes, then there will be 2**256 inputs and 2**128 outputs. Therefore there *must be* at least 2 x 32-byte inputs that will produce every md5.

If the plaintext is restricted to 16 x 8-bit bytes, then it is possible (but not proven), that the are no duplicates.

If the inputs are restricted to 12 (8-bit bytes), then it is very unlikely that it will be duplicates. If you further restrict that to 7-bit ASCII, even less likelyhood. Exclude control characters, less again.

This is one of those cases where more isn't always better.

Actually, in my experience, more is very rarely better in computing:)

Examine what is said, not who speaks.

"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

Comment on Re^3: MD5 - what's the alternative Select or Download Code


good chemistry is complicated, and a little bit messy -LW
	PerlMonks