Hash functions like MD5 can have collisions. This means two different files can have the same MD5 checksum.
You're right, they can. Indeed, I swear I actually saw this once when generating MD5s from 1,000,000 web pages, but I never succeeded in reproducing it. People who understand statistics tell me that the likelihood is extremely low. So low (unless you deliberately set out to achieve it) that hell is likely to freeze over first, or something like that :)
If you ever actually encounter two real files with the same MD5, and they are not proprietary or private, could you let me have a copy of each? I have some analysis code I would like to run on them.
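To put a rough number on "extremely low", here is a minimal sketch of the standard birthday approximation. It assumes MD5 behaves like an ideal 128-bit hash on inputs nobody crafted on purpose; the function name `collision_probability` is made up for illustration.

```python
import math

def collision_probability(n_items: int, hash_bits: int = 128) -> float:
    """Approximate chance that at least two of n_items share a digest.

    Assumption: the hash is ideal (uniformly random output) and the
    inputs were not deliberately constructed to collide.
    """
    pairs = n_items * (n_items - 1) / 2
    # Birthday approximation: p ~ 1 - exp(-pairs / 2**bits).
    # expm1 keeps precision when the exponent is tiny.
    return -math.expm1(-pairs / 2.0 ** hash_bits)

if __name__ == "__main__":
    # One million documents, as in the web-page run mentioned above.
    print(collision_probability(1_000_000))
```

Under those assumptions, one million documents give a probability on the order of 1e-27. The estimate says nothing about deliberately constructed collisions.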
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] |
It is supposed to be very rare. I don't know how rare. I don't think I have ever seen it happen by accident. At least I have never noticed. Even though it is rare, I try to avoid programming with the view "that probably won't ever cause problems".
I was curious, so I googled and found two PostScript files with the same MD5 hash. To be fair, someone did set out to generate two files with the same MD5 checksum.
| [reply] |
Thanks. That is really intriguing. Did you look at those two files as plain text as well as via ghostscript or similar? I'm not familiar enough with PostScript to understand how the 5 characters changed in the binary bit at the top can cause two otherwise identical documents to appear so different when formatted. Kind of reinforces my distaste for non-plain-text communications media.
Even though it is rare, I try to avoid programming with the view "that probably won't ever cause problems".
Agreed. The 'problem' with the MD5 hash, and all other hashes for that matter, is applications that use them under the assumption that either clashes cannot happen, or that they are so rare that there is no need to verify them. This matters especially for security/cryptography applications.
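One way to avoid that assumption is to treat a matching digest as a hint only and confirm with a byte-for-byte comparison. The following is a minimal sketch; the helper names `md5_of` and `same_content` are invented for illustration and do not come from any particular application.

```python
import filecmp
import hashlib

def md5_of(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def same_content(path_a, path_b):
    """True only if both files hold identical bytes."""
    if md5_of(path_a) != md5_of(path_b):
        # Different digests: the contents definitely differ.
        return False
    # Equal digests: could still be a collision, so verify the bytes.
    return filecmp.cmp(path_a, path_b, shallow=False)
```

The digest check is cheap to cache and rules out most pairs quickly; the full comparison only runs for the few pairs whose digests match.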
A digest/hash function that can represent a document or file of any size with a short, fixed-length 'unique' signature is mathematically impossible (a bit like infinite lossless compression :), and any security application that relies on that is just plain broken.
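The counting argument is easy to see on a small scale. As an illustrative sketch (not an attack on full MD5), truncate the digest to 2 bytes: there are then only 65,536 possible values, so at most 65,537 distinct inputs are needed before two of them must share one. The names `truncated_md5` and `first_collision` are hypothetical.

```python
import hashlib

def truncated_md5(data: bytes, n_bytes: int = 2) -> bytes:
    """First n_bytes of the MD5 digest (a deliberately tiny hash)."""
    return hashlib.md5(data).digest()[:n_bytes]

def first_collision(n_bytes: int = 2):
    """Hash the strings "0", "1", "2", ... until two share a truncated digest."""
    seen = {}
    i = 0
    while True:
        message = str(i).encode()
        tag = truncated_md5(message, n_bytes)
        if tag in seen:
            # Two different inputs, one identical short digest.
            return seen[tag], message
        seen[tag] = message
        i += 1

if __name__ == "__main__":
    print(first_collision())
```

The same argument applies to the full 128 bits; the space is simply far too large to exhaust by brute force in this way.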
About the best you can do is compute two or more different digests of the document, which should make it much, much harder to generate two disparate but meaningful documents that produce the same digests through the different hash functions.
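A minimal sketch of that idea, using Python's standard `hashlib`. The pairing of MD5 with SHA-256 is an assumption made for the example, and `multi_digest` is a hypothetical helper name.

```python
import hashlib

def multi_digest(data: bytes, algorithms=("md5", "sha256")):
    """Return one hex digest per algorithm, in the order given."""
    return tuple(hashlib.new(name, data).hexdigest() for name in algorithms)

def probably_same(data_a: bytes, data_b: bytes) -> bool:
    """True if every digest matches; still not proof of identical content."""
    return multi_digest(data_a) == multi_digest(data_b)

if __name__ == "__main__":
    print(multi_digest(b"hello world"))
```

Two documents would have to collide under every listed function at once to fool this check. It raises the bar, but by the counting argument above it still cannot guarantee uniqueness.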
| [reply] |
Some people have actually developed methods to create MD5 collisions. See http://www.cits.rub.de/MD5Collisions/, where you'll find 2 very different PostScript files sharing the same MD5.
There were interesting discussions about this on Bruce Schneier's blog and in his Crypto-Gram newsletter; see http://www.schneier.com/
| [reply] |
Thanks. See also 535916
| [reply] [d/l] |