in reply to Re: compare images
in thread compare images

Hash functions like MD5 can have collisions. This means two different files can have the same MD5 checksum

You're right, they can. Indeed, I swear I actually saw this once when generating md5s from 1,000,000 web pages; but I never suceeded in reproducing it. People who understand statistics tell me that the likehood is extremely low. Like so low (unless you delibertely set out to achieve it), that hell is likely to freeze over first--or something like that :)

If you ever actually encounter two real files with the same md5s, and they are not proprietory or private, could you let me have a copy of each. I have some analysis code I would like to run on them.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re^3: compare images
by superfrink (Curate) on Mar 11, 2006 at 07:34 UTC
    It is supposed to be very rare. I don't know how rare. I don't think I have ever seen it happen by accident. At least I have never noticed. Even though it is rare I try to avoid programming with the view "that probably won't ever cause problems".

    I was curious so I googled and found two postscript files with the same MD5 hash. To be fair someone did set out to generate two files with the same MD5 checksum.

      Thanks. That is really intriguing. Did you look at those two files as plain text as well as via ghostscript or similar? I'm not familiar enough with postscript to understand how the 5 characters changed in the binary bit at the top can cause two otherwise identical documents to appear so different when formatted? Kind of reenforces my distaste for non-plain text communications mediums.

      Even though it is rare I try to avoid programming with the view "that probably won't ever cause problems".

      Agreed. The 'problem' with the MD5 hash, and all other hashes for that matter, are applications that use them under the assumption that either clashes cannot happen, or are so rare that there is no need to verify them. Especially for security/cryptography applications.

      The assumption that any digest/hash function that can represent any size document of file with a short, fixed length 'unique' signature is mathematically impossible (a bit like infinite lossless compression :), and any security application that relies on that in just plain broken.

      About the best you can do is compute two more different digests of the document which should make it much, much harder to generate two disperate, but meaningful documents that produce the same digests through the different hash functions.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        Did you look at those two files as plain text as well as via ghostscript or similar? I'm not familiar enough with postscript to understand how the 5 characters changed in the binary bit at the top can cause two otherwise identical documents to appear so different when formatted?
        I looked -- actually there are 7 bytes that differ:
        byte value in value in xor offset "letter" "order" diff -------------------------------- 0x53 97 17 80 0x6d a3 23 80 0x6e 78 79 01 0x7b 5a da 80 0x93 c8 48 80 0xad d8 58 80 0xbb 6f ef 80
        Obviously there's some clue here about how md5 actually works. As for how this example was implemented, I can easily imagine putting both texts into the single .ps file, and using some binary differences at the top to toggle between ignoring (i.e. not displaying) one text or the other.
        Kind of reenforces my distaste for non-plain text communications mediums.
        Amen to that.
        blockquote>About the best you can do is compute two more different digests of the document which should make it much, much harder to generate two disperate, but meaningful documents that produce the same digests through the different hash functions.

        As far as I know, most digests (MD4, MD5, SHA1...) are relying upon the same mathematical rules and concepts (such as Galois groups). As virtually all of these digest systems have be "broken", it may be conceivable to forge two different files sharing more than one digest type !

Re^3: compare images
by wazoox (Prior) on Mar 11, 2006 at 10:22 UTC
    Some people actually have developed method to create MD5 collisions. See :http://www.cits.rub.de/MD5Collisions/, there you'll find 2 very different postscript files sharing the same MD5.
    There were interesting dicussions about this on Bruce Schneier blog and in his Crypto-gram newsletter, see : http://www.schneier.com/

      Thanks. See also 535916


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.