leocharre has asked for the wisdom of the Perl Monks concerning the following question:

I've run into something that's messing with my head.

I have two identical files on two POSIX machines. The command-line md5sum returns the same digest for both.. but.. if I use Digest::MD5::md5_hex() instead, I get different values. Grrr..

What's up? Is it the version of the module or is it the system architecture? Is md5_hex not reliable? Will this keep changing over time.. (oh please no)

I gotta figure this out. I am using md5sum as the authority to index files across the network from multiple machines at once.. :-/ I gotta pick the one that sticks and maybe do some funny things to some tables. Any hint is appreciated!

  • Comment on Digest::MD5::md5_hex giving different values on different machines

Re: Digest::MD5::md5_hex giving different values on different machines
by BrowserUk (Patriarch) on Jun 20, 2007 at 16:32 UTC

    To further your cause for information, it might help if you showed Perl/Digest::MD5 and md5sum producing different results. Eg. c&p the shell output.

    That would show which combination is producing the wrong results.

    Given sisyphus' recent problems with using integers on 64-bit platforms, one possibility is that the conversion of the 128-bit integer to hex has issues there.

    Another possibility is that you aren't calling binmode on the file before reading it. That's a more common mistake on Win32, but with the advent of Unicode, maybe it is affecting the results here too?

    Example: The first perl command doesn't binmode the file, the second does:

    C:\test>perl -MDigest::MD5=md5_hex -wle"print md5_hex( do{local $/; <> } )" 1Mx4096.db
    22502b12bc292ae0a0aa1f4a33942662

    C:\test>perl -MDigest::MD5=md5_hex -wle"open I, '<:raw', $ARGV[ 0 ]; print md5_hex(do{local $/; <I>})" 1Mx4096.db
    66bff1dfd44db7d4402171056d494b2d

    C:\test>md5sum 1Mx4096.db
    66bff1dfd44db7d4402171056d494b2d *1Mx4096.db

    Update: As ikegami alludes to below, in the light of Using ":raw" layer in open() vs. calling binmode(), whether it's broken documentation or a bug in PerlIO, it seems that you need '<:raw:perlio' to achieve the same as binmode via open, which is important if performance is a consideration.
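For a script rather than a one-liner, the same idea might look like the sketch below. It calls binmode explicitly instead of relying on an open layer, then streams the handle through Digest::MD5's addfile so the whole file is never held in memory. The helper name file_md5_hex is made up for illustration and is not part of the module.

```perl
use strict;
use warnings;
use Digest::MD5;

# Hypothetical helper (not part of Digest::MD5): hex MD5 of a file's raw bytes.
sub file_md5_hex {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;    # raw bytes: no CRLF translation, no encoding layers
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $digest;
}

# Print one "digest  filename" line per file named on the command line.
print file_md5_hex($_), "  $_\n" for @ARGV;
```

Run on both machines against the same file, the output should agree with md5sum if the bytes on disk really are identical.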


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      You should probably be using binmode...

      In fact, that could very well be the problem if one of the machines is a Windows machine and the other one isn't.

Re: Digest::MD5::md5_hex giving different values on different machines
by Joost (Canon) on Jun 20, 2007 at 18:56 UTC
    Last I checked (a few days ago), md5sum and Digest::MD5's md5_hex gave the same result on RHEL5/x86_64.

    Checking right now, md5sum and Digest::MD5's md5_hex give the same result on my Debian Linux / x86_32.

    I can't compare the 64 bit result against the 32 bit one at the moment, though.

    update: and yeah, some example data and code would probably help us help you. For all I know you're just passing Digest::MD5 the filename instead of the file's content.
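To make the filename-versus-content pitfall concrete, here is a small sketch. It writes its own temporary sample file (an assumption, so the example runs anywhere) and prints both digests. Since md5_hex hashes whatever string it is handed, passing a path hashes the characters of the path, which will differ between machines whenever the paths do.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use File::Temp qw(tempfile);

# Write a small sample file so the sketch is self-contained.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "same bytes on every machine\n";
close $out or die "Cannot close $path: $!";

# Pitfall: this hashes the characters of the path, not the file.
my $digest_of_name = md5_hex($path);

# Intended: slurp the raw bytes and hash those.
open my $in, '<', $path or die "Cannot open $path: $!";
binmode $in;
my $digest_of_content = md5_hex( do { local $/; <$in> } );
close $in;

print "name:    $digest_of_name\n";
print "content: $digest_of_content\n";
```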

Re: Digest::MD5::md5_hex giving different values on different machines
by Cabrion (Friar) on Jun 21, 2007 at 12:31 UTC
    I ran into this several times and discovered every time that what I was hashing wasn't what I thought I was hashing. In one case, I was hashing the concatenated contents of a hash, and Perl does not always return hash entries in the same order. The second time, I was concatenating fields returned from a database and figured out that concatenating nulls without a field separator would return unexpected results. The third time, I was inadvertently passing a reference to a variable rather than its value, so a memory address was part of the input and would change from invocation to invocation.

    My point: It's probably not the module, it's your input. Capture the raw inputs and compare them byte-by-byte and you will find your answer.
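One way to guard against the first two of those traps is to build a canonical string before hashing. The sketch below is only an illustration: record_md5_hex is a made-up helper, and it assumes that keys and values are byte strings that never contain the separator bytes \x1e and \x1f. Keys are sorted so storage order cannot matter, and undef (a database NULL) is tagged so it stays distinct from the empty string.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical helper: digest of a hash's contents, independent of key order.
# Assumes keys and values are byte strings without \x1e or \x1f in them.
sub record_md5_hex {
    my ($record) = @_;
    my $canonical = join "\x1e", map {
        my $value = $record->{$_};
        # Tag each value so undef (NULL) cannot be confused with ''.
        $_ . "\x1f" . ( defined $value ? "v$value" : 'u' );
    } sort keys %$record;
    return md5_hex($canonical);
}

# Without separators, both rows would concatenate to the same "7ab".
my %row_a = ( id => 7, name => 'ab', note => undef );
my %row_b = ( id => 7, name => 'a',  note => 'b' );
print record_md5_hex( \%row_a ), "\n";
print record_md5_hex( \%row_b ), "\n";
```

If the digests still differ between machines after this kind of normalisation, dumping the canonical string on each side and comparing it byte by byte is the quickest way to see where the inputs diverge.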

Re: Digest::MD5::md5_hex giving different values on different machines
by EvanCarroll (Chaplain) on Jun 20, 2007 at 16:34 UTC
    Every now and then formulas set in stone get up and change on you. That's part of life.
    If what you say is true, posting the file would be very appropriate, along with a bug report on Digest::MD5. Check the build notes between v2.33 and v2.36 and see if this issue was addressed.


    Evan Carroll
    www.EvanCarroll.com
Re: Digest::MD5::md5_hex giving different values on different machines
by Fletch (Bishop) on Jun 20, 2007 at 16:34 UTC

    More information might help; specifically:

    What values match the md5sum value? Does the newer Digest::MD5 on A agree with md5sum, or does B match?

Re: Digest::MD5::md5_hex giving different values on different machines
by dsheroh (Monsignor) on Jun 22, 2007 at 15:14 UTC
    Lots of good possibilities already mentioned that I wouldn't have thought of, but don't forget a very simple one:

    Do both data sources have (or both not have) a trailing newline?

    The first time I used MD5 for storing passwords, I spent a while banging my head against the wall trying to figure out why the same test passwords would generate one hash in the command-line tools and a different one from Perl. Appending a "\n" to the password before passing it to md5_hex fixed that right up. (Not that this is terribly likely in your case, since you appear to be working from a file rather than a plain string, but I'll mention it for completeness.)
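A quick way to see the trailing-newline effect, using the string "hello" as a stand-in for the real input:

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

my $plain        = md5_hex("hello");
my $with_newline = md5_hex("hello\n");    # what `echo hello | md5sum` hashes

print "$plain\n";           # 5d41402abc4b2a76b9719d911017c592
print "$with_newline\n";    # b1946ac92492d2347c6235b4d2611184
```

A line read from a file or STDIN in Perl keeps its newline until it is chomped, so the mismatch can go either way depending on which side strips it.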

Re: Digest::MD5::md5_hex giving different values on different machines
by ftumsh (Scribe) on Jun 21, 2007 at 09:05 UTC
    Differing files may give the same md5, just to complicate things... http://www.cits.rub.de/MD5Collisions/
      Differing files may give the same md5.

      For any hash function, there will be collisions, infinitely many in fact. (Informal proof: once you've hashed as many distinct files as there are possible hash values, hashing yet another distinct file must produce a collision somewhere.)

      The key here is a Cryptographic Hash Function.

      Are you trying to detect accidental differences or malicious ones?

      MD5 is very good at detecting the former with high confidence. Given the published research, it isn't as good for the latter.

Re: Digest::MD5::md5_hex giving different values on different machines
by Anonymous Monk on Jun 27, 2007 at 20:27 UTC
    What is also possible is that the files actually *are* changing. If this is a multi-user environment, it is possible the files are being changed. They may even *look* the same, but *aren't*. You said yourself you are indexing multiple machines, multiple documents. I have had a similar situation with PDF documents. I had two copies of the same document with the same filename; I opened them and they had the same content. These files were hard-copy papers scanned in via an Ikon scanner and turned into PDF documents. It turns out.. somebody made a mistake! They scanned the same document twice and gave it the same name! md5sum was telling me these were different documents, and I was refusing to believe it because my eyes told me different. Watch out for this possibility.