in reply to Save space with CRC

There are a lot of scripts to find duplicate files in a file system, and you might want to see what they do. Just google: perl find duplicate files, and you'll find scripts like this one:

http://www.perlmonks.org/?node_id=49198

MD5 is very popular and probably suitable for this task even though it has been found to have some weaknesses which make it undesirable for security applications. SHA1 is also a reasonable choice. Since compute power is so cheap these days, why not just use both -- just concatenate the MD5 and SHA1 hashes together for a very discriminating hash!
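
For example, a small sketch of that combined key (untested, and the combined_digest helper name is just mine):

    use Digest::MD5;
    use Digest::SHA;

    # Concatenate the MD5 and SHA-1 hex digests of one file into a single key.
    sub combined_digest {
        my ($path) = @_;
        open my $fh, '<', $path or die "Can't open $path: $!";
        binmode $fh;
        my $md5 = Digest::MD5->new->addfile($fh)->hexdigest;
        seek $fh, 0, 0;                          # rewind for the second pass
        my $sha = Digest::SHA->new(1)->addfile($fh)->hexdigest;
        close $fh;
        return $md5 . $sha;
    }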

In any case, I'd use the length of the file as the first determinant -- that will greatly reduce the amount of comparing you have to do.
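
For instance, a rough sketch of that first pass (the directory name here is made up) could be:

    use File::Find;

    # Group file paths by size; only files that share a size need hashing later.
    my %by_size;
    find(sub { push @{ $by_size{ -s $_ } }, $File::Find::name if -f $_ }, '/some/dir');

    # Files whose size is unique within the tree can be skipped entirely.
    my @suspects = grep { @$_ > 1 } values %by_size;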

Update: It has just occurred to me that a really slick way of doing this would be to "incrementally evaluate" the hash function, so that you could limit the amount of each file that you read from disk. The hash function could really be a composite hash consisting of:

  1. the length of the file
  2. the MD5 hash of the first 1K
  3. the MD5 hash of the first 2K
  4. ...
  5. the MD5 hash of the first 2^n K.

These hash components would only be evaluated as necessary. So, if the length of the file uniquely determined the file, no MD5 hash would be computed. Of course, you can replace MD5 with the hash function of your choice, but one reason I mention it is because the Digest::MD5 module makes it easy to compute the intermediate hash values of prefixes of your input so you don't have to re-run the hash function over the previous input to add on to it.
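
As a sketch of that idea (untested; the next_component name and the state hash layout are mine), Digest::MD5's clone method lets you peek at the digest of the prefix read so far and keep adding to it later:

    use Digest::MD5;

    # State per file, e.g. { fh => $fh, md5 => Digest::MD5->new, total => 0 }.
    # Each call hashes just enough new data that the digest now covers a prefix
    # twice as long as before (1K, 2K, 4K, ...), never re-reading earlier bytes.
    sub next_component {
        my ($state) = @_;
        my $want = $state->{total} || 1024;      # read as much again as we already have
        my $n = read($state->{fh}, my $buf, $want);
        if ($n) {
            $state->{md5}->add($buf);
            $state->{total} += $n;
        }
        # clone() so we can look at the digest so far and still extend it later
        return $state->{md5}->clone->hexdigest;
    }

You would compare candidate files component by component -- first by size, then by these prefix digests -- and only hash further when everything so far still matches.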

Re^2: Save space with CRC
by Anonymous Monk on Dec 19, 2007 at 19:03 UTC

    Many thanks,

    You're right, this is probably a very good solution, because we don't compute it all, only when there is some collision or difference; it's simple, easy and faster.

    But the only change I would perhaps make is to always compute an MD5 in this case, because the length of the file alone is not a reliable test, is it?

    Well, many thanks, big help; I'm also reading the other suggestions you told me about.

    Thanks again

      If you take into account MD5 of the file in addition to some other factors, such as file size or hash of the reverse of the file, you'll be okay. Also, CRC is used to detect errors in data transmission or storage - not for comparing two files.
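
      For example, a small sketch of such a composite key (it slurps the whole file to reverse it, so it only makes sense for reasonably sized files):

          use Digest::MD5 qw(md5_hex);

          # Composite key: size, MD5 of the content, and MD5 of the reversed content.
          sub file_key {
              my ($path) = @_;
              open my $fh, '<', $path or die "Can't open $path: $!";
              binmode $fh;
              my $data = do { local $/; <$fh> };   # slurp the whole file
              close $fh;
              return join '|', length($data), md5_hex($data), md5_hex(scalar reverse $data);
          }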

        Thanks perlfan too; another very good suggestion is to use the "hash of the reverse of the file". I hadn't thought of that, but it's really good.

        Please note that CRC is mostly used to check transmissions, but in fact CRC is a "simple" method to transform input data of any size into a fixed-size integer, no less, no more; you can use it for anything you want.

        The bad thing about it is that it's very short and there can be collisions.
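
        For example, with the Digest::CRC module from CPAN (assuming it is installed):

            use Digest::CRC qw(crc32);

            # Input of any length is reduced to one 32-bit integer.
            my $checksum = crc32("some arbitrary data of any length");
            printf "%u (0x%08x)\n", $checksum, $checksum;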

        Thanks a lot to all.