There are plenty of scripts for finding duplicate files in a file system, and it is worth looking at what they do. Search for "perl find duplicate files" and you'll find scripts like this one:
http://www.perlmonks.org/?node_id=49198
MD5 is very popular and probably suitable for this task, even though it has known weaknesses that make it undesirable for security applications. SHA1 is also a reasonable choice. Since compute power is so cheap these days, why not use both? Just concatenate the MD5 and SHA1 hashes for a very discriminating hash!
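A minimal sketch of the concatenated hash, assuming the Digest::MD5 and Digest::SHA modules are available (both ship with recent Perls). The name combined_digest is made up for illustration, and the 64K read size is an arbitrary choice:

```perl
use strict;
use warnings;
use Digest::MD5;
use Digest::SHA;

# Return the MD5 and SHA-1 hex digests of a file, concatenated.
# combined_digest() is a hypothetical helper name, not part of any module.
sub combined_digest {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";

    my $md5 = Digest::MD5->new;
    my $sha = Digest::SHA->new(1);    # 1 selects SHA-1

    # Read the file once and feed each chunk to both digests.
    while (read($fh, my $buf, 65536)) {
        $md5->add($buf);
        $sha->add($buf);
    }
    close $fh;

    # 32 hex characters of MD5 followed by 40 hex characters of SHA-1.
    return $md5->hexdigest . $sha->hexdigest;
}
```

Reading the file once and updating both contexts keeps the extra cost to CPU time only; the disk is not read twice.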
In any case, I'd use the length of the file as the first discriminator -- files of different lengths cannot be duplicates, so that will greatly reduce the amount of comparing you have to do.
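Something along these lines would do the length pass, as a rough sketch using File::Find. The name candidates_by_size is hypothetical, and skipping symlinks is my own assumption about what you'd want:

```perl
use strict;
use warnings;
use File::Find;

# Group regular files under the given directories by size in bytes.
# Returns a hash ref of size => [paths], keeping only sizes shared by
# two or more files, since a file with a unique size has no duplicate.
# candidates_by_size() is a hypothetical helper name.
sub candidates_by_size {
    my (@dirs) = @_;
    my %by_size;
    find(
        {
            no_chdir => 1,    # $_ holds the full path inside wanted
            wanted   => sub {
                return if -l $_ || !-f $_;    # skip symlinks and non-files
                my $size = (stat $_)[7];
                push @{ $by_size{$size} }, $_;
            },
        },
        @dirs
    );
    # Drop sizes that only one file has.
    delete $by_size{$_} for grep { @{ $by_size{$_} } < 2 } keys %by_size;
    return \%by_size;
}
```

Only the files left in the returned hash need to be hashed at all.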
Update: It has just occurred to me that a really slick way of doing this would be to "incrementally evaluate" the hash function, so that you could limit the amount of each file that you read from disk. The hash function could be a composite hash consisting of:
- the length of the file
- the MD5 hash of the first 1K
- the MD5 hash of the first 2K
- ...
- the MD5 hash of the first 2^n K.
These hash components would only be evaluated as necessary. So, if the length of the file uniquely determined the file, no MD5 hash would be computed. Of course, you can replace MD5 with the hash function of your choice, but one reason I mention it is that the Digest::MD5 module makes it easy to compute the intermediate hash values of prefixes of your input, so you don't have to re-run the hash function over the previous input to add on to it.
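Here is an untested-in-anger sketch of the incremental idea for a set of files already known to have the same length. The names make_prefix_hasher and group_duplicates are made up for illustration. The one Digest::MD5 detail it leans on is that hexdigest resets the context, so each intermediate value is taken from a clone while the original keeps accumulating:

```perl
use strict;
use warnings;
use Digest::MD5;

# Build a closure returning the MD5 hex digest of the first $k KiB of $path.
# The running context is kept between calls, so asking for a longer prefix
# only reads bytes not yet seen. Prefixes must be requested in increasing
# order. make_prefix_hasher() is a hypothetical helper name.
sub make_prefix_hasher {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $ctx  = Digest::MD5->new;
    my $seen = 0;    # bytes fed to $ctx so far

    return sub {
        my ($k) = @_;
        my $want = $k * 1024 - $seen;    # additional bytes needed
        while ($want > 0) {
            my $got = read($fh, my $buf, $want);
            die "Read error on $path: $!" unless defined $got;
            last if $got == 0;           # end of file
            $ctx->add($buf);
            $seen += $got;
            $want -= $got;
        }
        # hexdigest() resets the context, so digest a clone instead.
        return $ctx->clone->hexdigest;
    };
}

# Split same-length files into groups of likely duplicates by comparing
# digests of the first 1K, 2K, 4K, ... of each file. A file that ends up
# alone in its group is never read any further.
sub group_duplicates {
    my ($size, @paths) = @_;
    my %hasher = map { $_ => make_prefix_hasher($_) } @paths;
    my @groups = (\@paths);
    for (my $k = 1; ; $k *= 2) {
        my @next;
        for my $group (@groups) {
            my %by_digest;
            push @{ $by_digest{ $hasher{$_}->($k) } }, $_ for @$group;
            push @next, grep { @$_ > 1 } values %by_digest;
        }
        @groups = @next;
        # Stop when nothing is left or the whole file has been hashed.
        last if !@groups || $k * 1024 >= $size;
    }
    return @groups;
}
```

Note that this keeps one filehandle open per candidate file, so a very large group of same-length files could run into the per-process limit on open files; reopening and seeking would avoid that at the cost of more bookkeeping.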