PerlMonks  

Re^5: Assistance with file compare

by gmargo (Hermit)
on Oct 28, 2009 at 20:23 UTC ( [id://803789] )


in reply to Re^4: Assistance with file compare
in thread Assistance with file compare

Well, there's your answer right there: just compare the file sizes. Then only compare (or md5) files whose sizes match. As long as your files aren't all the same length, that could be the fastest.
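A minimal sketch of that approach in Perl (the file list, sub name, and the duplicate grouping are mine, not from the post): bucket files by size with a cheap stat, then only MD5 the files whose sizes collide.

```perl
use strict;
use warnings;
use Digest::MD5;

# Return groups of files that share both size and MD5 digest.
# Only files whose sizes collide are ever read and hashed.
sub duplicate_candidates {
    my @files = grep { -f } @_;

    # Pass 1: bucket by size -- stat() is cheap compared to reading the file.
    my %by_size;
    push @{ $by_size{ -s $_ } }, $_ for @files;

    # Pass 2: hash only the files whose sizes collide.
    my @groups;
    for my $bucket ( grep { @$_ > 1 } values %by_size ) {
        my %by_md5;
        for my $file (@$bucket) {
            open my $fh, '<', $file or die "$file: $!";
            binmode $fh;
            push @{ $by_md5{ Digest::MD5->new->addfile($fh)->hexdigest } },
                $file;
        }
        push @groups, grep { @$_ > 1 } values %by_md5;
    }
    return @groups;
}
```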

Replies are listed 'Best First'.
Re^6: Assistance with file compare
by ikegami (Patriarch) on Oct 28, 2009 at 20:59 UTC

    just compare the file sizes. Then only compare (or md5) files whose sizes match.

    Not quite. If you need to be absolutely sure the files are identical, the following are efficient ways of achieving this:

    1. Identify files with the same file size.
    2. Of the files with the same file size, identify the files which are identical.

    or

    1. Identify the files with the same hash.
    2. Of the files with the same hash, identify the files which are identical.

    or

    1. Identify files with the same file size.
    2. Of the files with the same file size, identify the files with the same hash.
    3. Of the files with the same file size and hash, identify the files which are identical.

    If you're dealing with many files, the second method is probably the best.
    If you're dealing with just a few files, the first method is probably better.
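    The "identify the files which are identical" step in each method is a plain byte-for-byte comparison; a sketch using the core File::Compare module (the sub name and error handling are mine):

```perl
use strict;
use warnings;
use File::Compare;

# Confirm that two candidate files really are identical, so a hash
# collision can never produce a false positive.
sub confirmed_identical {
    my ($file_a, $file_b) = @_;

    # compare() returns 0 when the contents are identical,
    # 1 when they differ, and -1 on error.
    my $result = compare($file_a, $file_b);
    die "comparison failed: $!" if $result < 0;
    return $result == 0;
}
```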

      I would think that getting the file size is faster than computing the hash for the file. So it seems to me that pruning the list of files for which hashes have to be computed by comparing file sizes would be faster, especially for large numbers of files.

      I am curious to know why your second method is better for many files. Could you enlighten me please?

        I would think that getting the file size is faster than computing the hash for the file.

        You shouldn't be doing either. It should have been done for free when the file was written.

        If you didn't store the hash when the file was written, you could compare files in a clever order and calculate their hashes as they are being compared. This may save you from having to do further compares.
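        One way to read that (this sketch is my interpretation, not ikegami's code): feed every chunk read for the byte comparison into a Digest::MD5 object as well, so the digest of each fully-read file falls out of the compare at no extra I/O cost.

```perl
use strict;
use warnings;
use Digest::MD5;

# Compare two files byte-for-byte while computing both MD5 digests
# from the same reads. Returns (identical?, digest_a, digest_b).
sub compare_and_hash {
    my ($file_a, $file_b) = @_;
    open my $fh_a, '<', $file_a or die "$file_a: $!";
    open my $fh_b, '<', $file_b or die "$file_b: $!";
    binmode $_ for $fh_a, $fh_b;

    my ($md5_a, $md5_b) = (Digest::MD5->new, Digest::MD5->new);
    my $same = 1;
    while (1) {
        my $read_a = read $fh_a, my $buf_a, 65536;
        my $read_b = read $fh_b, my $buf_b, 65536;
        die "read error: $!" unless defined $read_a && defined $read_b;

        # Each chunk serves double duty: comparison and hashing.
        $md5_a->add($buf_a);
        $md5_b->add($buf_b);
        $same = 0 if $buf_a ne $buf_b;
        last if $read_a == 0 && $read_b == 0;
    }
    return ($same, $md5_a->hexdigest, $md5_b->hexdigest);
}
```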

        So it seems to me that pruning the list of files for which hashes have to be computed by comparing file sizes would be faster, especially for large numbers of files.

        As the number of files grows, the number of collisions in file size grows, so pruning by size alone leaves more and more files that still have to be hashed or compared anyway.
