in reply to Re: Remove Duplicate Files
in thread Remove Duplicate Files

Agreed. You can make it a lot more efficient by stat()ing all the files and only bothering to compare the contents of those that are the same size. Another small improvement comes from noting that files with the same device number and inode number are hard links to the same data, so they are guaranteed to be identical and there's no need to compare their contents; that said, this may not be portable to non-Unixy platforms.

You should also be careful about how you compare symlinks and device files.
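
Pulled together, that first pass might look something like this (an untested sketch; %by_size, the comparison stub, and the directory handling are mine):

    use strict;
    use warnings;
    use File::Find;

    my %by_size;    # size in bytes => list of paths

    find(sub {
        return if -l $_;        # skip symlinks outright
        return unless -f _;     # regular files only (reuses the lstat done by -l)
        push @{ $by_size{ -s _ } }, $File::Find::name;
    }, @ARGV ? @ARGV : '.');

    # Only groups of two or more same-sized files can contain duplicates,
    # so only those ever need their contents compared.
    for my $paths (values %by_size) {
        next if @$paths < 2;
        # ... compare contents (or hashes) of @$paths here ...
    }

The -l test keeps symlinks from being followed and the -f test drops device files, which sidesteps the caveat above; the same-device/same-inode shortcut belongs in the comparison step.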

Re^3: Remove Duplicate Files
by Anonymous Monk on Oct 29, 2004 at 09:34 UTC
    A further improvement can be made by reading in just the first 1024 bytes or so and calculating an MD5 digest from that. Only if those digests match do you do a full comparison.
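
    Sketched in Perl (untested; Digest::MD5 ships with Perl, and the prefix_md5/refine_by_prefix names are made up for the example):

        use strict;
        use warnings;
        use Digest::MD5 qw(md5_hex);

        # MD5 of just the first 1024 bytes: a cheap first-pass filter.
        sub prefix_md5 {
            my ($path) = @_;
            open my $fh, '<:raw', $path or return;
            read $fh, my $buf, 1024;
            return md5_hex($buf);
        }

        # Within a group of same-sized files, bucket by prefix digest;
        # only files sharing a bucket need the expensive full comparison.
        sub refine_by_prefix {
            my %bucket;
            for my $path (@_) {
                my $digest = prefix_md5($path);
                next unless defined $digest;
                push @{ $bucket{$digest} }, $path;
            }
            return grep { @$_ > 1 } values %bucket;
        }
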
Re^3: Remove Duplicate Files
by gaal (Parson) on Oct 29, 2004 at 08:34 UTC
    Then again, hardlinks are less of a concern for cleanup, because they don't waste disk space.
      Well, any program that compares files and removes duplicates without looking at whether they are links will remove the excess links. By looking at the inode and device numbers to detect links, you can gain one of two things: the option to *keep* links, which can be pretty useful for binaries that behave differently depending on how they are invoked, or a speedier comparison, since you don't have to calculate the MD5 hash and then compare the entire file.
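
      The test itself is tiny; a sketch, with a made-up same_file() name:

          use strict;
          use warnings;

          # Two paths name the very same file exactly when they agree on
          # both device number and inode number (Unix-specific).
          sub same_file {
              my ($path1, $path2) = @_;
              my ($dev1, $ino1) = (stat $path1)[0, 1];
              my ($dev2, $ino2) = (stat $path2)[0, 1];
              return unless defined $dev1 && defined $dev2;
              return $dev1 == $dev2 && $ino1 == $ino2;
          }

      A duplicate finder can then either keep such pairs (they already share storage) or treat them as equal without reading a byte.
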
        Of course. I was pointing out that if the purpose of the tool was to reduce disk usage, keeping hardlinks wouldn't hurt its functionality. You are right that hardlinks can often be a good thing, but without further information about the environment this was supposed to run in, we can't tell whether leaving them is the right thing. (Probably, it's just irrelevant and ok to leave undefined.)