Re: Replace duplicate files with hardlinks

Monks not familiar with "hard links" would need to understand the following details:

The concept of hard links applies only to unix/linux (including macosx).
Hard links only work within a given disk volume (you can't have a hard link on one disk that points to a file on another disk).
Hard links only apply to data files, not to directories or other file types (e.g. devices, symbolic links).
Creating one or more hard links to a given file is really just a matter of having more directory entries describing/pointing to that file.
Once a hard link is created, you can't really identify it as such (i.e. as anything other than a plain data file). You can figure out when a given file has more than one directory entry describing/pointing to it (checking the link count shown by "ls -l"), and you can figure out which directory entries point to the same file (checking for matching inode numbers with "ls -i") **, but all entries have "equal status" -- the original directory entry is simply equivalent to (i.e. one of) the hard links.

With those details in mind, I suspect that if you run your script repeatedly in succession on the same path, it will find/rename/replace/delete the same set of duplicate files, more or less identically, on each run.

There's nothing in the File::Find::Duplicates man page about how it determines files to be duplicates, and there is no reason to expect that it knows or cares about existing hard links (since these are not mentioned in the docs, and are OS-dependent anyway). So, existing hard links will probably look like duplicates, and will be (re)replaced on every run.

For that matter, I wonder what that module would do if you were to replace duplicate files with symbolic links instead of hard ones. I think the *n*x notion of "symlinks" ports to MS-Windows as "short-cuts", so this may be somewhat more portable, but you'd have to look at the sources for F::F::Dups to see whether it picks up on the difference between a data file and any sort of link.

In any case, I tend to prefer symlinks anyway -- there tends to be less confusion when it comes to figuring out actual vs. apparent disk space usage.

And that brings up another point you might want to test with your script: does F::F::Dups know enough to leave symlinks alone, or does it follow them when looking for dups? If the latter, you can get into various kinds of trouble, like trying to create hard links to files on different volumes (won't work) or even deleting the target of a symlink while leaving the symlink itself as the "unique version" -- which then becomes a stale link with no existing data file as the target. Note that a symlink can have a directory as its target (as well as files/directories on different disks), so if your script runs on a tree like this:

  toplevel/
     secondlevel_1/
        thirdlevel_1/
        thirdlevel_2/
            file1.dat
            file2.dat
     secondlevel_2  -> secondlevel_1/thirdlevel_2   # directory symlin
+k
[download]

will there be an apparent duplication of file1.dat and file2.dat under two different paths? If so, ~~I think~~ what is the likelihood that your script will have (or cause) some trouble?

** FOOTNOTE (UPDATE) ** Please note the very informative reply provided below by MidLifeXis. As he points out, my references to "ls -l" and "ls -i" should not be taken as implementation ideas for detecting hard links in a perl script. I mentioned these uses of "ls" merely to cite the easiest way for a person to look into the behaviors of hard links.

Comment on Re: Replace duplicate files with hardlinks Download Code

Replies are listed 'Best First'.
Re^2: Replace duplicate files with hardlinks by MidLifeXis (Monsignor) on Aug 11, 2008 at 16:55 UTC
Once a hard link is created, you can't really identify it as such (i.e. as anything other than a plain data file). You can figure out when a given file has more than one directory entry describing/pointing to it (checking the link count shown by "ls -l"), and you can figure out which directory entries point to the same file (checking for matching inode numbers with "ls -i"), but all entries have "equal status" -- the original directory entry is simply equivalent to (i.e. one of) the hard links. [emphasis added] I have a feeling that we will be speaking at different facets of the problem at hand, but when I read your response, it says to me that the program will have a hard time identifying that a file is a hard link. I would probably make clear that the program as written would have a hard time identifying the duplicates. The application could postprocess the F::F::D output and remove those files already hard linked by using the stat perl builtin. Given the device + inode + hash, you have a hardlink check. I just had the impression, even if it was not intended, that a reader of this response could come away with the feeling that you needed to poll `ls` to determine if a file was a hardlink of another. If you are interested in more detail on the hardlink stuff and how the underlying file system can implement them, see: Unix File System Modern Operating Systems^* Operating System Concepts^* ^*My college reference books on this topic are at home, the revisions have changed (as well as the covers), and my memory is, umm, rusty :). So beware, these books may not be the ones I am thinking of. --MidLifeXis	[reply] [d/l]
Re^2: Replace duplicate files with hardlinks by Anonymous Monk on Aug 10, 2008 at 22:09 UTC
With those details in mind, I suspect that if you run your script repeatedly in succession on the same path, it will find/rename/replace/delete the same set of duplicate files, more or less identically, on each run. You are right. At first I thought it was a bug in my script, but then I realized that, as there is no way of recognizing a hard link as such, repeated runs of the program on the same directory will report identical results. I think the nx notion of "symlinks" ports to MS-Windows as "short-cuts", so this may be somewhat more portable, but you'd have to look at the sources for F::F::Dups to see whether it picks up on the difference between a data file and any sort of link. I checked the source of the module, and it only reports real duplicates. Soft links are discarded by the -f file test, wich returns 0 if the "element" is a directory or a soft link. However, I'm only now thinking about the posibility of creating soft links and the consequences that I might have. I hadn't considered the possibility of running the script in a non-unix envirnomnet either. And that brings up another point you might want to test with your script: does F::F::Dups know enough to leave symlinks alone, or does it follow them when looking for dups? Luckily enough, it doesn't. F::F::Dups uses File::Find with somewhat default options, and in that regard the default is not to follow links. So the problem that you most correctly point out is not an issue here (but thanks for mentioning it because I hadn't considered it!).	[reply]
Re^3: Replace duplicate files with hardlinks by repellent (Priest) on Aug 11, 2008 at 04:04 UTC
I checked the source of the module, and it only reports real duplicates. Soft links are discarded by the -f file test, wich returns 0 if the "element" is a directory or a soft link. That's not quite how it works with softlinks. When you perform a file test on a softlink, think of it as performing the file test on the non-softlink target linked file (whether it be a plain file, directory, special file, etc.) The only file test that is applicable to the softlink inode itself is the `-l` operator. Purpose: to find out if it's a softlink. Hence, directories are discarded as a result of the `-f file` test, but not softlinks. You may be thinking that softlinks are discarded because they're pointing to directories, perhaps? However, I'm only now thinking about the posibility of creating soft links and the consequences that I might have. I hadn't considered the possibility of running the script in a non-unix envirnomnet either. I would recommend creating softlinks instead of hardlinks. It's more apparent. But then you need to decide which inode of the duplicates becomes the softlinks' target.	[reply] [d/l] [select]
Re^4: Replace duplicate files with hardlinks by bruno (Friar) on Aug 11, 2008 at 04:55 UTC
That's not quite how it works with softlinks. When you perform a file test on a softlink, think of it as performing the file test on the non-softlink target linked file (whether it be a plain file, directory, special file, etc.) The only file test that is applicable to the softlink inode itself is the -l operator. Purpose: to find out if it's a softlink. Oops, my bad! Thanks for clarifying. I got that false information from a web page (I may have misunderstood it). I'll update the code to check the inodes with -l and skip those that return true. I would recommend creating softlinks instead of hardlinks. It's more apparent. But then you need to decide which inode of the duplicates becomes the softlinks' target. Exactly. And since a priori I don't know where the original file would have to be, I'd rather just create a hard link. The problem with soft links is that later you could get into trouble if you delete the original file; then all the links that pointed to it would break. With hard links this is not an issue, since the actual information is not lost until the last link that points to it is removed. I like how creating hard links seamlessly reduces disk usage while (nearly) not changing anything else. The only drawback that I see is that if you change a file, all the linked ones also change, and there may be situations in which you would not want that. But this is true for soft links either. Maybe it would be nice to take an argument from the command line to let the user choose whether to create a soft link, a hard link, delete the duplicates or just report their existence.	[reply]
Re^3: Replace duplicate files with hardlinks by repellent (Priest) on Aug 11, 2008 at 16:28 UTC
You are right. At first I thought it was a bug in my script, but then I realized that, as there is no way of recognizing a hard link as such, repeated runs of the program on the same directory will report identical results. I'd like to add that you can distinguish hardlinks by inode numbers. When you have your group of duplicate-content files, hash them by inode numbers: `push @{ $hash{$inode} }, $path;` [download] When you have more than one key in the hash, decide which hardlink's inode you like and `link` the other paths to it. When you have only one key in the hash, you're done!	[reply] [d/l] [select]
Re^3: Replace duplicate files with hardlinks by bruno (Friar) on Aug 10, 2008 at 22:16 UTC
Again, me, sorry.	[reply]