in reply to Replace duplicate files with hardlinks

Monks not familiar with "hard links" would need to understand the following details:

Once a hard link is created, you can't really identify it as such (i.e. as anything other than a plain data file). You can figure out when a given file has more than one directory entry describing/pointing to it (checking the link count shown by "ls -l"), and you can figure out which directory entries point to the same file (checking for matching inode numbers with "ls -i"), but all entries have "equal status" -- the original directory entry is simply equivalent to (i.e. one of) the hard links.

With those details in mind, I suspect that if you run your script repeatedly in succession on the same path, it will find/rename/replace/delete the same set of duplicate files, more or less identically, on each run.

There's nothing in the File::Find::Duplicates man page about how it determines files to be duplicates, and there is no reason to expect that it knows or cares about existing hard links (since these are not mentioned in the docs, and are OS-dependent anyway). So, existing hard links will probably look like duplicates, and will be (re)replaced on every run.

For that matter, I wonder what that module would do if you were to replace duplicate files with symbolic links instead of hard ones. I think the *n*x notion of "symlinks" ports to MS-Windows as "short-cuts", so this may be somewhat more portable, but you'd have to look at the sources for F::F::Dups to see whether it picks up on the difference between a data file and any sort of link.

In any case, I tend to prefer symlinks anyway -- there's less confusion when it comes to figuring out actual vs. apparent disk space usage.

And that brings up another point you might want to test with your script: does F::F::Dups know enough to leave symlinks alone, or does it follow them when looking for dups? If the latter, you can get into various kinds of trouble, like trying to create hard links to files on different volumes (won't work) or even deleting the target of a symlink while leaving the symlink itself as the "unique version" -- which then becomes a stale link with no existing data file as the target. Note that a symlink can have a directory as its target (as well as files/directories on different disks), so if your script runs on a tree like this:

    toplevel/
        secondlevel_1/
            thirdlevel_1/
            thirdlevel_2/
                file1.dat
                file2.dat
        secondlevel_2 -> secondlevel_1/thirdlevel_2    # directory symlink
will there be an apparent duplication of file1.dat and file2.dat under two different paths? If so, what is the likelihood that your script will have (or cause) some trouble?
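
(As an aside, the stale-symlink case mentioned above is easy to test for in Perl; a minimal sketch, checking whatever paths you hand it on the command line:)

    use strict;
    use warnings;

    for my $path (@ARGV) {
        # -l looks at the link itself, while -e follows it to the target,
        # so this combination spots a symlink whose target has been deleted
        if ( -l $path and not -e $path ) {
            warn "$path is a dangling symlink\n";
        }
    }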

** FOOTNOTE (UPDATE) ** Please note the very informative reply provided below by MidLifeXis. As he points out, my references to "ls -l" and "ls -i" should not be taken as implementation ideas for detecting hard links in a perl script. I mentioned these uses of "ls" merely to cite the easiest way for a person to look into the behaviors of hard links.

Re^2: Replace duplicate files with hardlinks
by MidLifeXis (Monsignor) on Aug 11, 2008 at 16:55 UTC
    Once a hard link is created, you can't really identify it as such (i.e. as anything other than a plain data file). You can figure out when a given file has more than one directory entry describing/pointing to it (checking the link count shown by "ls -l"), and you can figure out which directory entries point to the same file (checking for matching inode numbers with "ls -i"), but all entries have "equal status" -- the original directory entry is simply equivalent to (i.e. one of) the hard links. [emphasis added]

    I have a feeling that we are speaking to different facets of the problem at hand, but when I read your response, it says to me that the program will have a hard time identifying that a file is a hard link. I would make it clear that the program as written would have a hard time identifying the duplicates.

    The application could postprocess the F::F::D output and remove those files that are already hard-linked by using Perl's stat builtin. Given the device + inode + hash, you have a hardlink check.
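
    A minimal sketch of that check (the sub name is just illustrative):

        use strict;
        use warnings;

        # True if the two paths already refer to the same underlying file,
        # i.e. they are hard links to one inode on one device.
        sub already_linked {
            my ($path_a, $path_b) = @_;
            my @a = stat $path_a or return 0;
            my @b = stat $path_b or return 0;
            return $a[0] == $b[0] && $a[1] == $b[1];   # device and inode match
        }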

    I just had the impression, even if it was not intended, that a reader of this response could come away with the feeling that you needed to shell out to ls to determine whether one file was a hardlink of another.

    If you are interested in more detail on the hardlink stuff and how the underlying file system can implement them, see:

    *My college reference books on this topic are at home, the revisions have changed (as well as the covers), and my memory is, umm, rusty :). So beware, these books may not be the ones I am thinking of.

    --MidLifeXis

Re^2: Replace duplicate files with hardlinks
by Anonymous Monk on Aug 10, 2008 at 22:09 UTC
    With those details in mind, I suspect that if you run your script repeatedly in succession on the same path, it will find/rename/replace/delete the same set of duplicate files, more or less identically, on each run.

    You are right. At first I thought it was a bug in my script, but then I realized that, as there is no way of recognizing a hard link as such, repeated runs of the program on the same directory will report identical results.

    I think the *n*x notion of "symlinks" ports to MS-Windows as "short-cuts", so this may be somewhat more portable, but you'd have to look at the sources for F::F::Dups to see whether it picks up on the difference between a data file and any sort of link.

    I checked the source of the module, and it only reports real duplicates. Soft links are discarded by the -f file test, which returns 0 if the "element" is a directory or a soft link.

    However, I'm only now thinking about the possibility of creating soft links and the consequences that it might have. I hadn't considered the possibility of running the script in a non-unix environment either.

    And that brings up another point you might want to test with your script: does F::F::Dups know enough to leave symlinks alone, or does it follow them when looking for dups?

    Luckily enough, it doesn't.
    F::F::Dups uses File::Find with mostly default options, and by default File::Find does not follow links. So the problem that you quite correctly point out is not an issue here (but thanks for mentioning it, because I hadn't considered it!).
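
    For reference, the relevant knob is File::Find's follow option; a minimal sketch using the example tree from above (the wanted callback is just a placeholder):

        use strict;
        use warnings;
        use File::Find;

        sub wanted { print "$File::Find::name\n" if -f }

        # Default behaviour: symbolic links are not followed, so a
        # symlinked directory is not descended into.
        find(\&wanted, 'toplevel');

        # With follow => 1, File::Find does follow symlinks (and keeps
        # track of visited files), so files reachable through a directory
        # symlink would show up under both paths.
        find({ wanted => \&wanted, follow => 1 }, 'toplevel');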

        I checked the source of the module, and it only reports real duplicates. Soft links are discarded by the -f file test, which returns 0 if the "element" is a directory or a soft link.

      That's not quite how it works with softlinks. When you perform a file test on a softlink, think of it as performing the file test on the target the softlink ultimately points to (whether it be a plain file, directory, special file, etc.).

      The only file test that is applicable to the softlink inode itself is the -l operator. Purpose: to find out if it's a softlink.

      Hence, directories are discarded as a result of the -f file test, but not softlinks. You may be thinking that softlinks are discarded because they're pointing to directories, perhaps?
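
      A quick way to see the difference (file names here are just for illustration):

          use strict;
          use warnings;

          # create a plain file and a softlink pointing at it
          open my $fh, '>', 'target.dat' or die "open: $!";
          close $fh;
          symlink 'target.dat', 'alias.dat' or die "symlink: $!";

          print "plain file? ", ( -f 'alias.dat'  ? "yes" : "no" ), "\n"; # yes: -f tests the target
          print "softlink?   ", ( -l 'alias.dat'  ? "yes" : "no" ), "\n"; # yes: -l tests the link itself
          print "softlink?   ", ( -l 'target.dat' ? "yes" : "no" ), "\n"; # no: it's a plain file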

        However, I'm only now thinking about the possibility of creating soft links and the consequences that it might have. I hadn't considered the possibility of running the script in a non-unix environment either.

      I would recommend creating softlinks instead of hardlinks. It's more apparent that the file is a link. But then you need to decide which inode of the duplicates becomes the softlinks' target.
        That's not quite how it works with softlinks. When you perform a file test on a softlink, think of it as performing the file test on the target the softlink ultimately points to (whether it be a plain file, directory, special file, etc.).

        The only file test that is applicable to the softlink inode itself is the -l operator. Purpose: to find out if it's a softlink.

        Oops, my bad! Thanks for clarifying. I got that false information from a web page (I may have misunderstood it). I'll update the code to check each candidate with -l and skip those that return true.

        I would recommend creating softlinks instead of hardlinks. It's more apparent that the file is a link. But then you need to decide which inode of the duplicates becomes the softlinks' target.

        Exactly. And since a priori I don't know where the original file would have to be, I'd rather just create a hard link. The problem with soft links is that later you could get into trouble if you delete the original file; then all the links that pointed to it would break. With hard links this is not an issue, since the actual information is not lost until the last link that points to it is removed.

        I like how creating hard links seamlessly reduces disk usage while changing (nearly) nothing else. The only drawback that I see is that if you change a file, all the linked ones also change, and there may be situations in which you would not want that. But this is true for soft links as well.

        Maybe it would be nice to take an argument from the command line to let the user choose whether to create a soft link, a hard link, delete the duplicates or just report their existence.
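
        That could be as simple as a Getopt::Long switch; a rough sketch with a hypothetical --action option (the names here are illustrative, not from the original script):

            use strict;
            use warnings;
            use Getopt::Long;

            # hypothetical switch: hard (default), soft, delete, or report
            my $action = 'hard';
            GetOptions( 'action=s' => \$action )
                or die "usage: $0 --action hard|soft|delete|report\n";

            # $keeper is the copy left in place, @extras are its duplicates
            sub handle_duplicates {
                my ( $keeper, @extras ) = @_;
                for my $dup (@extras) {
                    if ( $action eq 'report' ) {
                        print "$dup duplicates $keeper\n";
                    }
                    elsif ( unlink $dup ) {
                        # note: for 'soft', $keeper should be absolute (or relative to $dup's directory)
                        if    ( $action eq 'soft' ) { symlink $keeper, $dup or warn "symlink: $!\n" }
                        elsif ( $action eq 'hard' ) { link    $keeper, $dup or warn "link: $!\n" }
                        # 'delete': nothing more to do once the duplicate is unlinked
                    }
                    else {
                        warn "unlink $dup: $!\n";
                    }
                }
            }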

        You are right. At first I thought it was a bug in my script, but then I realized that, as there is no way of recognizing a hard link as such, repeated runs of the program on the same directory will report identical results.

      I'd like to add that you can distinguish hardlinks by inode numbers. When you have your group of duplicate-content files, hash them by inode numbers:
      push @{ $hash{$inode} }, $path;
      When you have more than one key in the hash, decide which hardlink's inode you like and link the other paths to it.

      When you have only one key in the hash, you're done!
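
      Roughly like this, assuming %dupes maps a content hash to the list of paths found for it (the structure and names are illustrative):

          use strict;
          use warnings;

          my %dupes;   # content-hash => [ paths ], e.g. filled from the F::F::D results

          for my $paths ( values %dupes ) {
              # group the duplicate paths by device:inode
              my %by_inode;
              for my $path (@$paths) {
                  my @st = stat $path or next;
                  push @{ $by_inode{"$st[0]:$st[1]"} }, $path;
              }

              next if keys %by_inode == 1;   # a single inode: already all hard-linked

              # pick one path as the keeper (arbitrarily here) and re-link the rest to it
              my ( $keep, @others ) = map { @$_ } values %by_inode;
              for my $path (@others) {
                  if ( unlink $path ) {
                      link $keep, $path or warn "link $keep -> $path: $!\n";
                  }
                  else {
                      warn "unlink $path: $!\n";
                  }
              }
          }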
      Again, me, sorry.