in reply to Re: Replace duplicate files with hardlinks
in thread Replace duplicate files with hardlinks

      With those details in mind, I suspect that if you run your script repeatedly in succession on the same path, it will find/rename/replace/delete the same set of duplicate files, more or less identically, on each run.

You are right. At first I thought it was a bug in my script, but then I realized that, as there is no way of recognizing a hard link as such, repeated runs of the program on the same directory will report identical results.

      I think the *n*x notion of "symlinks" ports to MS-Windows as "short-cuts", so this may be somewhat more portable, but you'd have to look at the sources for F::F::Dups to see whether it picks up on the difference between a data file and any sort of link.

I checked the source of the module, and it only reports real duplicates. Soft links are discarded by the -f file test, which returns 0 if the "element" is a directory or a soft link.

However, I'm only now thinking about the possibility of creating soft links and the consequences that it might have. I hadn't considered the possibility of running the script in a non-Unix environment either.

      And that brings up another point you might want to test with your script: does F::F::Dups know enough to leave symlinks alone, or does it follow them when looking for dups?

Luckily enough, it doesn't.
F::F::Dups uses File::Find with mostly default options, and by default File::Find does not follow links. So the problem that you quite rightly point out is not an issue here (but thanks for mentioning it, because I hadn't considered it!).
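
For what it's worth, here is a minimal sketch of that default (the starting directory is made up for the example): unless follow or follow_fast is enabled, find() will not descend into directories reached through symlinks.

    use File::Find;

    # With the default options, find() does not traverse symlinked
    # directories, so a symlinked copy of a tree is not scanned again.
    find(
        {
            wanted => sub { print "$File::Find::name\n" if -f },
            # follow      => 1,  # would make find() follow symlinks
            # follow_fast => 1,  # same, but may report some files twice
        },
        '/some/path',            # hypothetical starting directory
    );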

Re^3: Replace duplicate files with hardlinks
by repellent (Priest) on Aug 11, 2008 at 04:04 UTC
      I checked the source of the module, and it only reports real duplicates. Soft links are discarded by the -f file test, which returns 0 if the "element" is a directory or a soft link.

    That's not quite how it works with softlinks. When you perform a file test on a softlink, think of it as performing the file test on the target file that the softlink points to (whether it be a plain file, directory, special file, etc.).

    The only file test that is applicable to the softlink inode itself is the -l operator. Purpose: to find out if it's a softlink.

    Hence, directories are discarded as a result of the -f file test, but not softlinks. You may be thinking that softlinks are discarded because they're pointing to directories, perhaps?
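
    A quick way to see this, assuming a Unix-like system (the file names below are made up):

        use strict;
        use warnings;

        # create a plain file and a softlink that points to it
        open(my $fh, '>', 'real.txt') or die "open: $!";
        print {$fh} "some content\n";
        close $fh;
        symlink('real.txt', 'link.txt') or die "symlink: $!";

        # -f follows the softlink to its target; -l tests the link itself
        print 'plain file? ', (-f 'link.txt' ? "yes\n" : "no\n");   # yes
        print 'softlink?   ', (-l 'link.txt' ? "yes\n" : "no\n");   # yes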

      However, I'm only now thinking about the possibility of creating soft links and the consequences that it might have. I hadn't considered the possibility of running the script in a non-Unix environment either.

    I would recommend creating softlinks instead of hardlinks. It's more apparent. But then you need to decide which inode of the duplicates becomes the softlinks' target.
      That's not quite how it works with softlinks. When you perform a file test on a softlink, think of it as performing the file test on the target file that the softlink points to (whether it be a plain file, directory, special file, etc.).

      The only file test that is applicable to the softlink inode itself is the -l operator. Purpose: to find out if it's a softlink.

      Oops, my bad! Thanks for clarifying. I got that false information from a web page (or I may have misunderstood it). I'll update the code to test each path with -l and skip those for which it returns true.
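
      Presumably the fix will look something along these lines inside the wanted routine (just a sketch, not the actual patch; the array name is made up):

          my @candidates;            # real, plain files worth comparing

          sub wanted {
              return if -l $_;       # softlink: skip it; -l tests the link itself
              return unless -f _;    # not a plain file (reuses the lstat done by -l)
              push @candidates, $File::Find::name;
          }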

      I would recommend creating softlinks instead of hardlinks. It's more apparent. But then you need to decide which inode of the duplicates becomes the softlinks' target.

      Exactly. And since I don't know a priori which of the duplicates should act as the original, I'd rather just create a hard link. The problem with soft links is that you can get into trouble later if you delete the original file: all the links that pointed to it would break. With hard links this is not an issue, since the actual data is not lost until the last link that points to it is removed.
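
      That link-count behaviour is easy to verify with stat(), whose fourth field is the number of hard links (the file names here are invented):

          open(my $fh, '>', 'original.txt') or die "open: $!";
          print {$fh} "shared data\n";
          close $fh;

          link('original.txt', 'copy.txt') or die "link: $!";
          printf "nlink: %d\n", (stat 'copy.txt')[3];   # 2 - two names, one inode

          unlink 'original.txt';                        # drop the "original" name
          printf "nlink: %d\n", (stat 'copy.txt')[3];   # 1 - the data is still there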

      I like how creating hard links seamlessly reduces disk usage while changing (almost) nothing else. The only drawback that I see is that if you change a file, all the linked ones also change, and there may be situations in which you would not want that. But this is just as true for soft links.

      Maybe it would be nice to take an argument from the command line to let the user choose whether to create a soft link, a hard link, delete the duplicates or just report their existence.
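
      Something along these lines with Getopt::Long, perhaps (the option name and the handlers are only a sketch):

          use Getopt::Long;

          # --action: report (default), delete, symlink or hardlink
          my $action = 'report';
          GetOptions('action=s' => \$action)
              or die "usage: $0 --action report|delete|symlink|hardlink\n";

          my %handler = (
              report   => sub { print "duplicate: $_[0]\n" },
              delete   => sub { unlink $_[0] or warn "unlink $_[0]: $!" },
              symlink  => sub { unlink $_[0]; symlink($_[1], $_[0]) or warn "symlink: $!" },
              hardlink => sub { unlink $_[0]; link($_[1], $_[0])    or warn "link: $!" },
          );
          die "unknown action '$action'\n" unless $handler{$action};

          # later, for each duplicate $dup of a file $kept:
          # $handler{$action}->($dup, $kept);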

Re^3: Replace duplicate files with hardlinks
by repellent (Priest) on Aug 11, 2008 at 16:28 UTC
      You are right. At first I thought it was a bug in my script, but then I realized that, as there is no way of recognizing a hard link as such, repeated runs of the program on the same directory will report identical results.

    I'd like to add that you can distinguish hardlinks by inode numbers. When you have your group of duplicate-content files, hash them by inode numbers:
    push @{ $hash{$inode} }, $path;
    When you have more than one key in the hash, decide which hardlink's inode you like and link the other paths to it.

    When you have only one key in the hash, you're done!
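
    Fleshing that out a little, assuming the duplicate-content paths of one set are already collected in @duplicate_paths (the variable names are only for the sketch):

        # group the paths of one duplicate set by inode number
        my %by_inode;
        for my $path (@duplicate_paths) {
            my $inode = (stat $path)[1];          # inode is field 1 of stat()
            push @{ $by_inode{$inode} }, $path;
        }

        if (keys %by_inode > 1) {
            # keep the first inode; relink every path from the other groups
            my ($keep_inode, @other_inodes) = sort { $a <=> $b } keys %by_inode;
            my $keep = $by_inode{$keep_inode}[0];
            for my $path (map { @{ $by_inode{$_} } } @other_inodes) {
                unlink $path      or do { warn "unlink $path: $!"; next };
                link $keep, $path or warn "link $keep -> $path: $!";
            }
        }
        # with a single key, all the paths already share one inode: done
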
Re^3: Replace duplicate files with hardlinks
by bruno (Friar) on Aug 10, 2008 at 22:16 UTC
    Again, me, sorry.