Greetings, Monks!

This is my first post here, so please feel free to move it to another section if it does not belong here.

I am posting this little script to ask for your opinions on every aspect: design, layout, readability, speed, etc. It uses File::Find::Duplicates to find duplicate files recursively in a directory and, instead of just reporting or deleting them, it replaces them with hard links, so the disk space is freed but every file name remains. I wrote it to practice some of the things I'm trying to learn, but it turned out to be quite useful for my /home directory (I freed 2 GB!).

I thought that creating a hard link might be a better idea than deleting the file, as sometimes one wants a certain file to exist under a certain path.
It also helped me find severe redundancies in some "dot directories". For instance, in a couple of icon packages, ~30% of the files were duplicates with different names. In this case deleting them would have ruined the icon set, but creating hard links both freed space and kept the package functional.
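For anyone unsure why a hard link keeps a package working, here is a minimal sketch, assuming a POSIX-style filesystem. The file names are made up for illustration and live in a throwaway directory. It shows that two linked names share one inode, and that the data stays reachable through the second name after the first is removed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical file names inside a temporary directory that is
# removed automatically when the script exits.
my $dir  = tempdir( CLEANUP => 1 );
my $orig = "$dir/icon-a.png";
my $dupe = "$dir/icon-b.png";

open my $fh, '>', $orig or die "Cannot write $orig: $!";
print {$fh} "same content\n";
close $fh or die "Cannot close $orig: $!";

# Give the same data a second name.
link $orig, $dupe or die "Cannot link $orig to $dupe: $!";

# stat returns (dev, ino, mode, nlink, ...); two names for the same
# file report the same device and inode numbers.
my ( $dev_a, $ino_a, undef, $nlink ) = stat $orig;
my ( $dev_b, $ino_b ) = stat $dupe;

my $same_inode = ( $dev_a == $dev_b && $ino_a == $ino_b ) ? 1 : 0;
print "same inode: $same_inode, link count: $nlink\n";

# Removing one name leaves the data reachable through the other.
unlink $orig or die "Cannot unlink $orig: $!";
my $still_there = -s $dupe;
print "size via remaining name: $still_there bytes\n";
```

The space is only released once the last name pointing at the inode is gone, which is why replacing a duplicate with a link frees one copy without breaking any path.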

I was also pleasantly surprised that it is quite fast. I haven't benchmarked it (I haven't read the Benchmark documentation yet), but it is noticeably faster than, for example, the fdupes program that comes with Ubuntu (and probably other Debian-based distros).
Of course the credit for this goes entirely to Tony Bowden, the author of the module.

Here's the code for it:

    #!/usr/bin/perl -w
    use strict;
    use File::Find::Duplicates;
    use File::Temp ();

    my %stats = ( files_linked => 0, space_saved => 0 );

    local $" = "\n";

    # Read directory from command line, or default to current.
    my $directory = $ARGV[0] || ".";

    # Find duplicates recursively in that directory
    my @dupes = find_duplicate_files($directory);

    # For each set of duplicate files, create the hardlinks and save the
    # information in the stats hash
    foreach my $set (@dupes) {
        print $set->size, " bytes each:\n", "@{ $set->files }\n";
        my $original      = shift @{ $set->files };
        my $number_linked = fuse( $original, $set->files );
        $stats{files_linked} += $number_linked;
        $stats{space_saved}  += $number_linked * $set->size;
    }

    # Report the stats
    print "Files linked: $stats{files_linked}\n";
    print "Space saved: $stats{space_saved} bytes\n";

    sub fuse {

        # Replace duplicates with hard links and return the number
        # of links created.
        my $original     = shift;
        my $duplicates   = shift;
        my $files_linked = 0;    # start at 0 so the caller never adds undef

        foreach my $duplicate (@$duplicates) {

            # Step 1: link original to tempfile
            my $tempfile = File::Temp::tempnam( $directory, 'X' x 6 );
            link $original, $tempfile or next;

            # Step 2: move tempfile to duplicate
            unless ( rename $tempfile, $duplicate ) {
                unlink $tempfile;    # do not leave a stray link behind
                next;
            }

            # On POSIX systems rename does nothing when both names
            # already share an inode, so the tempfile may still exist.
            if ( -e $tempfile ) {
                unlink $tempfile
                    or die "Couldn't delete temporary file $tempfile: $!";
            }
            ++$files_linked;
        }
        return $files_linked;
    }

Update: Subroutine fuse() changed following betterworld's suggestion.

Update 2: Link filtering, soft link / remove support and documentation. Here
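For readers wondering what link filtering can involve, here is a minimal sketch. It is not the code from the update, and it assumes a POSIX-style filesystem where device and inode numbers identify a file. The helper name unique_inodes and the file names are hypothetical. The idea is to keep only one path per inode, so that names which are already hard links to each other are not linked and counted again.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: return the given paths with every path dropped
# whose (device, inode) pair was already seen earlier in the list.
sub unique_inodes {
    my @paths = @_;
    my ( %seen, @unique );
    for my $path (@paths) {
        my @st = stat $path;
        next unless @st;                  # skip paths that cannot be stat'ed
        my $key = "$st[0]:$st[1]";        # device number : inode number
        push @unique, $path unless $seen{$key}++;
    }
    return @unique;
}

# Demonstration in a throwaway directory: a and b share an inode,
# c is a separate file with identical content.
my $dir = tempdir( CLEANUP => 1 );
for my $name (qw(a c)) {
    open my $fh, '>', "$dir/$name" or die "Cannot write $dir/$name: $!";
    print {$fh} "duplicate data\n";
    close $fh or die "Cannot close $dir/$name: $!";
}
link "$dir/a", "$dir/b" or die "Cannot link: $!";

my @kept = unique_inodes( "$dir/a", "$dir/b", "$dir/c" );
print "$_\n" for @kept;
```

Applied to each set of duplicates before linking, a filter like this would keep the "Files linked" and "Space saved" figures from counting names that were already linked.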

In reply to Replace duplicate files with hardlinks by bruno
