in reply to Find duplicate files with exact same files noted

    my @file_list;
    sub files_wanted {
      my $text = $File::Find::name;
      if ( -f ) {
        push @file_list, $text;
      }
    }
    #If you set a base directory above, you will need to change $directory to $base_directory.$directory.
    find(\&files_wanted,$directory);

    #This section creates a hash of arrays of files, with the hash keys being filename.ext and the file size in
    #parentheses. The raw file name is the entire path including the file name.
    my %files;
    for my $raw_file (@file_list) {
      my @file_parts = split(/\//,$raw_file);
      my $file = pop @file_parts;
      my $file_size = -s $raw_file;
      push @{$files{"$file ($file_size bytes)"}}, $raw_file;
    }

Why traverse the directory tree twice (and stat each file twice) when you only have to traverse it once?

    my %files;
    find sub {
      if ( -f ) {
        push @{ $files{ "$_ (" . ( -s _ ) . " bytes)" } }, $File::Find::name;
      }
    }, $directory;

Re^2: Find duplicate files with exact same files noted
by Lady_Aleena (Priest) on Aug 17, 2010 at 19:18 UTC

    Wow! I didn't realize that I was traversing the tree twice until you said something. Maybe that is why it took a little while to run. I didn't use your exact suggestion, but I did merge the two pieces into one.

    This ...

        my @file_list;
        sub files_wanted {
          my $text = $File::Find::name;
          if ( -f ) {
            push @file_list, $text;
          }
        }
        find(\&files_wanted,$directory);

        my %files;
        for my $raw_file (@file_list) {
          my @file_parts = split(/\//,$raw_file);
          my $file = pop @file_parts;
          my $file_size = -s $raw_file;
          push @{$files{"$file ($file_size bytes)"}}, $raw_file;
        }

    .. is now this ...

        my %files;
        sub files_wanted {
          my $raw_file = $File::Find::name;
          if ( -f ) {
            my ($volume,$directories,$file) = File::Spec->splitpath($raw_file); #update from a prior suggestion.
            my $file_size = -s $raw_file;
            push @{$files{"$file ($file_size bytes)"}}, $raw_file;
          }
        }
        find(\&files_wanted,$directory);

    With the double traversal of the directory tree removed, the script now runs a little faster. Thanks for showing me what I was really doing!

    Have a cookie and a very nice day!
    Lady Aleena
          my %files;
          sub files_wanted {
            my $raw_file = $File::Find::name;
            if ( -f ) {
              my ($volume,$directories,$file) = File::Spec->splitpath($raw_file); #update from a prior suggestion.
              my $file_size = -s $raw_file;
              push @{$files{"$file ($file_size bytes)"}}, $raw_file;
            }
          }

      While you are in the "wanted" subroutine that File::Find::find runs, the full path is in the $File::Find::name variable and the file name only is in the $_ variable, so there is no need to use File::Spec->splitpath() to do something that File::Find::find has already done for you. Also, you are still using stat on the same file twice (once for -f and once for -s) when it would be more efficient to do it only once.
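
Both points can be sketched roughly as follows. This assumes File::Find's default behaviour (no no_chdir option, so $_ holds the bare file name inside the callback). The helper name index_by_name_and_size is made up for illustration and is not part of the original script. The -f test stats the file once, and -s _ reuses that result through the special _ filehandle instead of calling stat again.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper (name invented for this sketch): walks $directory once
# and returns a hash ref mapping "name (size bytes)" to a list of full paths.
sub index_by_name_and_size {
    my ($directory) = @_;
    my %files;

    find(sub {
        # $_ is the bare file name; $File::Find::name is the full path.
        return unless -f;        # the only stat call for this entry
        my $file_size = -s _;    # "_" reuses the stat data gathered by -f
        push @{ $files{"$_ ($file_size bytes)"} }, $File::Find::name;
    }, $directory);

    return \%files;
}
```

Calling index_by_name_and_size($directory) and keeping only the keys whose list holds more than one path should give the same candidate duplicates as the version above, without splitpath and without the second stat.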