Re: Script to Find Differences Between Files

As mentioned above, those errors come from this line (which happens to be line #23, in case you were wondering about that):

$command_line = "cmp -l /media/hda3/2967test/"$data[$i]" /media/hda3/2967test/"$data[$j]"| wc -l";

To be syntactically correct, that would have to be something like this (I'm adding a little more simplification):

my $path = "/media/hda3/2967test";
$command_line = "cmp -l $path/$data[$i] $path/$data[$j] | wc -l";

(updated: added "$" to "j")

But once you fix that, then you'll have a problem with this line:

$diff_count = system $command_line;

The "system()" function returns only the command's exit status (zero for success, non-zero for any sort of failure), not its output. To capture the output, put $command_line inside back-ticks or use the "qx" operator like this: $diff_count = qx{ $command_line }; (and then you may want to "chomp $diff_count", because it will include a final line-feed character at the end of the numeric string value).
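To make the difference concrete, here is a minimal sketch that runs the same pipeline both ways. The printf pipeline is only a stand-in for the real "cmp ... | wc -l" command, and it assumes a Unix-like shell:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in command (an assumption for illustration): prints the number 3
my $command_line = q{printf 'a\nb\nc\n' | wc -l};

# system() runs the command and returns its status; the output is not captured
my $status = system("$command_line > /dev/null");

# qx{} (same as back-ticks) returns whatever the command printed
my $diff_count = qx{ $command_line };
chomp $diff_count;           # drop the trailing line-feed
$diff_count =~ s/^\s+//;     # some "wc" versions pad the number with spaces

print "status=$status count=$diff_count\n";
```

Here $status is what "system" hands back, while $diff_count holds the line count you were actually after.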

But depending on what you are really trying to accomplish, you might want to consider using Digest::MD5 (or the *n*x "md5sum" command): just get the md5 signature for each of the files; any two files that have the same signature (and are the same size) are very likely to have identical content.
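As a rough sketch of getting the signature for one file, Digest::MD5 can read straight from a file handle with its "addfile" method. The "file_md5" helper name is made up for this example, and the two file names are assumed to come from the command line:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5;

# Hypothetical helper: return the hex MD5 signature of one file
sub file_md5 {
    my ($file) = @_;
    open( my $fh, '<', $file ) or die "$file: $!\n";
    binmode $fh;    # hash the raw bytes
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $digest;
}

# Assumed usage: script.pl file1 file2
if ( @ARGV == 2 ) {
    my ( $first, $second ) = @ARGV;
    my $same = file_md5($first) eq file_md5($second);
    print $same ? "signatures match\n" : "signatures differ\n";
}
```

Matching signatures only say the files are very likely identical; running "cmp" on them afterwards settles it.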

To be really careful about finding duplicate files, you may want to build a hash of arrays, with strings made up of "$md5sig $filesize" as the hash keys, and file names having identical sizes and md5s as the array elements stored at that key. Then you can run "cmp" on just the sets of files in a given array. That will reduce the amount of work to do (and the amount of printed output to review) by a lot.

update: I forgot to mention that in the code you posted, you don't actually open the input FILE handle (or get the file name for input). Maybe you meant to use the "magic diamond" operator? Anyway, here's how it might look, using the Digest::MD5 module:

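#!/usr/bin/perl

use strict;
use warnings;
use Digest::MD5 qw/md5_base64/;

# directory holding the files; the names read below are taken relative to it
my $path = "/media/hda3/2967test";

my @files = <>;  # read list of file names (from files named in @ARGV, or STDIN)
chomp @files;

# hash of arrays: "md5 size" => [ names of files sharing that md5 and size ]
my %sigs;
for my $file ( @files ) {
    local $/;  # use "slurp" input mode: whole file in 1 read
    open( my $in, "<", "$path/$file" ) or do {
        warn "$path/$file: $!\n";
        next;
    };
    binmode $in;  # read raw bytes, so binary files hash correctly
    my $data = <$in>;
    close $in;
    my $siz = -s "$path/$file";
    my $sig = md5_base64( $data );
    push @{ $sigs{"$sig $siz"} }, $file;
}

# now check for possible duplicate files
for my $sig ( grep { @{$sigs{$_}} > 1 } keys %sigs ) {
    my @files = @{$sigs{$sig}};
    for my $i ( 1 .. $#files ) {
        for my $j ( 0 .. $i-1 ) {
            # plain "cmp" prints one line if the files differ, nothing if identical
            # (note: file names go into the shell command unquoted)
            my $diff = `cmp $path/$files[$i] $path/$files[$j] | wc -l`;
            print "$files[$i] - $files[$j] are duplicates\n"
               if ( $diff =~ /^\s*0/ );
        }
    }
}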

(not tested, but updated to add some missing sigils and braces in the "push", "grep" and "my @files = " lines)
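If you would rather not shell out to "cmp" for the final check, the core File::Compare module can do the byte-for-byte comparison inside Perl. This is a swapped-in alternative, not what the code above does, and the "same_content" helper name is made up for the sketch:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Compare;

# Hypothetical helper: 1 if identical, 0 if different, undef if unreadable
sub same_content {
    my ( $first, $second ) = @_;
    my $result = compare( $first, $second );  # 0 = equal, 1 = different, -1 = error
    if ( $result < 0 ) {
        warn "cannot compare $first and $second: $!\n";
        return;
    }
    return $result == 0 ? 1 : 0;
}

# Assumed usage: script.pl file1 file2
if ( @ARGV == 2 ) {
    my $message = same_content(@ARGV) ? "duplicates\n" : "not duplicates (or unreadable)\n";
    print $message;
}
```

This also sidesteps trouble with file names that contain spaces or shell metacharacters, since nothing is passed through a shell.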
