
As mentioned above, those errors come from this line (which happens to be line #23, in case you were wondering about that):

$command_line = "cmp -l /media/hda3/2967test/"$data[$i]" /media/hda3/2967test/"$data[$j]"| wc -l";

To be syntactically correct, that would have to be something like this (I'm adding a little more simplification):

my $path = "/media/hda3/2967test";
$command_line = "cmp -l $path/$data[$i] $path/$data[$j] | wc -l";

(updated: added "$" to "j")

But once you fix that, then you'll have a problem with this line:

$diff_count = system $command_line;

The "system()" function only returns the exit status (zero for success, non-zero for any sort of failure); the command's output just goes to STDOUT and is not captured. You want to put $command_line inside back-ticks or use the "qx" operator like this: $diff_count = qx{ $command_line }; (and then you may want to "chomp $diff_count", because it will include a final line-feed character at the end of the numeric string value).
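Here's a minimal sketch of the difference. The printf pipeline is just a stand-in I made up so the snippet runs on any box with a POSIX shell; put your real "cmp -l ... | wc -l" string in its place:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in command (not from the original script):
# it prints three lines and counts them with wc.
my $command_line = q{printf 'a\nb\nc\n' | wc -l};

# system() lets the command print straight to STDOUT
# and hands back only the exit status.
my $status = system $command_line;

# qx{} (same thing as back-ticks) captures what the command printed.
my $diff_count = qx{ $command_line };
chomp $diff_count;          # drop the trailing line-feed
$diff_count =~ s/^\s+//;    # some "wc" versions pad with leading spaces

print "system returned: $status\n";
print "qx captured: $diff_count\n";
```

The status that system() returns says nothing about how many lines "wc" counted, which is why the count has to come from qx.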

But depending on what you are really trying to accomplish, you might want to consider using Digest::MD5 (or the *n*x "md5sum" command): just get the md5 signature for each of the files; any two files that have the same signature (and are the same size) are very likely to have identical content.
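For instance, getting the signature of one file could look roughly like this. It's only a sketch: "file_md5" is a helper name I made up (it's not part of Digest::MD5), and the temp files are there just so the snippet can be run as-is:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5;
use File::Temp qw/tempfile/;

# file_md5() is a hypothetical helper, not something Digest::MD5 exports.
sub file_md5 {
    my ($file) = @_;
    open( my $fh, "<", $file ) or die "$file: $!\n";
    binmode $fh;    # read raw bytes, no newline translation
    # addfile() reads the handle to EOF; hexdigest() gives 32 hex chars
    my $sig = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $sig;
}

# Demo data: two throw-away files with the same content, one that differs
my @names;
for my $content ( "same\n", "same\n", "other\n" ) {
    my ( $fh, $name ) = tempfile( UNLINK => 1 );
    print $fh $content;
    close $fh;
    push @names, $name;
}

for my $name (@names) {
    printf "%s  %d bytes  %s\n", file_md5($name), ( -s $name ), $name;
}
```

The first two lines of output should show the same signature and size, and the third a different signature.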

To be really careful about finding duplicate files, you may want to build a hash of arrays, with strings made up of "$md5sig $filesize" as the hash keys, and file names having identical sizes and md5s as the array elements stored at that key. Then you can run "cmp" on just the sets of files in a given array. That will reduce the amount of work to do (and the amount of printed output to review) by a lot.

update: I forgot to mention that in the code you posted, you don't actually open the input FILE handle (or get the file name for input; maybe you meant to use the "magic diamond" operator?). Anyway, here's how it might look, using the Digest::MD5 module:

#!/usr/bin/perl

use strict;
use warnings;
use Digest::MD5 qw/md5_base64/;

my @files = <>;  # read list of file names (from files named in @ARGV, or STDIN)
chomp @files;

my %sigs;  # key: "$md5sig $filesize", value: array of file names
for my $file ( @files ) {
    local $/;  # use "slurp" input mode: whole file in 1 read
    open( my $in, "<", $file ) or do {
        warn "$file: $!\n";
        next;
    };
    binmode $in;  # read raw bytes
    my $content = <$in>;
    close $in;
    my $siz = -s $file;
    my $sig = md5_base64( $content );
    push @{$sigs{"$sig $siz"}}, $file;
}

# now check for possible duplicate files
# (open() above uses the names as given, so this assumes the
#  script is run from inside $path)
my $path = "/media/hda3/2967test";
for my $sig ( grep { @{$sigs{$_}} > 1 } keys %sigs ) {
    my @files = @{$sigs{$sig}};
    for my $i ( 1 .. $#files ) {
        for my $j ( 0 .. $i-1 ) {
            # "cmp" prints one line if the files differ, nothing if they match
            my $diff = `cmp $path/$files[$i] $path/$files[$j] | wc -l`;
            print "$files[$i] - $files[$j] are duplicates\n"
               if ( $diff =~ /^\s*0/ );
        }
    }
}

(not tested, but updated to add some missing sigils and braces in the "push", "grep" and "my @files = " lines)
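If you'd rather not shell out to "cmp" at all, the core File::Compare module can do the final byte-for-byte check. That's a swapped-in alternative, not what the code above does. A rough sketch, where "same_content" is a helper name I made up and the temp files are just demo data:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Compare;              # core module; exports compare()
use File::Temp qw/tempfile/;

# same_content() is a hypothetical helper, not part of File::Compare.
sub same_content {
    my ( $file_a, $file_b ) = @_;
    # compare() returns 0 if equal, 1 if different, -1 on error
    my $result = compare( $file_a, $file_b );
    warn "could not compare $file_a and $file_b: $!\n" if $result < 0;
    return $result == 0;
}

# Demo data: two identical throw-away files and one that differs
my @names;
for my $content ( "same\n", "same\n", "other\n" ) {
    my ( $fh, $name ) = tempfile( UNLINK => 1 );
    print $fh $content;
    close $fh;
    push @names, $name;
}

print "0 and 1 are duplicates\n" if same_content( @names[ 0, 1 ] );
print "0 and 2 are duplicates\n" if same_content( @names[ 0, 2 ] );
```

This also sidesteps trouble with spaces or shell metacharacters in file names, since nothing is passed through a shell.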

In reply to Re: Script to Find Differences Between Files by graff
in thread Script to Find Differences Between Files by lunchb0x
