lunchb0x has asked for the wisdom of the Perl Monks concerning the following question:

Dear Perl Monks:

I am trying to write a script to do the following:

1. Read a list of files into an array from a file
2. Compare each file in the array to one another using cmp
3. Print the results to a file.

This is my attempt so far...

    #!/usr/bin/perl -w

    #open file
    open(OUTFILE, ">/media/hda3/differences.txt") or die("Unable to open file");

    #read file into array
    @data = <FILE>;

    #close file
    close (FILE);

    #Used to test if data is read into array
    #print $data[2];

    #open file to save differences
    open(FILE, "/media/hda3/differences.txt");

    #loop to determine differences
    for ($i=0; $i<377; $i++)
    {
        for($j=$i+1; $j<377; $j++)
        {
            $command_line = "cmp -l /media/hda3/2967test/"$data[$i]" /media/hda3/2967test/"$data[$j]"| wc -l";
            $diff_count = system $command_line;
            print OUTFILE ($data[$i], "\t", $date[$j], "\t", $diff_count, "\n");
        }
    }

    #close file
    close(FILE);

These are the error msgs that I receive

    Scalar found where operator expected at ./differ.pl line 23, near ""cmp -l /media/hda3/2967test/"$data"
            (Missing operator before $data?)
    String found where operator expected at ./differ.pl line 23, near "]" /media/hda3/2967test/""
            (Missing operator before " /media/hda3/2967test/"?)
    Scalar found where operator expected at ./differ.pl line 23, near "" /media/hda3/2967test/"$data"
            (Missing operator before $data?)
    String found where operator expected at ./differ.pl line 23, near "]"| wc -l""
            (Missing operator before "| wc -l"?)
    syntax error at ./differ.pl line 23, near ""cmp -l /media/hda3/2967test/"$data"
    Execution of ./differ.pl aborted due to compilation errors.

I am having a hard time debugging this. Any ideas what I am doing wrong here? Is there a better way to do this? Thank you for your advice.

Sincere Seeker of Perl Wisdom,
Lunchb0x

Replies are listed 'Best First'.
Re: Script to Find Differences Between Files
by graff (Chancellor) on Jul 21, 2007 at 18:51 UTC
    As mentioned above, those errors come from this line (which happens to be line #23, in case you were wondering about that):
        $command_line = "cmp -l /media/hda3/2967test/"$data[$i]" /media/hda3/2967test/"$data[$j]"| wc -l";
    To be syntactically correct, that would have to be something like this (I'm adding a little more simplification):
        my $path = "/media/hda3/2967test";
        $command_line = "cmp -l $path/$data[$i] $path/$data[$j] | wc -l";
    (updated: added "$" to "j")

    But once you fix that, then you'll have a problem with this line:

    $diff_count = system $command_line;
    The "system()" function only returns the exit status (zero for success, non-zero for any sort of failure), not the command's output. You want to put $command_line inside back-ticks or use the "qx" operator like this:
        $diff_count = qx{ $command_line };
    You may then want to "chomp $diff_count", because it will include a final line-feed character at the end of the numeric string value.
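
    A minimal sketch of the difference between the two, assuming a Unix-like shell with "true", "printf" and "wc" available. The printf pipeline is only a hypothetical stand-in for the real "cmp ... | wc -l" command, so the example runs without any test files:

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real "cmp -l ... | wc -l" pipeline.
# q{} keeps the backslashes literal so the shell's printf sees "\n".
my $command_line = q{printf 'a\nb\nc\n' | wc -l};

# system() runs a command and returns its status (0 on success);
# whatever the command prints is not captured.
my $status = system('true');

# qx{} (equivalent to back-ticks) returns what the command printed.
my $diff_count = qx{ $command_line };
chomp $diff_count;           # drop the trailing line-feed
$diff_count =~ s/^\s+//;     # some wc implementations pad with spaces

print "exit status: $status, line count: $diff_count\n";
```

    The leading-whitespace strip is a precaution rather than something the original post requires; Perl's numeric conversion would ignore the padding anyway.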

    But depending on what you are really trying to accomplish, you might want to consider using Digest::MD5 (or the *n*x "md5sum" command): just get the md5 signature for each of the files; any two files that have the same signature (and are the same size) are very likely to have identical content.
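
    As a rough sketch of the signature idea, the helper below (the name file_signature is hypothetical) returns the MD5 hex digest and byte size of one file. It uses the object interface of Digest::MD5 with addfile(), which reads from an open handle instead of slurping the whole file into memory; that is a variation on the approach, not a requirement:

```perl
use strict;
use warnings;
use Digest::MD5;

# file_signature is a hypothetical helper: it returns "<md5-hex> <size>"
# for a file, or undef (with a warning) if the file cannot be opened.
sub file_signature {
    my ($file) = @_;
    open my $fh, '<', $file or do { warn "$file: $!\n"; return };
    binmode $fh;    # hash the raw bytes, no newline translation
    my $md5 = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $md5 . ' ' . ( -s $file );
}

# Print a signature for every file named on the command line.
for my $file (@ARGV) {
    my $sig = file_signature($file);
    print "$file: $sig\n" if defined $sig;
}
```

    Two files with different signatures cannot be identical; files with the same signature are only candidates and can still be confirmed with "cmp".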

    To be really careful about finding duplicate files, you may want to build a hash of arrays, with strings made up of "$md5sig $filesize" as the hash keys, and file names having identical sizes and md5s as the array elements stored at that key. Then you can run "cmp" on just the sets of files in a given array. That will reduce the amount of work to do (and the amount of printed output to review) by a lot.

    update: I forgot to mention that in the code you posted, you don't actually open the input FILE handle (or get the file name for input -- maybe you meant to use the "magic diamond" operator?). Anyway, here's how it might look, using the Digest::MD5 module:

        #!/usr/bin/perl
        use strict;
        use Digest::MD5 qw/md5_base64/;

        my @files = <>;  # read list of file names (from @ARGV or STDIN)
        chomp @files;

        # group file names by "md5 size": files sharing a key are candidate duplicates
        my %sigs;
        for my $file ( @files ) {
            local $/;  # use "slurp" input mode: whole file in 1 read
            open( I, "<", $file ) or do { warn "$file: $!\n"; next; };
            $_ = <I>;
            close I;
            my $siz = -s $file;
            my $sig = md5_base64( $_ );
            push @{ $sigs{"$sig $siz"} }, $file;
        }

        # now check for possible duplicate files
        my $path = "/media/hda3/2967test";
        for my $sig ( grep { @{$sigs{$_}} > 1 } keys %sigs ) {
            my @files = @{$sigs{$sig}};
            for my $i ( 1 .. $#files ) {
                for my $j ( 0 .. $i-1 ) {
                    # cmp prints nothing when the files match, so wc -l reports 0
                    my $diff = `cmp $path/$files[$i] $path/$files[$j] | wc -l`;
                    print "$files[$i] - $files[$j] are duplicates\n"
                        if ( $diff =~ /^\s*0/ );
                }
            }
        }
    (not tested, but updated to add some missing sigils and braces in the "push", "grep" and "my @files = " lines)
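
    If shelling out to "cmp" is not a requirement, the final check can also be done inside Perl. The sketch below swaps in the core File::Compare module, which is not what the script above uses; duplicate_pairs is a hypothetical helper that takes one bucket of candidate files (full paths assumed) and returns the pairs whose contents are identical:

```perl
use strict;
use warnings;
use File::Compare qw(compare);

# duplicate_pairs is a hypothetical helper: given a list of candidate
# file paths, return an array ref [first, second] for every pair of
# files whose contents are byte-for-byte identical.
sub duplicate_pairs {
    my @files = @_;
    my @pairs;
    for my $i ( 1 .. $#files ) {
        for my $j ( 0 .. $i - 1 ) {
            # compare() returns 0 if equal, 1 if different, -1 on error
            my $rc = compare( $files[$i], $files[$j] );
            push @pairs, [ $files[$j], $files[$i] ] if $rc == 0;
        }
    }
    return @pairs;
}

# Report duplicates among the files named on the command line.
print "$_->[0] - $_->[1] are duplicates\n" for duplicate_pairs(@ARGV);
```

    Because compare() distinguishes "different" (1) from "could not read" (-1), an unreadable file is not mistaken for a duplicate.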
Re: Script to Find Differences Between Files
by swampyankee (Parson) on Jul 21, 2007 at 18:26 UTC

    For the first one, it looks like you've got problems with your quotes; you either need to insert the concatenation operator (.) or omit the quotes before and after $data[$i]. Perl's quoting rules have double quotes (") causing variable expansion, and single quotes (') not. Take a look at this for edification.
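
    A small sketch of the three quoting behaviours described above; the file names in @data are hypothetical placeholders:

```perl
use strict;
use warnings;

my @data = ( 'a.txt', 'b.txt' );       # hypothetical file names
my $dir  = '/media/hda3/2967test';

# 1. Separate strings joined with the concatenation operator (.)
my $concat = "cmp -l " . $dir . "/" . $data[0] . " " . $dir . "/" . $data[1];

# 2. One double-quoted string: variables are expanded in place
my $interp = "cmp -l $dir/$data[0] $dir/$data[1]";

# 3. Single quotes: no expansion, the text is kept literally
my $literal = 'cmp -l $dir/$data[0]';

print "$concat\n$interp\n$literal\n";
```

    The first two forms build the same string; the third keeps the dollar signs as literal text, which is rarely what a command line needs.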


    emc

Re: Script to Find Differences Between Files
by jwkrahn (Abbot) on Jul 21, 2007 at 19:21 UTC
         1  #!/usr/bin/perl -w

    You should use the strict pragma as well:

            use strict;

         3  #open file
         4  open(OUTFILE, ">/media/hda3/differences.txt") or die("Unable to open file");
         5
         6  #read file into array
         7  @data = <FILE>;

    The filehandle FILE has not been opened yet, so @data will be filled with nothing.

         9  #close file

    All of the comments up to now have been superfluous.

        10  close (FILE);
        11
        12  #Used to test if data is read into array
        13  #print $data[2];

    How is that supposed to "test if data is read into array"?

        15  #open file to save differences
        16  open(FILE, "/media/hda3/differences.txt");

    You are opening the file too late to be able to read it on line 7. You should *always* verify that the file opened correctly:

            open FILE, '<', '/media/hda3/differences.txt' or die "Cannot open '/media/hda3/differences.txt' $!";

        18  #loop to determine differences
        19  for ($i=0; $i<377; $i++)
        20  {
        21      for($j=$i+1; $j<377; $j++)
        22      {

    Why 377? Why not just use the actual size of the array:

            for my $i ( 0 .. $#data ) {
                for my $j ( $i + 1 .. $#data ) {

        23          $command_line = "cmp -l /media/hda3/2967test/" $data[$i] " /media/hda3/2967test/" $data[$j] "| wc -l";

    There should be operators between the strings and the variables:

            $command_line = "cmp -l /media/hda3/2967test/" . $data[ $i ] . " /media/hda3/2967test/" . $data[ $j ] . " | wc -l";

    Or just use string interpolation:

            $command_line = "cmp -l /media/hda3/2967test/$data[$i] /media/hda3/2967test/$data[$j] | wc -l";

        24          $diff_count = system $command_line;

    system does not return what you seem to think it returns. You probably want something like:

            my $command_line = "cmp -l /media/hda3/2967test/$data[$i] /media/hda3/2967test/$data[$j]";
            my $diff_count = () = `$command_line`;

        25          print OUTFILE ($data[$i], "\t", $date[$j], "\t", $diff_count, "\n");

    The strict pragma would have caught your typo: there is no @date array defined.

        26      }
        27  }
        28
        29  #close file
        30  close(FILE);
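
    A minimal sketch of the two idioms suggested in this reply, using hypothetical in-memory data instead of the real file list. It assumes the names read from the list file still carry their newlines, so they are chomped first, as graff's script does for its own list:

```perl
use strict;
use warnings;

# Hypothetical input: lines as they would come from <FILE>, newline included.
my @data = ( "a.txt\n", "b.txt\n", "c.txt\n" );
chomp @data;    # strip the newlines before building any command line

# Visit every unordered pair exactly once, whatever the array size.
my @pairs;
for my $i ( 0 .. $#data ) {
    for my $j ( $i + 1 .. $#data ) {
        push @pairs, "$data[$i]\t$data[$j]";
    }
}
print "$_\n" for @pairs;

# "= () =" puts the right-hand side in list context, so the scalar on
# the left receives the number of items returned (here: newline matches).
my $line_count = () = "x\ny\nz\n" =~ /\n/g;
print "lines: $line_count\n";
```

    With back-ticks on the right-hand side, the same "= () =" assignment counts the lines the command printed, which is why the "| wc -l" stage is no longer needed.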