quick and safe way to deal with this?

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: quick and safe way to deal with this? by graff (Chancellor) on Aug 25, 2007 at 00:58 UTC
"I tried using diff but it didn't work"

Are you talking about the unix/gnu diff tool? How did you use it, and in what sense did it not work? Given two files, diff will report what the differences are, and will return an exit status of 1 if the files differ, or 0 (="success") if they are identical.

If you don't want to see all the lines that differ, you can use the "cmp" command instead; it gives the same exit status as diff does, but will otherwise be fairly quiet, and will read both of its input files in full only when they happen to be identical (i.e. it stops as soon as it sees a difference). Examples:

`cmp file1 file2 && echo same`
`cmp file1 file2 || echo different`

But as someone else pointed out, if you have lots of files and you need to compare all possible pairings of files, you'll be better off computing md5 checksums for them (try the "md5sum" command or roll your own in perl with Digest::MD5), then sort by checksum string and look for duplicate checksums. (Identical checksums are not a guarantee of identical content, but different checksums definitely indicate different content. You only need to run cmp on pairs that have the same checksum, to confirm whether they are really identical.)
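For what it's worth, the checksum-then-confirm idea could be sketched in Perl roughly like this. It is only a sketch: the sub names `md5_of` and `duplicate_groups` are made up for illustration, and the byte-for-byte confirmation uses the core File::Compare module in place of an external cmp call.

```perl
use strict;
use warnings;
use Digest::MD5;
use File::Compare qw(compare);

# Return the hex MD5 digest of a file's contents.
# (md5_of is a hypothetical helper name, not part of any module.)
sub md5_of {
    my ($path) = @_;
    open my $fh, '<', $path or die "Can't read $path: $!";
    binmode $fh;
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $digest;
}

# Group paths by checksum, then confirm each candidate byte-for-byte.
# Returns a list of array refs, each holding paths with identical content.
sub duplicate_groups {
    my @paths = @_;
    my %by_sum;
    push @{ $by_sum{ md5_of($_) } }, $_ for @paths;

    my @groups;
    for my $sum ( sort keys %by_sum ) {
        my @candidates = @{ $by_sum{$sum} };
        next if @candidates < 2;    # unique checksum, so unique content

        # Same checksum is not proof, so split into truly identical sets.
        my @sets;
        CANDIDATE: for my $path (@candidates) {
            for my $set (@sets) {
                # compare() returns 0 when the two files are equal.
                if ( compare( $set->[0], $path ) == 0 ) {
                    push @$set, $path;
                    next CANDIDATE;
                }
            }
            push @sets, [$path];
        }
        push @groups, grep { @$_ > 1 } @sets;
    }
    return @groups;
}

# Usage: perl dupes.pl file1 file2 file3 ...
print join( ' ', @$_ ), "\n" for duplicate_groups(@ARGV);
```

Each output line lists one set of files that turned out to be identical. Every file is read once for the checksum, and only the files that share a checksum are read again for confirmation.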
Re: quick and safe way to deal with this? by BrowserUk (Patriarch) on Aug 25, 2007 at 00:34 UTC
"I tried using diff but it didn't work"

Strange, your copy must be broken, because this is exactly what diff does. And it will be far faster than any Perl solution. Or maybe you just used it wrong?
Re: quick and safe way to deal with this? by thezip (Vicar) on Aug 24, 2007 at 22:24 UTC
Just checking the file sizes will tell the two files apart a lot of the time (different sizes mean different content; equal sizes prove nothing either way):

`#!/perl/bin/perl -w`
`use strict;`
`my $file1 = "2006.txt";`
`my $file2 = "2007.txt";`
`my $file1size = -s $file1;`
`my $file2size = -s $file2;`
`print "The files have ", $file1size == $file2size ? "the same" : "different", " size.\n\n";`

In what way did diff not work? Diff should be able to do exactly what you're trying to do...
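The size check can also serve as a cheap first step before a full comparison. Here is a minimal sketch of that, assuming the core File::Compare module for the byte-for-byte pass; the sub name `files_identical` is invented for the example.

```perl
use strict;
use warnings;
use File::Compare qw(compare);

# Return 1 if the two files have identical content, 0 otherwise.
# (files_identical is a hypothetical name used for illustration.)
sub files_identical {
    my ( $file1, $file2 ) = @_;

    # Element 7 of stat's return list is the size in bytes.
    my $size1 = ( stat $file1 )[7];
    die "Can't stat $file1: $!" unless defined $size1;
    my $size2 = ( stat $file2 )[7];
    die "Can't stat $file2: $!" unless defined $size2;

    # Different sizes: no need to read either file.
    return 0 if $size1 != $size2;

    # Same size proves nothing, so compare the contents.
    # compare() returns 0 if equal, 1 if different, -1 on error.
    my $result = compare( $file1, $file2 );
    die "Can't compare $file1 and $file2: $!" if $result < 0;
    return $result == 0 ? 1 : 0;
}

# Usage: perl same.pl 2006.txt 2007.txt
if ( @ARGV == 2 ) {
    print files_identical(@ARGV) ? "same\n" : "different\n";
}
```

When the sizes differ, the answer comes back without opening either file; otherwise the cost is about the same as running cmp.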
Re: quick and safe way to deal with this? by bart (Canon) on Aug 24, 2007 at 22:19 UTC
Take an MD5 checksum for each file and compare those.
Re^2: quick and safe way to deal with this? by kyle (Abbot) on Aug 25, 2007 at 00:08 UTC
If the files are very large and the differences are very early, you're going to spend a lot of time reading through a lot of data that you don't need to. Otherwise, this is a very easy way to find that files are different.
Re^3: quick and safe way to deal with this? by bart (Canon) on Aug 25, 2007 at 00:28 UTC
But if the only differences are near the end, then you'll waste a lot of time going through the file a line at a time. Worse: if the files are equal, then you'll have to go through the whole file anyway. Why do people always suppose you'll commonly get positive results near the start of the loop count?

Let's make a compromise: select and read a random block from both files, somewhere "in the middle", and compare. If the files are different, then you'll likely see it immediately, especially if the typical differences are in the addition or deletion of whole lines, and not replacement of single characters.
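A rough sketch of that random-block idea, in case it helps. The sub names `read_block` and `sample_differs` and the 64 KB default block size are assumptions made for the example, and the up-front size check is an addition of mine, not part of the suggestion above. A matching sample does not prove the files are identical; it only fails to find a difference.

```perl
use strict;
use warnings;

# Read up to $length bytes starting at $offset (hypothetical helper).
sub read_block {
    my ( $path, $offset, $length ) = @_;
    open my $fh, '<:raw', $path or die "Can't read $path: $!";
    seek $fh, $offset, 0 or die "Can't seek in $path: $!";
    my $buffer = '';
    my $got = read $fh, $buffer, $length;
    die "Can't read $path: $!" unless defined $got;
    close $fh;
    return $buffer;
}

# Return 1 if a difference was found, 0 if the sampled block matched.
# A 0 result is NOT a guarantee that the files are identical.
sub sample_differs {
    my ( $file1, $file2, $block_size ) = @_;
    $block_size ||= 64 * 1024;    # assumed default, tune to taste

    my $size1 = ( stat $file1 )[7];
    die "Can't stat $file1: $!" unless defined $size1;
    my $size2 = ( stat $file2 )[7];
    die "Can't stat $file2: $!" unless defined $size2;

    return 1 if $size1 != $size2;    # added or deleted lines show up here
    return 0 if $size1 == 0;         # two empty files, nothing to sample

    # Pick one random offset and read the same block from both files.
    my $max_offset = $size1 > $block_size ? $size1 - $block_size : 0;
    my $offset     = int rand( $max_offset + 1 );
    my $block1     = read_block( $file1, $offset, $block_size );
    my $block2     = read_block( $file2, $offset, $block_size );
    return $block1 ne $block2 ? 1 : 0;
}

# Usage: perl sample.pl file1 file2
if ( @ARGV == 2 ) {
    print sample_differs(@ARGV)
        ? "different\n"
        : "no difference in the sampled block\n";
}
```

If the sample matches, you would still need a full comparison (cmp, or a checksum) before calling the files equal.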
Re^4: quick and safe way to deal with this? by kyle (Abbot) on Aug 26, 2007 at 02:46 UTC
Re: quick and safe way to deal with this? by kyle (Abbot) on Aug 25, 2007 at 00:05 UTC
This should work for any number of files, up to the maximum you can have open. The final check catches the case where one of the later files still has lines left after the first file ends.

`use List::Util qw( first );`
`my @files = ( 'FILE1', 'FILE2' );`
`my @handles;`
`foreach my $file ( @files ) { open my $fh, '<', $file or die "Can't read $file: $!"; push @handles, $fh; }`
`my $first_handle = shift @handles;`
`while ( my $line = <$first_handle> ) { if ( defined first { my $other = <$_>; !defined $other or $line ne $other } @handles ) { die "There's a difference."; } }`
`die "There's a difference." if first { !eof($_) } @handles;`
Re: quick and safe way to deal with this? by technojosh (Priest) on Aug 25, 2007 at 13:18 UTC
If this is on Windows, you can use the 'fc' command to compare two files as well. It will also be faster than Perl, and it does a few different things depending on how you want the files compared. From the win32 command prompt:

C:\>help fc
Compares two files or sets of files and displays the differences between them

FC [/A] [/C] [/L] [/LBn] [/N] [/OFF[LINE]] [/T] [/U] [/W] [/nnnn]
   [drive1:][path1]filename1 [drive2:][path2]filename2
FC /B [drive1:][path1]filename1 [drive2:][path2]filename2

  /A         Displays only first and last lines for each set of differences.
  /B         Performs a binary comparison.
  /C         Disregards the case of letters.
  /L         Compares files as ASCII text.
  /LBn       Sets the maximum consecutive mismatches to the specified number of lines.
  /N         Displays the line numbers on an ASCII comparison.
  /OFF[LINE] Do not skip files with offline attribute set.
  /T         Does not expand tabs to spaces.
  /U         Compare files as UNICODE text files.
  /W         Compresses white space (tabs and spaces) for comparison.
  /nnnn      Specifies the number of consecutive lines that must match after a mismatch.
  [drive1:][path1]filename1
             Specifies the first file or set of files to compare.
  [drive2:][path2]filename2
             Specifies the second file or set of files to compare.