Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello fellow monks!
I am new to your site and I admit I have found many interesting and useful things about Perl here!
I want to do the following:
I have 2 files that may or may not have some different lines
(this I cannot know beforehand).
For example, you may have
    FILE1
    123
    tomas
    kelly
    345
    678
    ###########
    FILE2
    123
    nick
    kelly
    459
    678

OR

    FILE1
    123
    tomas
    kelly
    345
    678
    ###########
    FILE2
    123
    tomas
    kelly
    345
    678

The good news is that file1 and file2 have the same number of lines, and the lines correspond: line 1 will either hold 123 in both files, or 123 in one file and something else in the other.
What I want to do is read both files, and search to find if there is at least one difference.
I don't care if there are 10 or 100 differences between the 2 files, I just want to see if the files are identical line-by-line or they differ in one line at least.
I tried using diff but it didn't work, so I was wondering whether there is a quicker way...

Replies are listed 'Best First'.
Re: quick and safe way to deal with this?
by graff (Chancellor) on Aug 25, 2007 at 00:58 UTC
    I tried using diff but it didn't work

    Are you talking about the unix/gnu diff tool? How did you use it, and in what sense did it not work? Given two files, if they are different, diff will report what the differences are, and will return an exit status of 1 if the files differ, or 0 (="success") if they are identical.

    If you don't want to see all the lines that differ, you can use the "cmp" command instead; it gives the same exit status as diff does, but will otherwise be fairly quiet, and will only read all of its two input files when they happen to be identical (i.e. it stops as soon as it sees a difference). Examples:

    cmp file1 file2 && echo same
    cmp file1 file2 || echo different

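
    The same exit-status check can be driven from Perl. The following is only a sketch: it assumes a cmp that accepts -s (silent) is on the PATH, and the sub name files_identical is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the files are byte-identical, 0 if they differ.
# Dies if cmp could not be run, was killed, or reported trouble (status 2 or higher).
sub files_identical {
    my ( $file1, $file2 ) = @_;

    # List-form system() bypasses the shell; -s keeps cmp silent.
    system( 'cmp', '-s', $file1, $file2 );
    die "Could not run cmp: $!"      if $? == -1;
    die "cmp was killed by a signal" if $? & 127;

    my $status = $? >> 8;    # the exit status lives in the high byte of $?
    return 1 if $status == 0;
    return 0 if $status == 1;
    die "cmp failed with exit status $status";
}
```

    Called as files_identical('file1', 'file2'), it returns a true value only when cmp exits with status 0.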
    But as someone else pointed out, if you have lots of files and you need to compare all possible pairings of files, you'll be better off computing md5 checksums for them (try the "md5sum" command or roll your own in perl with Digest::MD5), then sort by checksum string and look for duplicate checksums.

    (Identical checksums are not a guarantee of identical content, but different checksums definitely indicate different content. You only need to run cmp on pairs that have the same checksum, to confirm whether they are really identical.)
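
    A rough sketch of the checksum approach in Perl, using Digest::MD5. The sub names md5_of_file and duplicate_candidates are invented for this example, and a hash keyed by checksum stands in for the "sort and look for duplicates" step.

```perl
use strict;
use warnings;
use Digest::MD5;

# Returns the hex MD5 digest of a file's contents.
sub md5_of_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Can't read $path: $!";
    binmode $fh;
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $digest;
}

# Groups file names by checksum and returns only the groups (array refs)
# with more than one member, i.e. the candidates for identical content.
sub duplicate_candidates {
    my @files = @_;
    my %by_digest;
    push @{ $by_digest{ md5_of_file($_) } }, $_ for @files;
    return grep { @$_ > 1 } values %by_digest;
}
```

    Each returned group is only a set of candidates; the files in it still need a cmp run to confirm that they really match.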

Re: quick and safe way to deal with this?
by BrowserUk (Patriarch) on Aug 25, 2007 at 00:34 UTC
    I tried using diff but it didn't work

    Strange, your copy must be broken because this is exactly what diff does. And it will be far faster than any Perl solution.

    Or maybe you just used it wrong?


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: quick and safe way to deal with this?
by thezip (Vicar) on Aug 24, 2007 at 22:24 UTC

    Just checking the file sizes will, a lot of the time, be enough to tell the two files apart:

    #!/perl/bin/perl -w
    use strict;

    my $file1 = "2006.txt";
    my $file2 = "2007.txt";

    my $file1size = -s $file1;
    my $file2size = -s $file2;

    print "The files have ",
        $file1size == $file2size ? "the same" : "different",
        " size.\n\n";

    In what way did diff not work? Diff should be able to do exactly what you're trying to do...
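
    Equal sizes do not prove equal content, so a size check is usually only a first step. As a sketch of one way to follow it up, the core module File::Compare can do the content comparison; the sub name same_content is made up here.

```perl
use strict;
use warnings;
use File::Compare;    # core module; exports compare()

# Hypothetical helper: returns 1 if the two files have identical content, 0 otherwise.
sub same_content {
    my ( $file1, $file2 ) = @_;

    my $size1 = -s $file1;
    my $size2 = -s $file2;
    die "Can't stat $file1: $!" unless defined $size1;
    die "Can't stat $file2: $!" unless defined $size2;

    # Different sizes settle the matter without reading any data.
    return 0 if $size1 != $size2;

    # compare() returns 0 if the files are equal, 1 if different, -1 on error.
    my $result = compare( $file1, $file2 );
    die "Can't compare $file1 and $file2: $!" if $result < 0;
    return $result == 0 ? 1 : 0;
}
```

    With the file names from the script above, the call would be same_content("2006.txt", "2007.txt").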


    Where do you want *them* to go today?
Re: quick and safe way to deal with this?
by bart (Canon) on Aug 24, 2007 at 22:19 UTC

      If the files are very large and the differences are very early, you're going to spend a lot of time reading through a lot of file that you don't need to. Otherwise, this is a very easy way to find that files are different.

        But if the only differences are near the end, then you'll waste a lot of time going through the file a line at a time. Worse: if the files are equal, then you'll have to go through the whole file anyway.

        Why do people always suppose you'll commonly get positive results near the start of the loop count?

        Let's make a compromise: select and read a random block from both files, somewhere "in the middle", and compare. If the files are different, then you'll likely see it immediately, especially if the typical differences are in the addition or deletion of whole lines, and not replacement of single characters.
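
        A minimal sketch of that compromise, under a few assumptions that are not in the post: the sub names are invented, the default block size of 4096 bytes is arbitrary, and files of different sizes are treated as different straight away.

```perl
use strict;
use warnings;

# Reads up to $length bytes starting at byte $offset and returns them.
sub read_block {
    my ( $path, $offset, $length ) = @_;
    open my $fh, '<', $path or die "Can't read $path: $!";
    binmode $fh;
    seek( $fh, $offset, 0 ) or die "Can't seek in $path: $!";
    my $buffer = '';
    my $got = read( $fh, $buffer, $length );
    die "Can't read from $path: $!" unless defined $got;
    close $fh;
    return $buffer;
}

# Returns 1 if the files differ in size or in one randomly chosen block,
# 0 if that block matches. A 0 result is NOT proof that the files are identical.
sub random_block_differs {
    my ( $file1, $file2, $block_size ) = @_;
    $block_size ||= 4096;    # assumed default, not from the post

    my $size1 = -s $file1;
    my $size2 = -s $file2;
    die "Can't stat $file1: $!" unless defined $size1;
    die "Can't stat $file2: $!" unless defined $size2;
    return 1 if $size1 != $size2;

    # Pick a start offset so that the block stays inside the file.
    my $max_offset = $size1 > $block_size ? $size1 - $block_size : 0;
    my $offset     = int rand( $max_offset + 1 );

    my $block1 = read_block( $file1, $offset, $block_size );
    my $block2 = read_block( $file2, $offset, $block_size );
    return $block1 ne $block2 ? 1 : 0;
}
```

        A matching block only makes equality more likely, so a full comparison is still needed before declaring two files identical.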

Re: quick and safe way to deal with this?
by kyle (Abbot) on Aug 25, 2007 at 00:05 UTC

    This should work for any number of files, up to the maximum you can have open.

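    use List::Util qw( first );

    my @files = ( 'FILE1', 'FILE2' );
    my @handles;

    foreach my $file ( @files ) {
        open my $fh, '<', $file or die "Can't read $file: $!";
        push @handles, $fh;
    }

    my $first_handle = shift @handles;

    while ( my $line = <$first_handle> ) {
        # Read the next line from each of the other files and compare it.
        if ( defined first { $line ne <$_> } @handles ) {
            die "There's a difference.";
        }
    }

    # A file with lines left over is longer than the first one, so it differs too.
    if ( defined first { !eof($_) } @handles ) {
        die "There's a difference.";
    }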
Re: quick and safe way to deal with this?
by technojosh (Priest) on Aug 25, 2007 at 13:18 UTC
    If this is on windows, you can use the 'fc' command to compare two files as well. Will also be faster than Perl, and does a few different things depending on how you want the files compared.

    From win32 command prompt:

    C:\>help fc
    Compares two files or sets of files and displays the differences between them

    FC [/A] [/C] [/L] [/LBn] [/N] [/OFF[LINE]] [/T] [/U] [/W] [/nnnn]
       [drive1:][path1]filename1 [drive2:][path2]filename2
    FC /B [drive1:][path1]filename1 [drive2:][path2]filename2

      /A         Displays only first and last lines for each set of differences.
      /B         Performs a binary comparison.
      /C         Disregards the case of letters.
      /L         Compares files as ASCII text.
      /LBn       Sets the maximum consecutive mismatches to the specified
                 number of lines.
      /N         Displays the line numbers on an ASCII comparison.
      /OFF[LINE] Do not skip files with offline attribute set.
      /T         Does not expand tabs to spaces.
      /U         Compare files as UNICODE text files.
      /W         Compresses white space (tabs and spaces) for comparison.
      /nnnn      Specifies the number of consecutive lines that must match
                 after a mismatch.
      [drive1:][path1]filename1
                 Specifies the first file or set of files to compare.
      [drive2:][path2]filename2
                 Specifies the second file or set of files to compare.