mosh has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I've encountered a difficulty and hope you can assist. I need to compare two files as follows: read line 1 from file 1 and search for an exact match in file 2 (it could be at a different location than line 1), then read line 2 from file 1 and search for an exact match in file 2, and so on. Of course, report an error if there is no match. I can't just use diff, since the lines could be in different positions in the two files. Does anyone know how to begin writing this? Maybe there is a module that does it? My search turned up nothing. Thanks in advance, Mosh.

Re: comparing 2 files problem
by ikegami (Patriarch) on Sep 07, 2004 at 15:36 UTC

    How about

    open(FILE1, '<file1.txt')
      or die("Cannot open first file: $!.\n");
    open(FILE2, '<file2.txt')
      or die("Cannot open second file: $!.\n");

    # Load up the second file into a hash,
    # where each line of the file is a key.
    %file2 = map { $_ => 1 } <FILE2>;

    while (<FILE1>) {
       if ($file2{$_}) {
          print("Found $_");
       } else {
          print("Didn't find $_");
       }
    }

    __END__

    file1.txt
    =========
    qwerty
    snakegod
    ebrine
    tarot

    file2.txt
    =========
    snakegod
    ordo rosae moriatur
    tarot
    wrath of hibernia

    output
    ======
    Didn't find qwerty
    Found snakegod
    Didn't find ebrine
    Found tarot
      # A version that also checks for lines in file2 that are not in file1:

      open(FILE1, '<file1.txt')
        or die("Cannot open first file: $!.\n");
      open(FILE2, '<file2.txt')
        or die("Cannot open second file: $!.\n");

      %file1 = map { $_ => 1 } <FILE1>;
      %file2 = map { $_ => 1 } <FILE2>;

      foreach (keys(%file1)) {
         if ($file2{$_}) {
            print("Found in both files: $_");
         } else {
            print("Found only in first file: $_");
         }
      }

      foreach (keys(%file2)) {
         unless ($file1{$_}) {
            print("Found only in second file: $_");
         }
      }
        # This version adds difference counts:

        open(FILE1, '<file1.txt')
          or die("Cannot open first file: $!.\n");
        open(FILE2, '<file2.txt')
          or die("Cannot open second file: $!.\n");

        # Count how many times each line occurs in each file.
        $file1{$_}++ while (<FILE1>);
        $file2{$_}++ while (<FILE2>);

        foreach (keys(%file1)) {
           if ($file2{$_}) {
              $diff = $file2{$_} - $file1{$_};
              if ($diff) {
                 if ($diff < 0) {
                    # $diff is negative here, so report its magnitude.
                    print("Found in first file ", -$diff, " times more than in second file: $_");
                 } else {
                    print("Found in second file $diff times more than in first file: $_");
                 }
              } else {
                 print("Found in both files an equal number of times: $_");
              }
           } else {
              print("Found only in first file ($file1{$_} times): $_");
           }
        }

        foreach (keys(%file2)) {
           unless ($file1{$_}) {
              print("Found only in second file ($file2{$_} times): $_");
           }
        }
Re: comparing 2 files problem
by hardburn (Abbot) on Sep 07, 2004 at 15:37 UTC

    If file 2 is fairly small, I'd put that file into a hash:

    my %file_data;
    open( my $fh, '<', "/path/to/file2" ) or die $!;
    # Read from $fh (not the default <>), mapping each line to its line number.
    while (<$fh>) {
        chomp;
        $file_data{$_} = $.;
    }
    close $fh;

    You can then go through file1 line by line and look up each line in the hash. The value will be the number of the line in file 2 on which that entry appears (the last such line, if the entry is duplicated).
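
    A minimal sketch of that lookup step, for illustration only. The sub names index_lines and find_lines are made up here, and the "files" are in-memory strings so the example runs on its own; with real files you would pass ordinary filehandles instead.

```perl
use strict;
use warnings;

# Map each line's text to the number of the line where it (last) appears.
sub index_lines {
    my ($fh) = @_;
    my %line_no;
    my $n = 0;
    while ( my $line = <$fh> ) {
        chomp $line;
        $line_no{$line} = ++$n;
    }
    return \%line_no;
}

# For each line read from $fh, return a pair [ text, line number in the
# indexed file ]. The number is undef when the line was not found.
sub find_lines {
    my ( $fh, $index ) = @_;
    my @results;
    while ( my $line = <$fh> ) {
        chomp $line;
        push @results, [ $line, $index->{$line} ];
    }
    return @results;
}

# Demo data standing in for file 1 and file 2 (hypothetical contents).
my $data1 = "qwerty\ntarot\nsnakegod\n";
my $data2 = "snakegod\nordo rosae moriatur\ntarot\n";
open( my $fh1, '<', \$data1 ) or die "Cannot open in-memory file 1: $!";
open( my $fh2, '<', \$data2 ) or die "Cannot open in-memory file 2: $!";

my $index = index_lines($fh2);
for my $result ( find_lines( $fh1, $index ) ) {
    my ( $text, $where ) = @$result;
    if ( defined($where) ) {
        print "'$text' found at line $where of file 2\n";
    }
    else {
        print "'$text' not found in file 2\n";
    }
}
```

    Returning the line number rather than a plain true value costs nothing extra and lets the caller say where the match was.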

    "There is no shame in being self-taught, only in not trying to learn in the first place." -- Atrus, Myst: The Book of D'ni.

Re: comparing 2 files problem
by davido (Cardinal) on Sep 07, 2004 at 15:56 UTC

    The hash approach is probably best, if you can guarantee that file 2 will always be small enough to fit into memory. Iterating through file 1 and looking up each line as a key in the hash holding file 2 is an O(n) operation, since each hash lookup is O(1) on average. Yes, there is some time involved in building the hash, but that's only done once, so at worst you would be looking at O(2n), which is still just O(n) (big-O notation ignores constant multipliers). By contrast, iterating through file 1 and grepping file 2 for each line is O(n^2), assuming the second file is about the same size as the first.
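
    To make the difference concrete, here is a rough sketch of both strategies side by side. The sub names missing_by_scan and missing_by_hash are invented for this example, and the lines are passed in as array references rather than read from files.

```perl
use strict;
use warnings;

# Quadratic approach: for every line of file 1, scan all of file 2.
sub missing_by_scan {
    my ( $lines1, $lines2 ) = @_;
    my @missing;
    for my $line (@$lines1) {
        my $hits = grep { $_ eq $line } @$lines2;
        push @missing, $line unless $hits;
    }
    return @missing;
}

# Linear approach: one pass to build the hash, one pass of lookups.
sub missing_by_hash {
    my ( $lines1, $lines2 ) = @_;
    my %seen = map { $_ => 1 } @$lines2;
    return grep { !$seen{$_} } @$lines1;
}

my @file1 = qw(qwerty snakegod ebrine tarot);
my @file2 = ( 'snakegod', 'ordo rosae moriatur', 'tarot', 'wrath of hibernia' );

print join( ', ', missing_by_scan( \@file1, \@file2 ) ), "\n";
print join( ', ', missing_by_hash( \@file1, \@file2 ) ), "\n";
```

    Both calls print "qwerty, ebrine"; the results are the same, only the amount of work differs as the files grow.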

    One possibility exists for which your question remained silent: What happens if something in File 2 doesn't exist in file 1? The methods proposed will silently allow that to happen, and in fact, your question leads me to believe that's fine too. But just in case, you should realize that your question didn't cover that possibility -- probably not a problem, but something to remember.


    Dave

      After reading your comment (and adapting slightly from hardburn's comment), I came up with the following code using hashes (as mentioned above, and with the same cautions). It handles both the case of an entry in file 2 but not file 1 and multiple occurrences of an entry in a file (by listing the locations in the results). It does not, however, cover a difference in the number of occurrences of an entry between the two files. (Data files adapted from those in the comment by ikegami.)

      #!/usr/bin/perl -w
      use strict;

      if ( scalar(@ARGV) < 2 ) {
          print "Usage:\n\t$0 file1 file2\n\n";
          die;
      }

      my @filename = ( $ARGV[0], $ARGV[1] );
      my (@content);

      # For each file, map every line's text to the list of line numbers
      # on which it appears.
      foreach my $i ( 0, 1 ) {
          open( DF, $filename[$i] )
            or die("Can't open $filename[$i] for input: $!\n");
          while (<DF>) {
              chomp;
              push( @{ $content[$i]{$_} }, $. );
          }
          close(DF);
      }

      my @keycount = (
          scalar( keys( %{ $content[0] } ) ),
          scalar( keys( %{ $content[1] } ) )
      );
      if ( $keycount[0] != $keycount[1] ) {
          my @differential = @filename;
          if ( $keycount[0] > $keycount[1] ) {
              @differential = reverse(@filename);
          }
          print "Fewer values detected in ", $differential[0], " than ",
            $differential[1], "\n";
      }

      # Report entries present in both files, then remove them.
      foreach my $k ( sort( keys( %{ $content[0] } ) ) ) {
          if ( defined( $content[1]{$k} ) ) {
              print $k, "\n";
              foreach ( 0, 1 ) {
                  print "\tFound in ", $filename[$_], " at line(s): ",
                    join( ', ', @{ $content[$_]{$k} } ), "\n";
                  delete( $content[$_]{$k} );
              }
          }
      }

      # Whatever is left appears in only one of the two files.
      @keycount = (
          scalar( keys( %{ $content[0] } ) ),
          scalar( keys( %{ $content[1] } ) )
      );
      if ( $keycount[0] or $keycount[1] ) {
          foreach ( 0, 1 ) {
              if ( $keycount[$_] ) {
                  print "Found in ", $filename[$_], " but not in ",
                    $filename[ ( $_ + 1 ) % 2 ], ":\n";
                  foreach my $k ( sort( keys( %{ $content[$_] } ) ) ) {
                      print "\t'", $k, "' at line(s): ",
                        join( ', ', @{ $content[$_]{$k} } ), "\n";
                      delete( $content[$_]{$k} );
                  }
              }
          }
      }

      Sample input files:

      Sample execution runs:

      Hope that helps.

Re: comparing 2 files problem
by bluto (Curate) on Sep 07, 2004 at 15:54 UTC
    First, if the only reason you aren't using 'diff' is that the lines are in different locations, you can sort the files before diffing. (Depending on what you are trying to do, you may want to use sort's '-u' flag to remove duplicates.)

    Note that using 'diff' is *not* the same as reading one file and comparing the line against the second file. You aren't checking for extra lines that appear in the second file. Also, you aren't checking for duplicates (i.e. 2 identical lines in the first file match 1 line in the second).
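
    If you'd rather stay in Perl, the sort-then-compare idea can be sketched roughly like this. The sub name same_lines is made up for the example, and it only answers "same or different", without saying which lines differ.

```perl
use strict;
use warnings;

# True if both array refs hold the same lines, ignoring order but
# counting duplicates (so two copies of a line must match two copies).
sub same_lines {
    my ( $lines1, $lines2 ) = @_;
    return 0 unless @$lines1 == @$lines2;
    my @sorted1 = sort @$lines1;
    my @sorted2 = sort @$lines2;
    for my $i ( 0 .. $#sorted1 ) {
        return 0 if $sorted1[$i] ne $sorted2[$i];
    }
    return 1;
}

my $reordered  = same_lines( [qw(a b c)], [qw(c a b)] );
my $duplicated = same_lines( [qw(a a b)], [qw(a b b)] );
print "reordered:  ", ( $reordered  ? "same" : "different" ), "\n";
print "duplicated: ", ( $duplicated ? "same" : "different" ), "\n";
```

    The first pair comes out as "same" and the second as "different", since the duplicate counts don't line up.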

    bluto

Re: comparing 2 files problem
by McMahon (Chaplain) on Sep 07, 2004 at 16:32 UTC
    List::Compare

    One of my favorite modules. It implements an exercise in the Perl Cookbook.
Re: comparing 2 files problem
by ysth (Canon) on Sep 07, 2004 at 17:50 UTC
    $ sort <file1 >file1.tmp
    $ sort <file2 >file2.tmp
    $ diff -q file1.tmp file2.tmp
Re: comparing 2 files problem
by Anonymous Monk on Sep 07, 2004 at 17:02 UTC
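    If you don't need to know which line of file2 matched a line from file1, then I would use something like this:

    open (F1, "<file1.txt") or die "Cannot open file1.txt: $!\n";
    open (F2, "<file2.txt") or die "Cannot open file2.txt: $!\n";
    my $file2;
    {
        local $/;        # slurp mode: read the whole file at once
        $file2 = <F2>;
    }
    while (<F1>) {
        # $_ is interpolated into the pattern as a regex, not as a literal string
        print "Line $. not found.\n" unless ($file2 =~ /^$_/m);
    }
    close (F1);
    close (F2);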

    This puts the entire content of file2 into a single scalar, and then checks whether each line of file1 occurs in it by using a regex.

      AM's suggestion works for many cases but if the text in your files contains regex metachars, you'll need to tweak the regex a bit.

      For example, if you had a reference to C++ in your lines, and you use warnings (obligatory warning: you should!), then you'll get a warning about nested quantifiers.
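
      One possible tweak, sketched under the assumption that you want a literal, whole-line match: wrap the interpolated line in \Q...\E so that metacharacters are escaped. The sub name has_line and the sample text are made up for this example.

```perl
use strict;
use warnings;

# Return 1 if $line occurs as a complete line of $text, else 0.
# \Q...\E escapes regex metacharacters so the line is matched literally;
# the ^ and $ anchors with /m tie the match to a whole line.
sub has_line {
    my ( $text, $line ) = @_;
    chomp $line;
    return $text =~ /^\Q$line\E$/m ? 1 : 0;
}

# Hypothetical contents standing in for the slurped file2.
my $file2 = "perl\nC++ (and friends)\nruby\n";

for my $probe ( "C++ (and friends)\n", "C+\n" ) {
    my $status = has_line( $file2, $probe ) ? 'found' : 'not found';
    chomp( my $shown = $probe );
    print "$shown: $status\n";
}
```

      The first probe is reported as found and the second as not found. If all you need is an exact match, the hash lookups suggested elsewhere in this thread sidestep the regex question entirely.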