comparing 2 files problem

mosh has asked for the wisdom of the Perl Monks concerning the following question:

Re: comparing 2 files problem by ikegami (Patriarch) on Sep 07, 2004 at 15:36 UTC
How about

    open(FILE1, '<file1.txt')
        or die("Cannot open first file: $!.\n");
    open(FILE2, '<file2.txt')
        or die("Cannot open second file: $!.\n");

    # Load up the second file into a hash,
    # where each line of the file is a key.
    %file2 = map { $_ => 1 } <FILE2>;

    while (<FILE1>) {
        if ($file2{$_}) {
            print("Found $_");
        } else {
            print("Didn't find $_");
        }
    }

    __END__

    file1.txt
    =========
    qwerty
    snakegod
    ebrine
    tarot

    file2.txt
    =========
    snakegod
    ordo rosae moriatur
    tarot
    wrath of hibernia

    output
    ======
    Didn't find qwerty
    Found snakegod
    Didn't find ebrine
    Found tarot
Re^2: comparing 2 files problem by ikegami (Patriarch) on Sep 07, 2004 at 18:52 UTC
    # A version that also checks for lines in file2 that are not in file1:

    open(FILE1, '<file1.txt')
        or die("Cannot open first file: $!.\n");
    open(FILE2, '<file2.txt')
        or die("Cannot open second file: $!.\n");

    %file1 = map { $_ => 1 } <FILE1>;
    %file2 = map { $_ => 1 } <FILE2>;

    foreach (keys(%file1)) {
        if ($file2{$_}) {
            print("Found in both files: $_");
        } else {
            print("Found only in first file: $_");
        }
    }

    foreach (keys(%file2)) {
        unless ($file1{$_}) {
            print("Found only in second file: $_");
        }
    }
Re^3: comparing 2 files problem by ikegami (Patriarch) on Sep 07, 2004 at 18:54 UTC
    # This version adds difference counts:

    open(FILE1, '<file1.txt')
        or die("Cannot open first file: $!.\n");
    open(FILE2, '<file2.txt')
        or die("Cannot open second file: $!.\n");

    $file1{$_}++ while (<FILE1>);
    $file2{$_}++ while (<FILE2>);

    foreach (keys(%file1)) {
        if ($file2{$_}) {
            $diff = $file2{$_} - $file1{$_};
            if ($diff) {
                if ($diff < 0) {
                    # Negate so the reported count is positive.
                    $diff = -$diff;
                    print("Found in first file $diff times more than in second file: $_");
                } else {
                    print("Found in second file $diff times more than in first file: $_");
                }
            } else {
                print("Found in both files an equal number of times: $_");
            }
        } else {
            print("Found only in first file ($file1{$_} times): $_");
        }
    }

    foreach (keys(%file2)) {
        unless ($file1{$_}) {
            print("Found only in second file ($file2{$_} times): $_");
        }
    }
Re: comparing 2 files problem by hardburn (Abbot) on Sep 07, 2004 at 15:37 UTC
If file 2 is fairly small, I'd put that file into a hash:

    my %file_data;
    open( my $fh, '<', "/path/to/file2" ) or die $!;
    # Read from $fh explicitly; a bare <> would read STDIN/@ARGV instead.
    while (<$fh>) {
        chomp;
        $file_data{$_} = $.;
    }
    close $fh;

You can then go through file1 line-by-line and look up the hash entry. The value will be the line number that entry is on.

"There is no shame in being self-taught, only in not trying to learn in the first place." -- Atrus, Myst: The Book of D'ni.
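The lookup step described above is not shown in the reply, so here is a minimal sketch of it. The helper names `index_lines` and `lookup_lines` are hypothetical, and the in-memory strings stand in for the real files (the sample lines are adapted from ikegami's reply); this is one possible way to do it, not necessarily what hardburn had in mind.

```perl
use strict;
use warnings;

# Hypothetical helper: map each chomped line read from a filehandle to
# the number of the last line on which it appears.
sub index_lines {
    my ($fh) = @_;
    my %line_no;
    while ( my $line = <$fh> ) {
        chomp $line;
        $line_no{$line} = $.;
    }
    return \%line_no;
}

# Hypothetical helper: for each line read from $fh, report where (if
# anywhere) it occurs in the index built by index_lines().
sub lookup_lines {
    my ( $fh, $index ) = @_;
    my @report;
    while ( my $line = <$fh> ) {
        chomp $line;
        push @report,
            exists $index->{$line}
            ? "$line: line $index->{$line}"
            : "$line: not found";
    }
    return @report;
}

# In-memory stand-ins for file2 and file1 (assumed sample data).
my $file2_text = "snakegod\nordo rosae moriatur\ntarot\nwrath of hibernia\n";
my $file1_text = "qwerty\nsnakegod\nebrine\ntarot\n";

open( my $fh2, '<', \$file2_text ) or die "Cannot open in-memory file2: $!";
my $index = index_lines($fh2);
close $fh2;

open( my $fh1, '<', \$file1_text ) or die "Cannot open in-memory file1: $!";
my @report = lookup_lines( $fh1, $index );
close $fh1;

print "$_\n" for @report;
```

With real files, the two `open` calls would take file paths instead of scalar references. Note that a line repeated in file 2 keeps only its last line number in this sketch.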
Re: comparing 2 files problem by davido (Cardinal) on Sep 07, 2004 at 15:56 UTC
The hash approach is probably best, if you can guarantee that file 2 will always be small enough to fit into memory. Iterating through file 1 and looking up each line as a key in the hash holding file 2 is an O(n) operation overall, since each hash lookup is O(1). Yes, there is some time involved in building the hash, but that's only done once, so at worst you would be looking at about 2n steps, which is still O(n) (constant multipliers are dropped in big-O notation). Whereas iterating through file 1 and grepping file 2 for each line will be O(n^2) (assuming the second file is about the same size as the first).

One possibility exists on which your question remained silent: what happens if something in file 2 doesn't exist in file 1? The methods proposed will silently ignore such lines, and in fact your question leads me to believe that's fine. But just in case, you should realize that your question didn't cover that possibility. It's probably not a problem, but something to remember.

Dave
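To make the comparison above concrete, here is a small sketch of both strategies working on arrays of already-chomped lines. The function names and the sample data are assumptions for illustration only; both functions return the lines of the first list that are absent from the second.

```perl
use strict;
use warnings;

# Hash approach: build the lookup table once, then one O(1) probe per line.
sub missing_via_hash {
    my ( $first, $second ) = @_;
    my %in_second = map { $_ => 1 } @$second;
    return grep { !$in_second{$_} } @$first;
}

# Nested-scan approach: rescan all of @$second for every line of @$first,
# which is the O(n^2) behaviour described above.
sub missing_via_scan {
    my ( $first, $second ) = @_;
    my @missing;
    for my $line (@$first) {
        push @missing, $line unless grep { $_ eq $line } @$second;
    }
    return @missing;
}

# Assumed sample data, adapted from ikegami's reply.
my @file1 = qw(qwerty snakegod ebrine tarot);
my @file2 = ( 'snakegod', 'ordo rosae moriatur', 'tarot', 'wrath of hibernia' );

print "hash: ", join( ', ', missing_via_hash( \@file1, \@file2 ) ), "\n";
print "scan: ", join( ', ', missing_via_scan( \@file1, \@file2 ) ), "\n";
```

Swapping the two arguments answers the reverse question (lines in file 2 that are missing from file 1), which covers the case the original question left open.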
Re^2: comparing 2 files problem by atcroft (Abbot) on Sep 07, 2004 at 17:11 UTC
After reading your comment (and adapting slightly from hardburn's comment), I came up with the following code using hashes (as mentioned above, and with the same cautions), which handles both the case of an entry in file 2 but not file 1, as well as multiple occurrences of an entry in a file (by listing the locations in the results). It does not, however, cover the difference in the number of occurrences of an entry in the two files. (Data files adapted from those in the comment by ikegami.)

    #!/usr/bin/perl -w

    use strict;

    if ( scalar(@ARGV) < 2 ) {
        print "Usage:\n\t$0 file1 file2\n\n";
        die;
    }

    my @filename = ( $ARGV[0], $ARGV[1] );
    my (@content);

    # For each file, map every line to the list of line numbers it occurs on.
    foreach my $i ( 0, 1 ) {
        open( DF, $filename[$i] )
          or die("Can't open $filename[$i] for input: $!\n");
        while (<DF>) {
            chomp;
            push( @{ $content[$i]{$_} }, $. );
        }
        close(DF);
    }

    my @keycount = (
        scalar( keys( %{ $content[0] } ) ),
        scalar( keys( %{ $content[1] } ) )
    );
    if ( $keycount[0] != $keycount[1] ) {
        my @differential = @filename;
        if ( $keycount[0] > $keycount[1] ) {
            @differential = reverse(@filename);
        }
        print "Fewer values detected in ", $differential[0], " than ",
          $differential[1], "\n";
    }

    # Report entries present in both files, then drop them from both hashes.
    foreach my $k ( sort( keys( %{ $content[0] } ) ) ) {
        if ( defined( $content[1]{$k} ) ) {
            print $k, "\n";
            foreach ( 0, 1 ) {
                print "\tFound in ", $filename[$_], " at line(s): ",
                  join( ', ', @{ $content[$_]{$k} } ), "\n";
                delete( $content[$_]{$k} );
            }
        }
    }

    # Whatever remains is unique to one file.
    @keycount = (
        scalar( keys( %{ $content[0] } ) ),
        scalar( keys( %{ $content[1] } ) )
    );
    if ( $keycount[0] or $keycount[1] ) {
        foreach ( 0, 1 ) {
            if ( $keycount[$_] ) {
                print "Found in ", $filename[$_], " but not in ",
                  $filename[ ( $_ + 1 ) % 2 ], ":\n";
                foreach my $k ( sort( keys( %{ $content[$_] } ) ) ) {
                    print "\t'", $k, "' at line(s): ",
                      join( ', ', @{ $content[$_]{$k} } ), "\n";
                    delete( $content[$_]{$k} );
                }
            }
        }
    }

Hope that helps.
Re: comparing 2 files problem by bluto (Curate) on Sep 07, 2004 at 15:54 UTC
First, if the only reason you aren't using 'diff' is that the lines are in different locations, you can sort the files before diffing. (Depending on what you are trying to do, you may want to use sort's '-u' flag to remove duplicates.)

Note that using 'diff' is not the same as reading one file and comparing each line against the second file. With the latter approach you aren't checking for extra lines that appear only in the second file. Also, you aren't checking for duplicates (i.e. 2 identical lines in the first file match 1 line in the second).

bluto
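As a rough illustration of the point about sorting and duplicates, the sketch below compares two lists of chomped lines after sorting, optionally collapsing repeats first in the spirit of `sort -u`. The helper names `normalized` and `same_lines` and the sample data are assumptions, and the result is only a yes/no answer rather than a full diff.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines sorted, optionally with
# duplicates removed first (roughly what sort -u would give).
sub normalized {
    my ( $lines, $unique ) = @_;
    my %seen;
    my @kept = $unique ? ( grep { !$seen{$_}++ } @$lines ) : @$lines;
    my @sorted = sort @kept;
    return @sorted;
}

# Hypothetical helper: true (1) if both lists hold the same lines once
# sorted, false (0) otherwise.
sub same_lines {
    my ( $left, $right, $unique ) = @_;
    my @l = normalized( $left,  $unique );
    my @r = normalized( $right, $unique );
    return 0 unless @l == @r;
    for my $i ( 0 .. $#l ) {
        return 0 if $l[$i] ne $r[$i];
    }
    return 1;
}

# Assumed sample data: 'tarot' appears twice in the first list.
my @file1 = qw(tarot snakegod tarot);
my @file2 = qw(snakegod tarot);

print "as-is:  ", ( same_lines( \@file1, \@file2 )    ? "same" : "differ" ), "\n";
print "unique: ", ( same_lines( \@file1, \@file2, 1 ) ? "same" : "differ" ), "\n";
```

The two results differ because the duplicate line only counts when repeats are kept, which is the distinction the reply draws.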
Re: comparing 2 files problem by McMahon (Chaplain) on Sep 07, 2004 at 16:32 UTC
List::Compare. One of my favorite modules. It implements an exercise in the Perl Cookbook.
Re: comparing 2 files problem by ysth (Canon) on Sep 07, 2004 at 17:50 UTC
    $ sort <file1 >file1.tmp
    $ sort <file2 >file2.tmp
    $ diff -q file1.tmp file2.tmp
Re: comparing 2 files problem by Anonymous Monk on Sep 07, 2004 at 17:02 UTC
If you don't need to know on which line of file2 a line from file1 was found, then I would use something like this:

    open (F1, "<file1.txt") or die "Cannot open file1.txt: $!";
    open (F2, "<file2.txt") or die "Cannot open file2.txt: $!";
    my $file2;
    {
        local $/;    # slurp mode: read the whole file in one go
        $file2 = <F2>;
    }
    while (<F1>) {
        print "Line $. not found.\n" unless ($file2 =~ /^$_/m);
    }
    close (F1);
    close (F2);

This will put all the content of file2 in a single scalar, and then check whether each line occurs by using a regex.
Re^2: comparing 2 files problem by ww (Archbishop) on Sep 08, 2004 at 17:13 UTC
AM's suggestion works for many cases, but if the text in your files contains regex metachars, you'll need to tweak the regex a bit. For example, if you had a reference to C++ in your lines, and you use warnings (obligatory warning: you should!), then you'll get a warning about nested quantifiers.
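One common way to make that tweak is to escape the interpolated line with `\Q...\E` (the regex form of `quotemeta`) and anchor both ends of the match. The sketch below assumes chomped lines and uses a hypothetical helper name and made-up sample data; it is one possible fix, not necessarily the one ww had in mind.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines of @$wanted that do not occur as
# a whole line anywhere in the slurped text $haystack.
sub lines_not_found {
    my ( $wanted, $haystack ) = @_;
    my @missing;
    for my $line (@$wanted) {
        # \Q...\E escapes regex metacharacters in $line; with /m, ^ and $
        # anchor the match to a complete line of $haystack.
        push @missing, $line unless $haystack =~ /^\Q$line\E$/m;
    }
    return @missing;
}

# Assumed sample data containing metacharacters (+, ., parentheses).
my $file2 = "I like C++\nsnakegod\ntarot (major arcana)\n";
my @file1 = ( 'I like C++', 'tarot', 'snake.od' );

print "Not found: $_\n" for lines_not_found( \@file1, $file2 );
```

Here 'I like C++' matches literally without any quantifier warning, while 'tarot' and 'snake.od' are reported as missing because neither equals a complete line of the second text.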