open(FILE1, '<file1.txt') or die("Cannot open first file: $!.\n");
open(FILE2, '<file2.txt') or die("Cannot open second file: $!.\n");
# Load up the second file into a hash,
# where each line of the file is a key.
%file2 = map { $_ => 1 } <FILE2>;
while (<FILE1>) {
if ($file2{$_}) {
print("Found $_");
} else {
print("Didn't find $_");
}
}
__END__
file1.txt
=========
qwerty
snakegod
ebrine
tarot
file2.txt
=========
snakegod
ordo rosae moriatur
tarot
wrath of hibernia
output
======
Didn't find qwerty
Found snakegod
Didn't find ebrine
Found tarot
# A version that also checks for lines in file2 that are not in file1:
open(FILE1, '<file1.txt') or die("Cannot open first file: $!.\n");
open(FILE2, '<file2.txt') or die("Cannot open second file: $!.\n");
%file1 = map { $_ => 1 } <FILE1>;
%file2 = map { $_ => 1 } <FILE2>;
foreach (keys(%file1)) {
if ($file2{$_}) {
print("Found in both files: $_");
} else {
print("Found only in first file: $_");
}
}
foreach (keys(%file2)) {
unless ($file1{$_}) {
print("Found only in second file: $_");
}
}
# This version adds difference counts:
open(FILE1, '<file1.txt') or die("Cannot open first file: $!.\n");
open(FILE2, '<file2.txt') or die("Cannot open second file: $!.\n");
$file1{$_}++ while (<FILE1>);
$file2{$_}++ while (<FILE2>);
foreach (keys(%file1)) {
if ($file2{$_}) {
$diff = $file2{$_} - $file1{$_};
if ($diff) {
if ($diff < 0) {
print("Found in first file ", abs($diff), " times more than in second file: $_");
} else {
print("Found in second file $diff times more than in first file: $_");
}
} else {
print("Found in both files an equal number of times: $_");
}
} else {
print("Found only in first file ($file1{$_} times): $_");
}
}
foreach (keys(%file2)) {
unless ($file1{$_}) {
print("Found only in second file ($file2{$_} times): $_");
}
}
If file 2 is fairly small, I'd put that file into a hash:
my %file_data;
open( my $fh, '<', "/path/to/file2" ) or die $!;
while(<$fh>) {
chomp;
$file_data{$_} = $.;
}
close $fh;
You can then go through file1 line by line and look up each line in the hash. The value will be the line number that entry is on in file 2.
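A minimal sketch of that lookup step, under some assumptions: the helper names index_lines and report_matches are made up for illustration, and in-memory strings stand in for the two files so the example runs on its own. As in the snippet above, a line that occurs more than once in file 2 keeps only its last line number.

```perl
use strict;
use warnings;

# Hypothetical helper: map each chomped line to the line number
# of its last occurrence in the file behind $fh.
sub index_lines {
    my ($fh) = @_;
    my %line_no;
    while ( my $line = <$fh> ) {
        chomp $line;
        $line_no{$line} = $.;
    }
    return \%line_no;
}

# Hypothetical helper: for each line read from $fh, say where it
# appears in the index built from file 2, if anywhere.
sub report_matches {
    my ( $fh, $index ) = @_;
    my @report;
    while ( my $line = <$fh> ) {
        chomp $line;
        push @report, exists $index->{$line}
            ? "'$line' is on line $index->{$line} of file2"
            : "'$line' is not in file2";
    }
    return @report;
}

# In-memory stand-ins for the two files (assumed sample data).
my $file1_text = "qwerty\nsnakegod\nebrine\ntarot\n";
my $file2_text = "snakegod\nordo rosae moriatur\ntarot\nwrath of hibernia\n";

open( my $fh2, '<', \$file2_text ) or die "Cannot open file2 data: $!";
my $index = index_lines($fh2);
close $fh2;

open( my $fh1, '<', \$file1_text ) or die "Cannot open file1 data: $!";
print "$_\n" for report_matches( $fh1, $index );
close $fh1;
```

With real files, replace the two in-memory opens with opens on the actual paths.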
"There is no shame in being self-taught, only in not trying to learn in the first place." -- Atrus, Myst: The Book of D'ni.
The hash approach is probably best, if you can guarantee that file 2 will always be small enough to fit into memory. Iterating through file 1 and looking up each line in the hash holding file 2 is an O(n) operation, since each hash lookup is O(1) on average. Building the hash takes time too, but that is done only once, so at worst you are looking at O(2n), which is still O(n) because big-O notation ignores constant multipliers. By contrast, iterating through file 1 and grepping file 2 for each line is O(n^2) (assuming the second file is about the same size as the first).
Your question is silent on one possibility: what happens if something in file 2 doesn't exist in file 1? The methods proposed will silently ignore such lines, and your question leads me to believe that's fine. It is probably not a problem, but it is something to keep in mind.
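To make the contrast concrete, here is a rough sketch of both strategies on in-memory lists. The sub names common_via_hash and common_via_grep and the sample data are illustrative assumptions, not code from this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: build the hash once over @$second, then do
# one hash lookup per line of @$first.
sub common_via_hash {
    my ( $first, $second ) = @_;
    my %seen = map { $_ => 1 } @$second;
    return grep { $seen{$_} } @$first;
}

# Hypothetical helper: rescan all of @$second for every line of
# @$first, which is where the quadratic cost comes from.
sub common_via_grep {
    my ( $first, $second ) = @_;
    my @common;
    for my $line (@$first) {
        push @common, $line if grep { $_ eq $line } @$second;
    }
    return @common;
}

# Assumed sample data standing in for the lines of each file.
my @first  = qw(qwerty snakegod ebrine tarot);
my @second = ( 'snakegod', 'ordo rosae moriatur', 'tarot', 'wrath of hibernia' );

print "hash: @{[ common_via_hash( \@first, \@second ) ]}\n";
print "grep: @{[ common_via_grep( \@first, \@second ) ]}\n";
```

Both subs return the same lines; only the amount of work differs as the files grow.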
After reading your comment (and adapting slightly from hardburn's comment), I came up with the following code using hashes (as mentioned above, and with the same cautions). It handles both the case of an entry in file 2 but not file 1 and multiple occurrences of an entry in a file (by listing the locations in the results). It does not, however, report differences in the number of occurrences of an entry between the two files. (Data files adapted from those in the comment by ikegami.)
#!/usr/bin/perl -w
use strict;
if ( scalar(@ARGV) < 2 ) {
print "Usage:\n\t$0 file1 file2\n\n";
die;
}
my @filename = ( $ARGV[0], $ARGV[1] );
my (@content);
foreach my $i ( 0, 1 ) {
open( DF, $filename[$i] )
or die("Can't open $filename[$i] for input: $!\n");
while (<DF>) {
chomp;
push( @{ $content[$i]{$_} }, $. );
}
close(DF);
}
my @keycount = (
scalar( keys( %{ $content[0] } ) ),
scalar( keys( %{ $content[1] } ) )
);
if ( $keycount[0] != $keycount[1] ) {
my @differential = @filename;
if ( $keycount[0] > $keycount[1] ) {
@differential = reverse(@filename);
}
print "Fewer values detected in ", $differential[0],
" than ", $differential[1], "\n";
}
foreach my $k ( sort( keys( %{ $content[0] } ) ) ) {
if ( defined( $content[1]{$k} ) ) {
print $k, "\n";
foreach ( 0, 1 ) {
print "\tFound in ", $filename[$_], " at line(s): ",
join( ', ', @{ $content[$_]{$k} } ), "\n";
delete( $content[$_]{$k} );
}
}
}
@keycount = (
scalar( keys( %{ $content[0] } ) ),
scalar( keys( %{ $content[1] } ) )
);
if ( $keycount[0] or $keycount[1] ) {
foreach ( 0, 1 ) {
if ( $keycount[$_] ) {
print "Found in ", $filename[$_], " but not in ",
$filename[ ( $_ + 1 ) % 2 ], ":\n";
foreach my $k ( sort( keys( %{ $content[$_] } ) ) ) {
print "\t'", $k, "' at line(s): ",
join( ', ', @{ $content[$_]{$k} } ), "\n";
delete( $content[$_]{$k} );
}
}
}
}
Sample input files:
Sample execution runs:
Hope that helps.
First, if the only reason you aren't using 'diff' is that the lines are in different locations, you can sort the files before diffing. (Depending on what you are trying to do, you may want to use sort's '-u' flag to remove duplicates.)
Note that using 'diff' is *not* the same as reading one file and comparing each line against the second file. The line-by-line approach doesn't check for extra lines that appear only in the second file. It also doesn't check for duplicates (e.g. 2 identical lines in the first file match 1 line in the second).
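As an illustration of that difference, the sketch below compares a one-way lookup with a sorted, element-by-element comparison, which is roughly the question that sorting both files and running diff answers. The sub names all_found_in_second and same_lines_ignoring_order are hypothetical, and the lists are made-up sample data.

```perl
use strict;
use warnings;

# Hypothetical helper: the one-way check. True if every line of
# @$first occurs at least once in @$second.
sub all_found_in_second {
    my ( $first, $second ) = @_;
    my %seen = map { $_ => 1 } @$second;
    for my $line (@$first) {
        return 0 unless $seen{$line};
    }
    return 1;
}

# Hypothetical helper: true only if both lists hold the same lines
# the same number of times, regardless of order.
sub same_lines_ignoring_order {
    my ( $first, $second ) = @_;
    return 0 unless @$first == @$second;
    my @sorted_first  = sort @$first;
    my @sorted_second = sort @$second;
    for my $i ( 0 .. $#sorted_first ) {
        return 0 if $sorted_first[$i] ne $sorted_second[$i];
    }
    return 1;
}

# 'tarot' appears twice in the first list and once in the second,
# and 'snakegod' appears only in the second.
my @first  = ( 'tarot', 'tarot' );
my @second = ( 'tarot', 'snakegod' );

print "one-way lookup passes: ", all_found_in_second( \@first, \@second ), "\n";
print "sorted lists identical: ", same_lines_ignoring_order( \@first, \@second ), "\n";
```

The one-way check passes on this data even though the two lists clearly differ, which is the gap described above.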
bluto
List::Compare
One of my favorite modules. It implements an exercise in the Perl Cookbook.
$ sort <file1 >file1.tmp
$ sort <file2 >file2.tmp
$ diff -q file1.tmp file2.tmp
If you don't need to know which line in file2 matched a line from file1, then I would use something like this:
open (F1, "<file1.txt") or die("Cannot open first file: $!.\n");
open (F2, "<file2.txt") or die("Cannot open second file: $!.\n");
my $file2;
{
local $/;
$file2 = <F2>;
}
while (<F1>) {
print "Line $. not found.\n" unless ($file2 =~ /^$_/m);
}
close (F1);
close (F2);
This will put all the content of file2 in a single scalar, and then check whether each line occurs in it by using a regex.
AM's suggestion works for many cases but if the text in your files contains regex metachars, you'll need to tweak the regex a bit.
For example, if you had a reference to C++ in your lines, and you use warnings (obligatory warning: you should!), then you'll get a warning about nested quantifiers.
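One possible tweak, sketched under the assumption that you want an exact, whole-line match: wrap the interpolated line in \Q...\E so that metacharacters are matched literally, and anchor both ends. The sub name has_line and the sample text are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $line occurs as a complete line in
# $text. \Q...\E escapes regex metacharacters in the interpolated
# value, so "C++" and "a.c" are matched literally.
sub has_line {
    my ( $text, $line ) = @_;
    chomp $line;
    return $text =~ /^\Q$line\E$/m ? 1 : 0;
}

# Assumed stand-in for the slurped contents of file2.
my $file2 = "I like C++\nsnakegod\nabc\n";

for my $line ( "I like C++\n", "a.c\n", "snake\n" ) {
    my $shown = $line;
    chomp $shown;
    printf "%-12s %s\n", $shown, has_line( $file2, $line ) ? 'found' : 'not found';
}
```

Without the quoting, a line such as a.c would be treated as a pattern and would also match abc.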