in reply to comparing 2 files problem

The hash approach is probably best, provided you can guarantee that file 2 will always be small enough to fit into memory. Iterating through file 1 and looking up each line in a hash holding file 2 is an O(n) operation, since each hash lookup is O(1) on average. Yes, there is some time involved in building the hash, but that's only done once, so at worst you would be looking at O(2n), which is still just O(n) because constant multipliers are dropped in big-O notation. By contrast, iterating through file 1 and grepping file 2 for each line will be O(n^2) (assuming the second file is about the same size as the first).
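As a rough sketch of the hash approach described above (not code from the original question): the helper name `common_lines` and the sample data are made up for illustration, and the lists stand in for the lines read from each file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the entries of @$first that also occur
# in @$second, using a hash built once from @$second.
sub common_lines {
    my ( $first, $second ) = @_;

    # Build the lookup table once: a single pass over the second list.
    my %seen;
    $seen{$_} = 1 for @$second;

    # A single pass over the first list; each exists() test is a
    # constant-time hash lookup on average, so the whole thing is O(n).
    return [ grep { exists $seen{$_} } @$first ];
}

# Illustrative data only; in practice these would be the chomped lines
# of file 1 and file 2.
my @file1 = qw(apple banana cherry);
my @file2 = qw(banana cherry date);

print "$_\n" for @{ common_lines( \@file1, \@file2 ) };
```

The grep-per-line alternative would rescan all of file 2 for every line of file 1, which is where the O(n^2) comes from.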

There is one possibility on which your question remained silent: what happens if something in file 2 doesn't exist in file 1? The methods proposed will silently allow that to happen, and in fact your question leads me to believe that's fine too. But just in case, you should realize that your question didn't cover that possibility. It's probably not a problem, but it is something to remember.
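If that case does matter, one way to catch it is sketched below. This is only an illustration under assumed names (`only_in_second` is hypothetical), with lists again standing in for the lines of each file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns each distinct entry of @$second that
# never appears in @$first, in order of first appearance.
sub only_in_second {
    my ( $first, $second ) = @_;

    # Lookup table of everything present in the first list.
    my %in_first = map { $_ => 1 } @$first;

    # Keep entries missing from the first list, reporting each only once.
    my %reported;
    return [ grep { !$in_first{$_} && !$reported{$_}++ } @$second ];
}

# Illustrative data only.
my @file1 = qw(apple banana);
my @file2 = qw(banana cherry cherry date);

print "Only in file 2: $_\n" for @{ only_in_second( \@file1, \@file2 ) };
```

Note that this needs file 1 in a hash as well, so the memory caveat then applies to both files.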


Dave

Replies are listed 'Best First'.
Re^2: comparing 2 files problem
by atcroft (Abbot) on Sep 07, 2004 at 17:11 UTC

    After reading your comment (and adapting slightly from hardburn's comment), I came up with the following code using hashes (as mentioned above, and with the same cautions). It handles both the case of an entry in file 2 but not in file 1, and multiple occurrences of an entry in a file (by listing the line numbers in the results). It does not, however, report a difference in the number of occurrences of an entry between the two files. (Data files adapted from those in the comment by ikegami.)

    #!/usr/bin/perl -w
    use strict;

    if ( scalar(@ARGV) < 2 ) {
        print "Usage:\n\t$0 file1 file2\n\n";
        die;
    }

    my @filename = ( $ARGV[0], $ARGV[1] );
    my (@content);

    foreach my $i ( 0, 1 ) {
        open( DF, $filename[$i] )
          or die("Can't open $filename[$i] for input: $!\n");
        while (<DF>) {
            chomp;
            push( @{ $content[$i]{$_} }, $. );
        }
        close(DF);
    }

    my @keycount = (
        scalar( keys( %{ $content[0] } ) ),
        scalar( keys( %{ $content[1] } ) )
    );
    if ( $keycount[0] != $keycount[1] ) {
        my @differential = @filename;
        if ( $keycount[0] > $keycount[1] ) {
            @differential = reverse(@filename);
        }
        print "Fewer values detected in ", $differential[0], " than ",
          $differential[1], "\n";
    }

    foreach my $k ( sort( keys( %{ $content[0] } ) ) ) {
        if ( defined( $content[1]{$k} ) ) {
            print $k, "\n";
            foreach ( 0, 1 ) {
                print "\tFound in ", $filename[$_], " at line(s): ",
                  join( ', ', @{ $content[$_]{$k} } ), "\n";
                delete( $content[$_]{$k} );
            }
        }
    }

    @keycount = (
        scalar( keys( %{ $content[0] } ) ),
        scalar( keys( %{ $content[1] } ) )
    );
    if ( $keycount[0] or $keycount[1] ) {
        foreach ( 0, 1 ) {
            if ( $keycount[$_] ) {
                print "Found in ", $filename[$_], " but not in ",
                  $filename[ ( $_ + 1 ) % 2 ], ":\n";
                foreach my $k ( sort( keys( %{ $content[$_] } ) ) ) {
                    print "\t'", $k, "' at line(s): ",
                      join( ', ', @{ $content[$_]{$k} } ), "\n";
                    delete( $content[$_]{$k} );
                }
            }
        }
    }

    Sample input files:

    Sample execution runs:

    Hope that helps.