in reply to Comparing Large Files
If your files are too big to handle with your system utilities, or to build a hash from the raw lines of the first file to compare the second against, you can trade some speed for memory.
Build the hash from the MD5 sum (or other digest) of each line, then compare the MD5s of the lines in the second file against it. Slower, but it uses much less memory.
Using this on two files around 125MB in size, with over 2 million lines in each, it found the 3 differing lines (all appended to the end of the second file) in around 8 minutes on my lowly PII/233MHz. Memory consumption was < 3MB, though many of the lines were identical; at 16 bytes (+overhead) per unique line (roughly 32MB of key data for 2 million fully unique lines, before the hash's per-entry overhead), that may be enough to squeeze the hash into RAM.
On two 150k / 2,500-line files with a dozen single-character changes, it found all the changed lines in less than a second.
I also have a version that uses a bigger buffer, which shows some benefit on the larger files but is slower on the smaller ones (a rough sketch of that approach follows the code and sample run below).
#! perl -slw
use strict;
use Digest::MD5 qw[md5];

my %h;

open my $f1, '<', $ARGV[0] or die $!;
while( <$f1> ) {
    $h{ md5 $_ } = undef;
}
close $f1;

print "Lines found in $ARGV[1] not seen in $ARGV[0]";

open my $f2, '<', $ARGV[1] or die $!;
while( <$f2> ) {
    print "$.:$_" unless exists $h{ md5 $_ };
}
close $f2;

__END__

E:\>copy bigfile.dat bigfile2.dat
        1 file(s) copied.

E:\>echo "this line was added to bigfile2" >>bigfile2.dat

E:\>echo "and this line was added to bigfile2" >>bigfile2.dat

E:\>echo "and so was this line added to bigfile2" >>bigfile2.dat

E:\>prompt $T$S$P$G

 4:03:28.61 D:\Perl\test>258709 e:\bigfile.dat e:\bigfile2.dat
Lines found in e:\bigfile2.dat not seen in e:\bigfile.dat
2378882:"this line was added to bigfile2"
2378883:"and this line was added to bigfile2"
2378884:"and so was this line added to bigfile2"

 4:11:28.82 D:\Perl\test>
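The larger-buffer variant isn't shown above, so here is a minimal sketch of one way it could look: read each file in fixed-size chunks with read() instead of line-by-line <>, split each chunk into lines, and carry any partial last line over to the next chunk. The each_line helper and the 4MB buffer size are illustrative assumptions, not taken from my script above.

#! perl -slw
use strict;
use Digest::MD5 qw[md5];

my $BUFSIZE = 4 * 1024 * 1024;   # assumed buffer size; tune to taste

sub each_line {
    my( $path, $cb ) = @_;
    open my $fh, '<', $path or die $!;
    my $tail = '';
    while( read( $fh, my $chunk, $BUFSIZE ) ) {
        $tail .= $chunk;
        my @lines = split /\n/, $tail, -1;
        $tail = pop @lines;              # carry the partial last line over
        $cb->( $_ ) for @lines;
    }
    $cb->( $tail ) if length $tail;      # final line with no trailing newline
    close $fh;
}

my %h;
each_line( $ARGV[0], sub { $h{ md5 $_[0] } = undef } );

print "Lines found in $ARGV[1] not seen in $ARGV[0]";

my $n = 0;
each_line( $ARGV[1], sub {
    ++$n;
    print "$n:$_[0]" unless exists $h{ md5 $_[0] };
} );

Carrying the partial last line across chunk boundaries is the important bit; without it, a line that straddles two chunks would be hashed as two fragments and show up as a spurious difference.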