in reply to Comparing Large Files
If your files are too big to handle with your system utilities, or to build a hash from the raw lines of the first file to compare the second against, you can trade some speed for memory.
Build the hash from the MD5 sum (or other digest) of each line, then compare the MD5s of the lines in the second file against it. Slower, but it uses much less memory.
Using this on two files around 125MB in size, with over 2 million lines in each, it found the 3 differing lines (all appended to the end of the second file) in around 8 minutes on my lowly PII/233MHz. Memory consumption was < 3MB, though many of the lines were identical; at 16 bytes (+overhead) per unique line (roughly 32MB of key data for 2 million fully unique lines, before the hash's per-entry overhead), that may be enough to squeeze the hash into RAM.
On two 150k / 2,500-line files with a dozen single-character changes, it found all the changed lines in less than a second.
I also have a version that uses a bigger buffer, which shows some benefit on the larger files but is slower on the smaller ones (a rough sketch of that approach follows the code and sample run below).
#! perl -slw
use strict;
use Digest::MD5 qw[md5];

my %h;

open my $f1, '<', $ARGV[0] or die $!;
while( <$f1> ) {
    $h{ md5 $_ } = undef;
}
close $f1;

print "Lines found in $ARGV[1] not seen in $ARGV[0]";

open my $f2, '<', $ARGV[1] or die $!;
while( <$f2> ) {
    print "$.:$_" unless exists $h{ md5 $_ };
}
close $f2;

__END__

E:\>copy bigfile.dat bigfile2.dat
        1 file(s) copied.

E:\>echo "this line was added to bigfile2" >>bigfile2.dat

E:\>echo "and this line was added to bigfile2" >>bigfile2.dat

E:\>echo "and so was this line added to bigfile2" >>bigfile2.dat

E:\>prompt $T$S$P$G

 4:03:28.61 D:\Perl\test>258709 e:\bigfile.dat e:\bigfile2.dat
Lines found in e:\bigfile2.dat not seen in e:\bigfile.dat
2378882:"this line was added to bigfile2"
2378883:"and this line was added to bigfile2"
2378884:"and so was this line added to bigfile2"

 4:11:28.82 D:\Perl\test>
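The larger-buffer variant isn't shown above, so here is a minimal sketch of one way it could look: read each file in fixed-size chunks with read() instead of line-by-line <>, split each chunk into lines, and carry any partial last line over to the next chunk. The each_line helper and the 4MB buffer size are illustrative assumptions, not taken from my script above.

#! perl -slw
use strict;
use Digest::MD5 qw[md5];

my $BUFSIZE = 4 * 1024 * 1024;   # assumed buffer size; tune to taste

sub each_line {
    my( $path, $cb ) = @_;
    open my $fh, '<', $path or die $!;
    my $tail = '';
    while( read( $fh, my $chunk, $BUFSIZE ) ) {
        $tail .= $chunk;
        my @lines = split /\n/, $tail, -1;
        $tail = pop @lines;              # carry the partial last line over
        $cb->( $_ ) for @lines;
    }
    $cb->( $tail ) if length $tail;      # final line with no trailing newline
    close $fh;
}

my %h;
each_line( $ARGV[0], sub { $h{ md5 $_[0] } = undef } );

print "Lines found in $ARGV[1] not seen in $ARGV[0]";

my $n = 0;
each_line( $ARGV[1], sub {
    ++$n;
    print "$n:$_[0]" unless exists $h{ md5 $_[0] };
} );

Carrying the partial last line across chunk boundaries is the important bit; without it, a line that straddles two chunks would be hashed as two fragments and show up as a spurious difference.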