If your files are too big to handle with your system utilities, or too big to build a hash from the lines of the first file to compare the second against, you can trade some speed for memory.

Build the hash from the MD5 sum (or another digest) of each line in the first file, then check the MD5 of each line in the second file against it. It's slower, but uses less memory.

Using this on two files of around 125MB and over 2 million lines each, it found the 3 differing lines (all appended to the end of the second file) in around 8 minutes on my lowly PII/233MHz. Memory consumption was < 3MB, though many of the lines were identical. At 16 bytes (+overhead) per unique line, that may be enough to squeeze the hash into RAM.
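To illustrate where the 16 bytes comes from: the binary digest returned by md5 is the same length no matter how long the line is. A minimal sketch follows; digest_key_length is a made-up helper name for illustration only, and the per-key hash overhead that Perl adds on top is not measured here.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5 md5_hex);

# Hypothetical helper: how many bytes the binary MD5 of a line occupies
# when it is used as a hash key.
sub digest_key_length {
    my ($line) = @_;
    return length md5($line);
}

my $short = "x\n";
my $long  = ( "x" x 10_000 ) . "\n";

printf "short line key: %d bytes\n", digest_key_length($short);   # 16
printf "long line key:  %d bytes\n", digest_key_length($long);    # 16
printf "hex digest key: %d bytes\n", length md5_hex($long);       # 32
```

This is why the binary md5 form is the better key here: md5_hex would double the key size for no gain.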

On two 150k/2500-line files with a dozen 1-char changes, it found all the changed lines in less than a second.

I also have a version that uses a bigger buffer, which shows some benefit on the larger files but is slower on the smaller ones.

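#! perl -slw
use strict;
use Digest::MD5 qw[md5];

# Keys are the 16-byte binary MD5 digests of the lines in the first file.
my %h;

open my $f1, '<', $ARGV[0] or die "$ARGV[0]: $!";
while( <$f1> ) {
        $h{ md5 $_ } = undef;
}
close $f1;

print "Lines found in $ARGV[1] not seen in $ARGV[0]";

# Report each line of the second file whose digest was not seen in the first.
open my $f2, '<', $ARGV[1] or die "$ARGV[1]: $!";
while( <$f2> ) {
        print "$.:$_" unless exists $h{ md5 $_ };
}
close $f2;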

 __END__
E:\>copy bigfile.dat bigfile2.dat
        1 file(s) copied.
E:\>echo "this line was added to bigfile2" >>bigfile2.dat
E:\>echo "and this line was added to bigfile2" >>bigfile2.dat
E:\>echo "and so was this line added to bigfile2" >>bigfile2.dat
E:\>prompt $T$S$P$G

 4:03:28.61 D:\Perl\test>258709 e:\bigfile.dat e:\bigfile2.dat

Lines found in e:\bigfile2.dat not seen in e:\bigfile.dat

2378882:<START"this line was added to bigfile2"
2378883:"and this line was added to bigfile2"
2378884:"and so was this line added to bigfile2"

 4:11:28.82 D:\Perl\test>
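The bigger-buffer version mentioned above isn't shown in this post. The following is only a guess at what such a variant might look like, not the actual code: it reads each file in fixed-size chunks with read and splits the chunks into lines itself. The names each_line_buffered and lines_not_in_first, and the 1MB buffer size, are assumptions made up for this sketch.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

# Hypothetical helper: call $cb->( $line_no, $line ) for every line of $path,
# reading the file in chunks of $bufsize bytes rather than line by line.
# Files are read in :raw mode, so any "\r" stays part of the line; that is
# harmless as long as both files are read the same way.
sub each_line_buffered {
    my ( $path, $bufsize, $cb ) = @_;
    open my $fh, '<:raw', $path or die "$path: $!";
    my ( $tail, $n ) = ( '', 0 );
    while ( read( $fh, my $chunk, $bufsize ) ) {
        $chunk = $tail . $chunk;
        # Everything up to the last newline is complete lines; keep the rest.
        my $cut = rindex( $chunk, "\n" ) + 1;
        $tail = substr( $chunk, $cut );
        $cb->( ++$n, $_ ) for split /^/, substr( $chunk, 0, $cut );
    }
    $cb->( ++$n, $tail ) if length $tail;    # last line had no newline
    close $fh;
    return $n;
}

# Return [ line_no, line ] pairs for lines of $second whose MD5 was not
# seen among the lines of $first.
sub lines_not_in_first {
    my ( $first, $second, $bufsize ) = @_;
    my ( %seen, @new );
    each_line_buffered( $first, $bufsize,
        sub { $seen{ md5( $_[1] ) } = undef } );
    each_line_buffered( $second, $bufsize,
        sub { push @new, [@_] unless exists $seen{ md5( $_[1] ) } } );
    return @new;
}

if ( @ARGV == 2 ) {
    print "Lines found in $ARGV[1] not seen in $ARGV[0]\n";
    printf "%d:%s", @$_ for lines_not_in_first( @ARGV, 1 << 20 );
}
```

Whether this beats plain line-at-a-time reads will depend on the file sizes and the platform's I/O layer, so time it on your own data before relying on it.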

Examine what is said, not who speaks.

"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller

In reply to Re: Comparing Large Files by BrowserUk
in thread Comparing Large Files by Anonymous Monk
