Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have two very large text files (about 300MB each) and I want to find the set of lines (defined as text between \n characters) that appear in one file but not the other. This would be easy if there weren't so much data, and I can't brute force it either... each file contains about 1.4 million lines. On the other hand, the files are somewhat ordered. I'm not sure what the disparity is, but for any given line in fileA, if it's in fileB it will appear within N lines of the previous match. I suspect that maintaining some kind of look-ahead cache will be the trick, but I'm not sure how to do it.

Replies are listed 'Best First'.
Re: Comparing Large Files
by blokhead (Monsignor) on May 16, 2003 at 18:24 UTC
    Algorithm::Diff or Text::Diff might be good starting places if you haven't checked them out yet.
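    For instance, Text::Diff can produce a unified diff of two files in a few lines. A minimal sketch (the filenames are placeholders for whatever your files are actually called); note that a full diff of two 300MB files may itself be slow or memory-hungry:

        use strict;
        use warnings;
        use Text::Diff;

        # diff() accepts plain filenames and returns the whole diff as a string
        my $diff = diff 'fileA.txt', 'fileB.txt', { STYLE => 'Unified' };
        print $diff;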

    I can't tell from your description of the data, but maybe something like this would also work: (Update: ignore this code -- see the one below that uses Tie::File)

    my @search_space;
    my $N = 100;

    foreach my $line (<FILE_A>) {
        # fill up @search_space with lines from FILE_B until it has N things
        # (of course, this pukes when FILE_B reaches EOF ;)
        # note the scalar: without it, <FILE_B> in list context would slurp
        # the whole file on the first push
        push(@search_space, scalar <FILE_B>) while (@search_space < $N);

        # look for $line in @search_space
        my $pos = find(\@search_space, $line);
        if ($pos >= 0) {
            # shift off the stuff in @search_space before the match
            splice(@search_space, 0, $pos);
        }
        else {
            print "No match for $line";
        }
    }

    sub find {
        my ($arr_ref, $find) = @_;
        $arr_ref->[$_] eq $find and return $_ for (0 .. $#{$arr_ref});
        return -1;
    }
    Are you just searching for the contents of one file within another, or do you also care about things that are in FILE_B and missing from FILE_A?

    Update: well, this only looked *ahead* N lines for the next match, but the important idea is to only search through a small buffer at a time. Now that I think of it, you could use Tie::File to tie FILE_B to an array: keep track of the line number of the last match, and only search within a 2*N slice of the array/file each time. That makes it easier if the matches back up, but of course your performance depends on the speed of Tie::File.

    Here's code looking forward and backwards up to N lines from the last match using the method I just mentioned (untested):

    use Tie::File;

    my $N = 100;
    my @file_b;
    tie @file_b, 'Tie::File', 'path/to/file_b' or die "tie failed: $!";

    my $last_match = 0;
    for my $line (<FILE_A>) {
        chomp $line;
        my $min = $last_match - $N >= 0        ? $last_match - $N : 0;
        my $max = $last_match + $N <= $#file_b ? $last_match + $N : $#file_b;

        # find() as defined above; pass the slice as a real array reference
        my $pos = find([ @file_b[$min .. $max] ], $line);
        if ($pos >= 0) {
            $last_match = $min + $pos;
        }
        else {
            print "No match for $line\n";
        }
    }

    blokhead

Re: Comparing Large Files
by strat (Canon) on May 17, 2003 at 09:14 UTC
    <FUN> To print the identical lines in both files with a one liner from shell, you could use
    perl -MDigest::MD5 -ne "print if ($seen{&Digest::MD5::md5_hex($_)} .= @ARGV) =~ /10$/" fileA.txt fileB.txt > output.txt
    (see also 36725)

    </FUN>

    If the files are too big to fit into memory, but you want to deal with the lines as if they were arrays, you could have a look at Tie::File:

    use Tie::File;

    my @content;
    tie (@content, 'Tie::File', $file)
        or die "Error: couldn't tie $file: $!\n";

    # .... and treat the file content like an array, e.g.
    print scalar(@content);
    # .....

    untie (@content);
    For more details, see perldoc Tie::File; just beware that changes made to such a tied array are also written back to the underlying file.

    Best regards,
    perl -e "s>>*F>e=>y)\*martinF)stronat)=>print,print v8.8.8.32.11.32"

Re: Comparing Large Files
by BrowserUk (Patriarch) on May 17, 2003 at 04:29 UTC

    If your files are too big to handle with your system utilities, or to build a hash from the lines of the first file to compare the second against, you can trade a little speed for a much smaller memory footprint.

    Build the hash from the MD5 sum (or another digest) of each line, then compare the MD5s of the lines in the second file against that hash. Slower, but it uses much less memory.

    Using this on two files around 125MB in size, each with over 2 million lines, it found the 3 different lines (all appended to the end of the second file) in around 8 minutes on my lowly PII/233MHz. Memory consumption was < 3MB, though many of the lines were identical; at 16 bytes (+ overhead) per unique line, that may be enough to squeeze the hash into RAM.

    On two 150k, 2500-line files with a dozen one-character changes, it found all the changed lines in less than a second.

    I also have a version that uses a bigger buffer which shows some benefits on the larger file but is slower on the smaller one.

    #! perl -slw
    use strict;
    use Digest::MD5 qw[md5];

    my %h;
    open my $f1, '<', $ARGV[0] or die $!;
    while( <$f1> ) {
        $h{ md5 $_ } = undef;
    }
    close $f1;

    print "Lines found in $ARGV[1] not seen in $ARGV[0]";

    open my $f2, '<', $ARGV[1] or die $!;
    while( <$f2> ) {
        print "$.:$_" unless exists $h{ md5 $_ };
    }
    close $f2;

    __END__
    E:\>copy bigfile.dat bigfile2.dat
            1 file(s) copied.

    E:\>echo "this line was added to bigfile2" >>bigfile2.dat
    E:\>echo "and this line was added to bigfile2" >>bigfile2.dat
    E:\>echo "and so was this line added to bigfile2" >>bigfile2.dat

    E:\>prompt $T$S$P$G

     4:03:28.61 D:\Perl\test>258709 e:\bigfile.dat e:\bigfile2.dat
    Lines found in e:\bigfile2.dat not seen in e:\bigfile.dat
    2378882:"this line was added to bigfile2"
    2378883:"and this line was added to bigfile2"
    2378884:"and so was this line added to bigfile2"

     4:11:28.82 D:\Perl\test>

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
Re: Comparing Large Files
by BazB (Priest) on May 16, 2003 at 17:53 UTC
    I'm not sure I quite get what you're after, but on UNIX man bdiff or man comm might point you in the right direction.

    If the information in this post is inaccurate, or just plain wrong, don't just downvote - please post explaining what's wrong.
    That way everyone learns.

      Thanks. I don't have bdiff, and comm didn't quite work... not sure why. I ran it (it took a while) and checked one of the first lines in the output... it was in both files, so that isn't what I wanted either. The ordering in the two files is close, but not perfect (that is, one could have abcd and the other adbc), and the distance should be limited to within a couple hundred lines. I suspect I'll have to write a custom script for this, but I could use some help.
Re: Comparing Large Files
by bluto (Curate) on May 16, 2003 at 18:35 UTC
    If you find that 'diff' style solutions are choking on the sheer size of the file, and you can get away with sorting, I suggest doing that. You may be able to tweak something like gnu diff to work, or some perl diff solution, eventually, but I haven't had good luck doing similar things.

    A decent 'sort' command should work reasonably quickly, but you may need to send the temporary files to another directory (GNU sort has '-T' for this). Doing something similar, I've sorted a file over a GB in size, with over 6 million lines, in less than an hour on a reasonable AIX box (probably more like half an hour). It shouldn't be hard to write something similar to comm.

    FWIW, if you really must maintain ordering, I'd suggest rewriting the file with line numbers appended to each line, then sorting, ignoring the line numbers during the 'comm' phase (see the sketch below). You can reconstruct the original line ordering from the appended line numbers afterward.
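    The numbering step could look something like this (a rough sketch; the filenames are invented for illustration). Tagging each line with its original position means a plain sort can be undone later:

        use strict;
        use warnings;

        open my $in,  '<', 'fileA.txt'          or die "fileA.txt: $!";
        open my $out, '>', 'fileA.numbered.txt' or die "fileA.numbered.txt: $!";
        while ( my $line = <$in> ) {
            chomp $line;
            # original text, a tab, then the original line number ($.)
            print {$out} "$line\t$.\n";
        }
        close $in;
        close $out or die "close: $!";

    The comparison step then looks only at the text before the tab; once the lines unique to one file have been found, sorting them numerically on the trailing field restores the original order.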

    Update: s/conn/comm/

    bluto

Re: Comparing Large Files
by tilly (Archbishop) on May 17, 2003 at 21:27 UTC
    Can you use some disk space? Personally, I would tie a hash to a BTREE on disk with DB_File, scan one file into the BTREE, then go through the other and do my test. The buffering is not essential to the performance, but a BTREE happens to buffer this sort of access quite nicely, so it should perform very well.
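    A minimal sketch of that approach, with made-up filenames and a throwaway scratch file for the BTREE:

        use strict;
        use warnings;
        use Fcntl;        # O_RDWR, O_CREAT
        use DB_File;

        # tie a hash to an on-disk BTREE so the keys never have to fit in RAM
        my %seen;
        tie %seen, 'DB_File', 'seen.btree', O_RDWR|O_CREAT, 0644, $DB_BTREE
            or die "Cannot tie seen.btree: $!";

        # pass 1: record every line of the first file
        open my $fa, '<', 'fileA.txt' or die "fileA.txt: $!";
        while ( my $line = <$fa> ) {
            $seen{$line} = 1;
        }
        close $fa;

        # pass 2: report lines of the second file that were never seen
        open my $fb, '<', 'fileB.txt' or die "fileB.txt: $!";
        while ( my $line = <$fb> ) {
            print "Only in fileB: $line" unless exists $seen{$line};
        }
        close $fb;

        untie %seen;
        unlink 'seen.btree';    # throw away the scratch file

    Swap the filenames and run it again to get the lines that appear only in fileA.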
Re: Comparing Large Files
by Anonymous Monk on May 16, 2003 at 18:08 UTC
    I had to do this last year myself. I was working on a Solaris box, but the Unix compare commands were too slow, i.e. they would run overnight with no result. :( I found that the Win2000 file comparison tool "fc" works very fast (1-15 minutes depending on the data size). This was on a PIII 660MHz w/ 256MB RAM.

    Dogz

      The problem with the core diff tools is that they assume the two files are in perfect order... which they aren't. You could almost think of the problem as a comparison of two sets of lines, where I want to know which lines aren't in both sets (ignoring order). It would be a nasty problem if there were truly no order to the lines, but thankfully they are mostly in the same order, with some chunks out of order by a couple hundred lines (which is nothing given the size of the files... around 1.4 million lines).

        so just put them in the same order...

        PROMPT% cat file1
        cat
        dog
        bird
        parrot
        perl
        yakko
        wakko
        PROMPT% cat file2
        cat
        bird
        perl
        parrot
        yakko
        dot
        PROMPT% sort file1 > file1.sort
        PROMPT% sort file2 > file2.sort
        PROMPT% comm -23 file1.sort file2.sort > only.in.file1
        PROMPT% comm -13 file1.sort file2.sort > only.in.file2
        PROMPT% cat only.in.file1
        dog
        wakko
        PROMPT% cat only.in.file2
        dot

        I'll admit, sort isn't the fastest thing in the world for large files, but if you only need to run this once, then coming up with a complicated algorithm (which may or may not have bugs, depending on your assumptions about this "N" you mentioned) probably isn't worth it.

        An ugly solution I can think of is breaking the files up into smaller, overlapping sub-parts that will fit into memory. Read the smaller sections of each file into an array, then pick out the lines that are in one but not the other. Note the lines still unmatched in the last thousand lines or so, to make sure they don't show up in the next sub-part.
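        A rough sketch of that idea (the chunk and overlap sizes, and the filenames, are invented for illustration). It checks each chunk of fileA against a hash built from a sliding window of fileB lines that extends one overlap's worth in each direction, so lines that have drifted by up to $OVERLAP positions still match; it reports only one direction, lines of fileA missing from fileB:

            use strict;
            use warnings;

            my $CHUNK   = 50_000;   # lines of fileA compared per pass
            my $OVERLAP =  1_000;   # how far out of order lines may be

            open my $fa, '<', 'fileA.txt' or die "fileA.txt: $!";
            open my $fb, '<', 'fileB.txt' or die "fileB.txt: $!";

            # prime the fileB window: one chunk plus one overlap of lookahead
            my @b_win;
            push @b_win, scalar <$fb>
                while @b_win < $CHUNK + $OVERLAP && !eof($fb);

            while ( !eof($fa) ) {
                my @a_chunk;
                push @a_chunk, scalar <$fa> while @a_chunk < $CHUNK && !eof($fa);

                my %in_b = map { $_ => 1 } @b_win;
                print "Only in fileA: $_" for grep { !$in_b{$_} } @a_chunk;

                # slide the window: keep the trailing 2*$OVERLAP fileB lines,
                # then read ahead so the next chunk again has an overlap on
                # either side of it
                splice @b_win, 0, @b_win - 2 * $OVERLAP if @b_win > 2 * $OVERLAP;
                push @b_win, scalar <$fb>
                    while @b_win < $CHUNK + 2 * $OVERLAP && !eof($fb);
            }
            close $fa;
            close $fb;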