Algorithm::Diff or Text::Diff might be good starting places if you haven't checked them out yet.

I can't tell from your description of the data, but maybe something like this would also work: (Update: ignore this code -- see the one below that uses Tie::File)

    my @search_space;
    my $N = 100;

    while (my $line = <FILE_A>) {
        # top up @search_space with lines from FILE_B until it holds N lines
        # (or FILE_B reaches EOF); read one line at a time, not the whole file
        while (@search_space < $N) {
            my $next = <FILE_B>;
            last unless defined $next;
            push @search_space, $next;
        }

        # look for $line in @search_space
        my $pos = find(\@search_space, $line);
        if ($pos >= 0) {
            # shift off the stuff in @search_space before the match
            splice(@search_space, 0, $pos);
        }
        else {
            print "No match for $line";
        }
    }

    # return the index of the first element equal to $find, or -1
    sub find {
        my ($arr_ref, $find) = @_;
        $arr_ref->[$_] eq $find and return $_ for (0 .. $#{$arr_ref});
        return -1;
    }

Are you just searching for the contents of one file within another, or do you also care about things that are in FILE_B and missing from FILE_A?

Update: well, this only looked *ahead* N lines for the next match, but the important idea is to only search through a small buffer at a time. Now that I think of it, you could use Tie::File to tie FILE_B to an array: keep track of the line number of the last match, and only search within a 2*N slice of the array/file each time. That makes it easier to handle matches that fall before the previous one, but of course your performance depends on the speed of Tie::File.

Here's code looking forwards and backwards up to N lines from the last match using the method I just mentioned (untested):

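    use Tie::File;

    my $N = 100;
    my @file_b;
    tie @file_b, 'Tie::File', 'path/to/file_b' or die "Cannot tie file_b: $!";
    my $last_match = 0;

    while (my $line = <FILE_A>) {
        chomp $line;   # Tie::File records come back without the newline
        my $min = $last_match - $N >= 0        ? $last_match - $N : 0;
        my $max = $last_match + $N <= $#file_b ? $last_match + $N : $#file_b;

        # copy the window into an anonymous array; \@file_b[$min .. $max]
        # would give a list of element references, not an array ref.
        # find() is the sub from the first snippet.
        my $pos = find([ @file_b[$min .. $max] ], $line);
        if ($pos >= 0) {
            $last_match = $min + $pos;
        }
        else {
            print "No match for $line\n";
        }
    }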
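
Since the above is untested, here is a rough, self-contained sketch of the same window logic run against plain in-memory arrays, so it can be checked without any files. The sub name unmatched_within_window and the sample data are made up for illustration; with real data the second array could be one tied with Tie::File.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines of @$a_ref that have no match in
# @$b_ref within $n lines either side of the previous match.
sub unmatched_within_window {
    my ($a_ref, $b_ref, $n) = @_;
    my $last_match = 0;
    my @unmatched;

    for my $line (@$a_ref) {
        # clamp the window to the bounds of @$b_ref
        my $min = $last_match - $n >= 0        ? $last_match - $n : 0;
        my $max = $last_match + $n <= $#$b_ref ? $last_match + $n : $#$b_ref;

        # scan only the window, not the whole array; take the first hit
        my ($pos) = grep { $b_ref->[$_] eq $line } $min .. $max;

        if (defined $pos) {
            $last_match = $pos;
        }
        else {
            push @unmatched, $line;
        }
    }
    return @unmatched;
}

# made-up sample data: 'a' sits before the previous match, 'zzz' is absent
my @file_a = qw(b a c zzz e);
my @file_b = qw(a b c d e f g);

print "No match for $_\n" for unmatched_within_window(\@file_a, \@file_b, 2);
# prints: No match for zzz
```

Note that a line which does exist in the second array but lies more than N lines from the last match is also reported as unmatched, so N has to be at least as large as the biggest gap you expect.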

blokhead


In reply to Re: Comparing Large Files by blokhead
in thread Comparing Large Files by Anonymous Monk
