
Algorithm::Diff or Text::Diff might be good starting places if you haven't checked them out yet.

I can't tell from your description of the data, but maybe something like this would also work: (Update: ignore this code -- see the one below that uses Tie::File)

  my @search_space;
  my $N = 100;
  while (my $line = <FILE_A>) {
      # fill up @search_space with lines from FILE_B until it has N things
      # (or FILE_B hits EOF); scalar() reads one line instead of slurping

      while (@search_space < $N && !eof(FILE_B)) {
          push(@search_space, scalar <FILE_B>);
      }

      # look for $line in @search_space

      my $pos = find(\@search_space, $line);
      if ($pos >= 0) {
          splice(@search_space, 0, $pos);
          # shift off the stuff in @search_space before the match
      } else {
          print "No match for $line";
      }

  }

  sub find {
      my ($arr_ref, $find) = @_;
      $arr_ref->[$_] eq $find and return $_ for (0 .. $#{$arr_ref});
      return -1;
  }

Are you just searching for the contents of one file within another, or do you also care about things that are in FILE_B and missing from FILE_A?

Update: well, this only looked *ahead* N lines for the next match, but the important idea is to only search through a small buffer at a time. Now that I think of it, you could use Tie::File to tie FILE_B to an array: keep track of the line number of the last match, and only search within a 2*N slice of the array/file each time. That makes it easier to handle matches that move backwards in FILE_B, but of course your performance depends on the speed of Tie::File.

Here's code looking forward and backwards up to N lines from the last match using the method I just mentioned (untested):

  use Tie::File;

  my $N = 100;
  my @file_b;
  tie @file_b, 'Tie::File', 'path/to/file_b' or die "Can't tie file_b: $!";
  my $last_match = 0;

  while (my $line = <FILE_A>) {
      chomp $line;
      my $min = $last_match - $N >= 0 ? $last_match - $N : 0;
      my $max = $last_match + $N <= $#file_b ? $last_match + $N : $#file_b;

      # \@file_b[...] would give a list of element refs, so copy the slice
      my $pos = find([ @file_b[$min .. $max] ], $line);
      if ($pos >= 0) {
          $last_match = $min + $pos;
      } else {
          print "No match for $line\n";
      }
  }
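If you want to try the windowing idea without touching any files, here is a minimal self-contained sketch. It uses plain in-memory arrays in place of FILE_A and the tied FILE_B; the helper name find_near and the sample data are made up for illustration, so treat it as a toy rather than the real thing.

```perl
use strict;
use warnings;

# Hypothetical helper: look for $line in @$b_lines, but only within
# $n lines either side of $last_match. Returns the absolute index, or -1.
sub find_near {
    my ($b_lines, $line, $last_match, $n) = @_;
    my $min = $last_match - $n >= 0 ? $last_match - $n : 0;
    my $max = $last_match + $n <= $#{$b_lines} ? $last_match + $n : $#{$b_lines};
    for my $i ($min .. $max) {
        return $i if $b_lines->[$i] eq $line;
    }
    return -1;
}

# Made-up sample data standing in for the two files.
my @file_a = qw(apple cherry banana zebra);
my @file_b = qw(apple banana cherry date elderberry fig zebra);

my $N = 2;
my $last_match = 0;
my @results;
for my $line (@file_a) {
    my $pos = find_near(\@file_b, $line, $last_match, $N);
    if ($pos >= 0) {
        $last_match = $pos;                # re-centre the window on this match
        push @results, "$line: line $pos";
    } else {
        push @results, "No match for $line";
    }
}
print "$_\n" for @results;
```

With this data, banana is found even though it sits before the previous match, while zebra is reported as missing because it lies more than N lines past the last match. That is the trade-off with a small window: pick N to be at least as large as the biggest gap you expect between matches.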

blokhead

In reply to Re: Comparing Large Files by blokhead
in thread Comparing Large Files by Anonymous Monk
