I can't tell from your description of the data, but maybe something like this would also work. (Update: ignore this code -- see the version below that uses Tie::File.)
Are you just searching for the contents of one file within another, or do you also care about things that are in FILE_B and missing from FILE_A?

    my @search_space;
    my $N = 100;

    foreach my $line (<FILE_A>) {
        # fill up @search_space with lines from FILE_B until it has N things
        # (scalar context reads one line at a time; the eof check keeps this
        # from pushing undefs forever when FILE_B runs out)
        push(@search_space, scalar <FILE_B>)
            while @search_space < $N and not eof(FILE_B);

        # look for $line in @search_space
        my $pos = find(\@search_space, $line);
        if ($pos >= 0) {
            # shift off the stuff in @search_space before the match
            splice(@search_space, 0, $pos);
        }
        else {
            print "No match for $line";
        }
    }

    sub find {
        my ($arr_ref, $find) = @_;
        $arr_ref->[$_] eq $find and return $_ for 0 .. $#{$arr_ref};
        return -1;
    }
Update: well, this only looked *ahead* N lines for the next match, but the important idea is to search only through a small buffer at a time. Now that I think of it, you could use Tie::File to tie FILE_B to an array: keep track of the line number of the last match, and search only within a 2*N slice of the array/file each time. That also handles matches that fall *behind* the last one, but of course your performance depends on the speed of Tie::File.
Here's code that looks forward and backward up to N lines from the last match, using the method I just mentioned (untested):
    use Tie::File;

    my $N = 100;
    my @file_b;
    tie @file_b, 'Tie::File', 'path/to/file_b' or die "can't tie file_b: $!";

    my $last_match = 0;
    for my $line (<FILE_A>) {
        chomp $line;    # Tie::File strips record separators, so strip here too
        my $min = $last_match - $N >= 0        ? $last_match - $N : 0;
        my $max = $last_match + $N <= $#file_b ? $last_match + $N : $#file_b;

        # copy the slice into a fresh array ref for find() (defined above);
        # \@file_b[$min .. $max] would give a list of refs, not an array ref
        my $pos = find([ @file_b[$min .. $max] ], $line);
        if ($pos >= 0) {
            $last_match = $min + $pos;
        }
        else {
            print "No match for $line\n";
        }
    }
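If Tie::File itself turns out to be the bottleneck, you can give it a bigger read cache than its default (about 2MB) via the memory option. A minimal sketch, using the same placeholder path as above; only the tie line changes:

    use Tie::File;

    my @file_b;
    tie @file_b, 'Tie::File', 'path/to/file_b',
        memory => 20_000_000    # ~20MB cache instead of the ~2MB default
        or die "can't tie file_b: $!";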
blokhead