Re: Tabulating Data Across Multiple Large Files

The general idea of "cosequential processing" is that the files are sorted on the key or keys you are matching on. That way you only need to keep one record from each file in memory at a time, although you may have to rewind if you are doing a many-to-many join. More details about what you are doing now would be helpful.
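
To make that concrete, here is a minimal sketch of a cosequential match over two files, not a drop-in solution. It assumes comma-separated records with the key in the first field, both inputs sorted on that key in the same string order, and each key appearing at most once per file (so no rewinding is needed). The names next_record and match_sorted are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: read one line and split it into [key, rest].
# Assumes the key is the first comma-separated field.
sub next_record {
    my ($fh) = @_;
    my $line = <$fh>;
    return unless defined $line;          # end of file
    chomp $line;
    my ($key, $rest) = split /,/, $line, 2;
    return [ defined $key ? $key : '', defined $rest ? $rest : '' ];
}

# Walk two key-sorted filehandles in step. Only one record per file
# is held at a time. $on_match is called as (key, rest_a, rest_b).
sub match_sorted {
    my ($fh_a, $fh_b, $on_match) = @_;
    my $rec_a = next_record($fh_a);
    my $rec_b = next_record($fh_b);
    while ($rec_a && $rec_b) {
        my $cmp = $rec_a->[0] cmp $rec_b->[0];
        if ($cmp < 0) {
            $rec_a = next_record($fh_a);  # file A is behind, advance it
        } elsif ($cmp > 0) {
            $rec_b = next_record($fh_b);  # file B is behind, advance it
        } else {
            $on_match->($rec_a->[0], $rec_a->[1], $rec_b->[1]);
            $rec_a = next_record($fh_a);
            $rec_b = next_record($fh_b);
        }
    }
    return;
}
```

You would call it with two open filehandles and a callback, for example match_sorted($fh_a, $fh_b, sub { print join(",", @_), "\n" }). If the files are sorted numerically rather than as strings, the cmp would need to become <=>.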

One method that might be useful (hard to tell from your description) is to merge all of the files into one (using sort on Unix or sort -m if they are already sorted). Then all the keys that match will be together in the file and it is simple to write a program to process them. You may need to pre-process the files to make them suitable for this method.

Updated: Yes, now that I see your data, it looks like this method would be appropriate. You can run sort -t, -k 1n,2 -k 4,5 file1 file2 file3 ... > sorted. Then it's a simple matter to process the sorted file:

#!perl -w
use strict;
my @sum = ();
my $prevkey ="";
while (<>) {
  chomp;
  my @data = split /,/, $_;
  next if $data[0] == 1; # skip headers

  my $key = join(",", @data[0, 1, 3, 4]);
  if ($key eq $prevkey) {
    for (0 .. $#data - 5) {
      $sum[$_] += $data[$_ + 5];
    }
  } else {
    dumpsums();
    $prevkey = $key;
    @sum = @data[5 .. $#data];
  }
}
dumpsums();

sub dumpsums {
  if ($prevkey) {
    print "$prevkey,", join(",", @sum), "\n";
  }
}