in reply to Count in intervals

Actually, if FILE1 were truly huge, say a billion lines, a linear pass through it is not only the right thing but probably the only way you could really do it.

What matters is how much you have to keep in memory at any given time: how many different ranges you have to compute counts for, which comes down to how big FILE2 is.

Then again, 100MB of memory is not that big a deal these days, so even in the worst case, where every line of FILE1 ends up in a different interval, it's still doable to keep all of those counts in a single hash.

Here's my first stab (note: not tested or anything). Assumes the intervals don't overlap, but if they do you'll find out:

    use warnings; use strict;

    my %intervals;   # prefix => list of [lower, upper) pairs, kept sorted by lower
    my %counts;

    # Pass 1: read the intervals from FILE2, inserting each one into its
    # prefix's sorted list and dying if it overlaps a neighbour.
    open my $f2, '<', 'FILE2' or die "FILE2: $!";
    while (<$f2>) {
        chomp;
        # fields separated by whitespace and/or dashes, e.g. "chr1 100-200"
        my ($pre, $lower, $upper) = split /[\s-]+/;
        my $intvs = $intervals{$pre} //= [];
        my $i = grep { $_->[0] <= $lower } @$intvs;   # insertion point
        die "overlap found: $pre $intvs->[$i-1][0]..$intvs->[$i-1][1] vs. $lower..$upper"
            if $i > 0 && $intvs->[$i-1][1] > $lower;
        die "overlap found: $pre $lower..$upper vs. $intvs->[$i][0]..$intvs->[$i][1]"
            if $i < @$intvs && $upper > $intvs->[$i][0];
        splice @$intvs, $i, 0, [$lower, $upper];
    }
    close $f2;

    # Pass 2: stream FILE1 once, adding each line's count to the interval
    # that contains its position.
    open my $f1, '<', 'FILE1' or die "FILE1: $!";
    while (<$f1>) {
        chomp;
        my ($pre, $n, $r) = split /[\s-]+/;
        my $intvs = $intervals{$pre} or next;   # unknown prefix
        my ($intv) = grep { $_->[0] <= $n && $n < $_->[1] } @$intvs;
        next unless $intv;                      # falls outside every interval
        $counts{"$pre $intv->[0]"} += $r;
    }
    close $f1;

    print "$_ $counts{$_}\n" for keys %counts;
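For concreteness, here's a tiny made-up run (my own illustration, assuming the whitespace-or-dash field format the split above expects; the output order is whatever the hash happens to give you):

    FILE2:
    chr1 100-200
    chr1 200-300

    FILE1:
    chr1 150 2
    chr1 250 1
    chr1 180 3

    output:
    chr1 100 5
    chr1 200 1

If FILE2 grows big enough that the linear grep per FILE1 line starts to hurt, note that the per-prefix interval lists are already kept sorted, so that grep could be swapped for a binary search.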

Re^2: Count in intervals
by no_slogan (Deacon) on Oct 01, 2014 at 06:23 UTC
    And if both files are really huge, you use a modified merge-sort-type algorithm. (The last pass does the interval counting instead of just sorting items.)
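
    A rough sketch of that merge pass (untested, and one interpretation rather than no_slogan's actual code): it assumes the inputs have already been externally sorted into hypothetical files FILE1.sorted (value lines "n r", sorted by n) and FILE2.sorted (non-overlapping interval lines "lower upper", sorted by lower), and that there's a single prefix; with prefixes you'd sort by prefix first and reset the interval stream at each prefix change.

        use warnings; use strict;

        open my $vals,  '<', 'FILE1.sorted' or die "FILE1.sorted: $!";
        open my $intvs, '<', 'FILE2.sorted' or die "FILE2.sorted: $!";

        # the current interval and its running total
        my ($lower, $upper, $sum);
        my $next_interval = sub {
            defined(my $line = <$intvs>) or return 0;
            ($lower, $upper) = split ' ', $line;
            $sum = 0;
            return 1;
        };

        my $have = $next_interval->();
        while (defined(my $line = <$vals>)) {
            last unless $have;               # no intervals left to fill
            my ($n, $r) = split ' ', $line;
            next if $n < $lower;             # below the current interval
            while ($have && $n >= $upper) {  # current interval done: emit, advance
                print "$lower $sum\n";
                $have = $next_interval->();
            }
            $sum += $r if $have && $n >= $lower;  # n may sit in a gap
        }
        print "$lower $sum\n" if $have;      # flush the interval in progress

    Since both inputs are read strictly front to back, memory use stays constant no matter how big the files are.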