Actually, if FILE1 were truly huge, say a billion lines, a linear pass through it is not only the right thing but probably the only way you could really do it.
What matters is how much you have to keep in memory at any given time, i.e., how many different ranges you have to compute counts for, which comes down to: how big is FILE2?
Then again, 100M of memory is not that big a deal these days, so even in the worst case, where every line of FILE1 ends up in a different interval, it's still doable to keep all of those counts in a single hash.
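(For scale, and this is rough guesswork about Perl's per-entry hash overhead, call it a hundred bytes or so for short keys: a million distinct intervals comes to about 100M, so FILE2 can get pretty big before memory hurts.)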
Here's my first stab (note: not tested or anything). It assumes the intervals don't overlap, but if they do, you'll find out:
use warnings;
use strict;

our %intervals = ();   # prefix => arrayref of [lower, upper], kept sorted by lower bound
our %counts    = ();

open my $f2, '<', 'FILE2' or die "FILE2: $!";
while (<$f2>) {
    chomp;
    my ($pre, $lower, $upper) = split /\s+/;   # whitespace-separated columns
    my $intvs = $intervals{$pre} ||= [];       # autovivify so a new prefix works
    # position where the new interval belongs: number of lower bounds <= $lower
    my $i = scalar( grep { $_->[0] <= $lower } @$intvs );
    # the previous interval must end at or before the new one starts
    die "overlap found: $pre $intvs->[$i-1]->[0]..$intvs->[$i-1]->[1] vs. $lower..$upper"
        if $i > 0 && !( $intvs->[$i-1]->[1] <= $lower );
    # and the new one must end at or before the next one starts
    die "overlap found: $pre $lower..$upper vs. $intvs->[$i]->[0]..$intvs->[$i]->[1]"
        if $i < @$intvs && !( $upper <= $intvs->[$i]->[0] );
    splice @$intvs, $i, 0, [ $lower, $upper ];
}
close $f2;

open my $f1, '<', 'FILE1' or die "FILE1: $!";
while (<$f1>) {
    chomp;
    my ($pre, $n, $r) = split /\s+/;
    my ($intv) = grep { $_->[0] <= $n && $n < $_->[1] } @{ $intervals{$pre} || [] };
    next unless $intv;   # $n fell outside every interval for this prefix
    $counts{"$pre $intv->[0]"} += $r;
}
close $f1;

for (sort keys %counts) { print "$_ $counts{$_}\n"; }
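For concreteness, the splits above assume plain whitespace-separated columns; that's my reading of the format (the sample names below are made up), so adjust the regexes if the real files differ. Given

FILE2 (prefix, lower bound, upper bound):

    chr1 100 200
    chr1 200 300
    chr2 50 150

and FILE1 (prefix, position, count to add):

    chr1 150 3
    chr1 170 2
    chr1 250 1
    chr2 60 5

it should print one line per interval, keyed by prefix and lower bound:

    chr1 100 5
    chr1 200 1
    chr2 50 5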
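One more thought: the grep in the FILE1 loop is itself a linear scan of that prefix's intervals, so the whole thing is O(FILE1 lines x intervals per prefix). Since @$intvs is kept sorted, a binary search gets each lookup down to O(log n). A sketch, every bit as untested as the above:

    sub find_interval {
        my ($intvs, $n) = @_;
        my ($lo, $hi) = (0, scalar @$intvs);
        while ($lo < $hi) {                   # count lower bounds <= $n
            my $mid = int( ($lo + $hi) / 2 );
            if ($intvs->[$mid]->[0] <= $n) { $lo = $mid + 1 }
            else                           { $hi = $mid }
        }
        return undef unless $lo;              # $n is below every interval
        my $intv = $intvs->[$lo - 1];         # last interval starting at or before $n
        return $n < $intv->[1] ? $intv : undef;
    }

and the lookup line in the FILE1 loop becomes

    my $intv = find_interval( $intervals{$pre} || [], $n );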