Actually, if FILE1 were truly huge, say a billion lines, a linear pass through it is not only the right thing but probably the only way you could really do it.
What matters is how much you have to keep in memory at any given time, i.e., how many different ranges you have to compute counts for, which comes down to: how big is FILE2?
Then again, 100M of memory is not that big a deal these days, so even in the worst case, where every line of FILE1 ends up in a different interval, it's still doable to keep all of those counts in a single hash.
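(For scale, and this is rough guesswork about Perl's per-entry hash overhead, call it a hundred bytes or so for short keys: a million distinct intervals comes to about 100M, so FILE2 can get pretty big before memory hurts.)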
Here's my first stab (note: not tested or anything). It assumes the intervals don't overlap, but if they do, you'll find out:
use warnings;
use strict;

our %intervals = ();   # prefix => arrayref of [lower, upper], kept sorted by lower bound
our %counts    = ();

open my $f2, '<', 'FILE2' or die "FILE2: $!";
while (<$f2>) {
    chomp;
    my ($pre, $lower, $upper) = split /\s+/;   # whitespace-separated columns
    my $intvs = $intervals{$pre} ||= [];       # autovivify so a new prefix works
    # position where the new interval belongs: number of lower bounds <= $lower
    my $i = scalar( grep { $_->[0] <= $lower } @$intvs );
    # the previous interval must end at or before the new one starts
    die "overlap found: $pre $intvs->[$i-1]->[0]..$intvs->[$i-1]->[1] vs. $lower..$upper"
        if $i > 0 && !( $intvs->[$i-1]->[1] <= $lower );
    # and the new one must end at or before the next one starts
    die "overlap found: $pre $lower..$upper vs. $intvs->[$i]->[0]..$intvs->[$i]->[1]"
        if $i < @$intvs && !( $upper <= $intvs->[$i]->[0] );
    splice @$intvs, $i, 0, [ $lower, $upper ];
}
close $f2;

open my $f1, '<', 'FILE1' or die "FILE1: $!";
while (<$f1>) {
    chomp;
    my ($pre, $n, $r) = split /\s+/;
    my ($intv) = grep { $_->[0] <= $n && $n < $_->[1] } @{ $intervals{$pre} || [] };
    next unless $intv;   # $n fell outside every interval for this prefix
    $counts{"$pre $intv->[0]"} += $r;
}
close $f1;

for (sort keys %counts) { print "$_ $counts{$_}\n"; }
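For concreteness, the splits above assume plain whitespace-separated columns; that's my reading of the format (the sample names below are made up), so adjust the regexes if the real files differ. Given

FILE2 (prefix, lower bound, upper bound):

    chr1 100 200
    chr1 200 300
    chr2 50 150

and FILE1 (prefix, position, count to add):

    chr1 150 3
    chr1 170 2
    chr1 250 1
    chr2 60 5

it should print one line per interval, keyed by prefix and lower bound:

    chr1 100 5
    chr1 200 1
    chr2 50 5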
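One more thought: the grep in the FILE1 loop is itself a linear scan of that prefix's intervals, so the whole thing is O(FILE1 lines x intervals per prefix). Since @$intvs is kept sorted, a binary search gets each lookup down to O(log n). A sketch, every bit as untested as the above:

    sub find_interval {
        my ($intvs, $n) = @_;
        my ($lo, $hi) = (0, scalar @$intvs);
        while ($lo < $hi) {                   # count lower bounds <= $n
            my $mid = int( ($lo + $hi) / 2 );
            if ($intvs->[$mid]->[0] <= $n) { $lo = $mid + 1 }
            else                           { $hi = $mid }
        }
        return undef unless $lo;              # $n is below every interval
        my $intv = $intvs->[$lo - 1];         # last interval starting at or before $n
        return $n < $intv->[1] ? $intv : undef;
    }

and the lookup line in the FILE1 loop becomes

    my $intv = find_interval( $intervals{$pre} || [], $n );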