in reply to Reducing memory footprint when doing a lookup of millions of coordinates

You could turn the problem inside out: load the test values into memory, then scan the large reference file one line at a time to perform the matching:

    #!/usr/bin/perl
    use strict;

    # Sample reference data. In practice this would be the large
    # reference file, streamed one line at a time.
    my $reps = <<REPS;
    chr1 100 120 feature1
    chr1 200 250 feature2
    chr2 150 200 feature1
    chr2 280 350 feature1
    chr3 100 150 feature2
    chr3 300 450 feature2
    REPS

    # Load the (small) set of test coordinates into memory,
    # keyed by chromosome, then by start position.
    my %tests;
    while (my $line = <DATA>) {
        $line =~ s/[\n\r]//g;
        my @array = split /\s+/, $line;
        $tests{$array[0]}{$array[1]}{'end'} = $array[2];
        $tests{$array[0]}{$array[1]}{'rep'} = $array[3];
    }

    # Scan the reference data line by line, printing each reference
    # interval that overlaps a test interval on the same chromosome.
    open my $repIn, '<', \$reps;

    while (<$repIn>) {
        my ($chr, $start, $end, $rep) = split ' ';
        next if !exists $tests{$chr};

        for my $s (keys %{$tests{$chr}}) {
            if ($start <= $tests{$chr}{$s}{'end'}) {
                last if $s >= $end;
                print "$chr $start $end $rep\n";
            }
        }
    }

    __DATA__
    chr2 160 210
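For readers more comfortable outside Perl, the same inside-out approach can be sketched in Python. This is a minimal illustration, not the poster's code: the sample data mirrors the Perl example above, and a standard interval-overlap test stands in for the loop logic.

```python
from io import StringIO

# Sample reference data, standing in for the large file that
# would be streamed line by line.
reps = """\
chr1 100 120 feature1
chr1 200 250 feature2
chr2 150 200 feature1
chr2 280 350 feature1
chr3 100 150 feature2
chr3 300 450 feature2
"""

# The (small) test coordinates, loaded fully into memory.
tests_lines = ["chr2 160 210"]

# Index test intervals by chromosome: {chrom: {start: end}}.
tests = {}
for line in tests_lines:
    chrom, start, end = line.split()[:3]
    tests.setdefault(chrom, {})[int(start)] = int(end)

# Stream the reference data one line at a time and collect every
# reference interval that overlaps a test interval.
matches = []
for line in StringIO(reps):
    chrom, start, end, rep = line.split()
    start, end = int(start), int(end)
    for t_start, t_end in tests.get(chrom, {}).items():
        # Two intervals overlap iff each starts before the other ends.
        if start <= t_end and end >= t_start:
            matches.append((chrom, start, end, rep))

print(matches)  # → [('chr2', 150, 200, 'feature1')]
```

Because only the test set is held in memory and the reference file is read sequentially, memory use scales with the number of test coordinates, not with the size of the reference file.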
True laziness is hard work

Re^2: Reducing memory footprint when doing a lookup of millions of coordinates
by richardwfrancis (Beadle) on Feb 27, 2011 at 12:18 UTC
    Hi GrandFather,

    You won't believe me, but I thought about this after I posted; I haven't tested it yet. If the database idea proves problematic, I think this is the way to go.

    Many thanks for your help and the code to help me out.

    Rich