I have a rather large file (640MB) of coordinates that each show the position (start,end) of a particular feature on a particular chromosome. Something like:
chr1 100 120 feature1
chr1 200 250 feature2
chr2 150 200 feature1
chr2 280 350 feature1
chr3 100 150 feature2
chr3 300 450 feature2
Given another set of coordinates and a chromosome number I want to be able to quickly see if there are any features that the coordinates overlap. For example, given:
chromosome = chr2
start = 160
end = 210
I'd like the result:
chr2 150 200 feature1
Importantly, I need to do a large number of these types of lookups in one go (~1000 or so).
So far I have the following. It works and I'm quite happy with the performance, but it takes up 1.6GB of memory when I read in the full 640MB file, which contains over 5 million lines of coordinates.
If it helps and/or you really want to you can get this file from HERE but you'll have to change the column numbers to pull out the right fields!
#!/usr/bin/perl
use strict;

# Nested hash: chromosome => start => { end, rep }
my %reps = ();
while (my $line = <DATA>) {
    $line =~ s/[\n\r]//g;
    my @array = split(/\s+/, $line);
    $reps{$array[0]}{$array[1]}{'end'} = $array[2];
    $reps{$array[0]}{$array[1]}{'rep'} = $array[3];
}

my $start = 160;
my $end   = 210;
my $chr   = "chr2";

# Walk the features on $chr in ascending order of start position
for my $s (sort { $a <=> $b } keys %{ $reps{$chr} }) {
    if ($start <= $reps{$chr}{$s}{'end'}) {
        last if $s >= $end;    # this feature starts at or after the query end
        print "$chr $s $reps{$chr}{$s}{'end'} $reps{$chr}{$s}{'rep'}\n";
    }
}

__DATA__
chr1 100 120 feature1
chr1 200 250 feature2
chr2 150 200 feature1
chr2 280 350 feature1
chr3 100 150 feature2
chr3 300 450 feature2
My questions are:
a) Is there a better way to do this?
b) Can the memory footprint be reduced any?
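For (b), one direction that might be worth trying is sketched below. It is only a sketch under assumptions, not a measured result: it assumes the file is already sorted by start within each chromosome, that all coordinates fit in 32 bits, and that feature names repeat often enough to be mapped to small integer ids. The names add_feature and overlaps, and the packed layout itself, are made up for illustration. Instead of a nested hash, each chromosome keeps three packed strings indexed with vec(), which costs 12 bytes per feature for the three fields plus a fixed per-string overhead.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical layout (an assumption, not the original data structure):
# per chromosome, three packed strings of 32-bit integers, indexed with vec().
my (%starts, %ends, %ids, %count);
my (%id_of, @name_of);    # feature name <-> small integer id

# Append one feature. Assumes input is sorted by start within each chromosome.
sub add_feature {
    my ($chr, $s, $e, $name) = @_;
    if (!exists $id_of{$name}) {
        push @name_of, $name;
        $id_of{$name} = $#name_of;
    }
    if (!exists $count{$chr}) {
        $count{$chr} = 0;
        $starts{$chr} = $ends{$chr} = $ids{$chr} = '';
    }
    my $i = $count{$chr}++;
    vec($starts{$chr}, $i, 32) = $s;
    vec($ends{$chr},   $i, 32) = $e;
    vec($ids{$chr},    $i, 32) = $id_of{$name};
    return;
}

# Return the features on $chr that overlap the query, using the same test
# as the loop above: feature end >= $start and feature start < $end.
sub overlaps {
    my ($chr, $start, $end) = @_;
    my @hits;
    for my $i (0 .. ($count{$chr} // 0) - 1) {
        my $s = vec($starts{$chr}, $i, 32);
        last if $s >= $end;    # sorted by start, so nothing later can overlap
        my $e = vec($ends{$chr}, $i, 32);
        next if $e < $start;   # feature ends before the query begins
        push @hits, join ' ', $chr, $s, $e, $name_of[ vec($ids{$chr}, $i, 32) ];
    }
    return @hits;
}

# Same sample data as above, inlined as a list.
my @lines = (
    'chr1 100 120 feature1', 'chr1 200 250 feature2',
    'chr2 150 200 feature1', 'chr2 280 350 feature1',
    'chr3 100 150 feature2', 'chr3 300 450 feature2',
);
for my $line (@lines) {
    my ($chr, $s, $e, $name) = split ' ', $line;
    next unless defined $name;    # skip short or empty lines
    add_feature($chr, $s, $e, $name);
}

print "$_\n" for overlaps('chr2', 160, 210);    # prints: chr2 150 200 feature1
```

This also avoids re-sorting the hash keys on every lookup, since the order is fixed at load time. The scan is still linear per chromosome; a binary search would only be safe with further assumptions about how the features overlap each other.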
Many, many thanks for any advice.
Rich
In reply to Reducing memory footprint when doing a lookup of millions of coordinates by richardwfrancis