in reply to Reducing memory footprint when doing a lookup of millions of coordinates

First, a question. Is it possible to have two (or more) features that start at the same position in the same chromosome, but end at different positions?

If so, your current data structure will only record the last one read from the file.
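A minimal sketch of that overwrite, and of one possible workaround, assuming the same four-column input as your script (chromosome, start, end, repeat name). The sample records and the names %last_only and %all are made up for illustration:

```perl
use strict;
use warnings;

# Two hypothetical features on chr2 that share a start (100) but end differently.
my @records = (
    [ 'chr2', 100, 150, 'repA' ],
    [ 'chr2', 100, 180, 'repB' ],
);

my ( %last_only, %all );
for my $r ( @records ) {
    my ( $chr, $start, $end, $rep ) = @$r;

    # Keyed on start alone: the second record silently replaces the first.
    $last_only{ $chr }{ $start } = [ $end, $rep ];

    # Keeping a list per start preserves both, at the cost of an extra array.
    push @{ $all{ $chr }{ $start } }, [ $end, $rep ];
}

print "last only: $last_only{chr2}{100}[1]\n";
print "all: ", join( ' ', map { $_->[1] } @{ $all{chr2}{100} } ), "\n";
```

The extra level of array costs memory, so it is probably only worth it if shared starts can really occur in your data.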

Assuming that's okay, then moving from a hash of hashes of hashes (HoHoH) to a hash of hashes of arrays (HoHoA):

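    #!/usr/bin/perl -slw
    use strict;

    # Indexes into the [ end, repeat ] pair stored for each start position.
    use constant { END => 0, REP => 1 };

    # Build %reps: chromosome => start => [ end, repeat ].
    my %reps;
    while( <> ) {
        chomp;
        my @array = split;
        $reps{ $array[0] }{ $array[1] } = [ $array[2], $array[3] ];
    }

    my $start = 160;
    my $end   = 210;
    my $chr   = "chr2";

    # Walk the starts in ascending order and print every feature that overlaps the query range.
    for my $s ( sort { $a <=> $b } keys %{ $reps{ $chr } } ) {
        if( $start <= $reps{ $chr }{ $s }[ END ] ) {
            last if $s >= $end;
            print "$chr $s $reps{ $chr }{ $s }[ END ] $reps{ $chr }{ $s }[ REP ]\n";
        }
    }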

Will likely save you ~25% of your memory usage and run a little faster.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^2: Reducing memory footprint when doing a lookup of millions of coordinates
by richardwfrancis (Beadle) on Feb 27, 2011 at 12:23 UTC
    Cheers BrowserUK,

    I'll definitely give that a go.

    Out of interest is there an advantage to using the constant for END and REP rather than 0 and 1? I've not used that before.

    Many thanks for your help

    Rich
      is there an advantage to using the constant for END and REP rather than 0 and 1?

      Beyond a little extra clarity, no. I did it to make the two versions visibly comparable.
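A small sketch of that equivalence. The constant names END_IDX and REP_IDX and the sample pair are made up here; the names differ from the script's END and REP only to keep the sketch clear of Perl's special END block:

```perl
use strict;
use warnings;
use constant { END_IDX => 0, REP_IDX => 1 };

my $feature = [ 210, 'AluY' ];    # hypothetical [ end, repeat name ] pair

# Both forms read the same element; the constant only documents intent.
my $end_by_name  = $feature->[ END_IDX ];
my $end_by_index = $feature->[ 0 ];

print "same element\n" if $end_by_name == $end_by_index;
```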

      There might be some extra memory savings to be had if you could give a clearer idea of the numbers involved.

      I.e., how many chromosomes? What are the approximate maximum lengths of both the chromosomes and the ranges?
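One direction such savings could take, sketched purely as an illustration and not as anything worked out in this thread: if every coordinate fits in 32 bits, the starts and ends for a chromosome could be held in two packed strings, sorted by start, instead of a hash. The sample features and the helper names at() and first_start_at_or_after() are hypothetical, and the repeat names are left out for brevity:

```perl
use strict;
use warnings;

# Hypothetical records for one chromosome, already sorted by start: [ start, end ].
my @features = ( [ 100, 150 ], [ 155, 180 ], [ 200, 260 ], [ 300, 320 ] );

# Pack each column into one string of 32-bit unsigned integers ('N'),
# so each stored coordinate takes 4 bytes of string data.
my $starts = pack 'N*', map { $_->[0] } @features;
my $ends   = pack 'N*', map { $_->[1] } @features;

# Read the $i-th 32-bit value back out of a packed string.
sub at {
    my ( $packed, $i ) = @_;
    return unpack 'N', substr( $packed, $i * 4, 4 );
}

# Binary search: index of the first feature whose start is >= $pos.
sub first_start_at_or_after {
    my ( $pos ) = @_;
    my ( $lo, $hi ) = ( 0, length( $starts ) / 4 );
    while ( $lo < $hi ) {
        my $mid = int( ( $lo + $hi ) / 2 );
        if ( at( $starts, $mid ) < $pos ) { $lo = $mid + 1 }
        else                              { $hi = $mid }
    }
    return $lo;
}

# Same test as the hash version: start before the query end, end at or after the query start.
my ( $qstart, $qend ) = ( 160, 210 );
my @hits;
my $limit = first_start_at_or_after( $qend );
for my $i ( 0 .. $limit - 1 ) {
    next if at( $ends, $i ) < $qstart;
    push @hits, [ at( $starts, $i ), at( $ends, $i ) ];
}

print "@$_\n" for @hits;
```

Whether this beats the hash in practice would depend on the real numbers, which is why they matter.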


        Cool. That makes sense. Cheers

        The file has 24 chromosomes I'm interested in, with the total number of records (ranges) per chromosome as follows:

        chr1   2235512
        chr2    674652
        chr3    348269
        chr4    323500
        chr5    308100
        chr6    338158
        chr7    280734
        chr8    253229
        chr9    224412
        chr10   237524
        chr11   240186
        chr12   250300
        chr13   161894
        chr14   160126
        chr15   152561
        chr16   170145
        chr17   167623
        chr18   126566
        chr19   134123
        chr20   123693
        chr21    61077
        chr22    75260
        chrX    265561
        chrY     43169

        The lengths of the ranges are usually between about 10 and 300.

        Does this help?

Re^2: Reducing memory footprint when doing a lookup of millions of coordinates
by richardwfrancis (Beadle) on Feb 27, 2011 at 12:26 UTC

    By the way, in answer to your question: in this case that shouldn't happen, but given the size of the data you're right that it's better to be safe than sorry.

    Rich