Re: Reducing memory footprint when doing a lookup of millions of coordinates
by GrandFather (Saint) on Feb 27, 2011 at 10:29 UTC
You could turn the problem inside out: load the test values into memory then scan the large reference file one line at a time to perform the matching:
#!/usr/bin/perl
use strict;
my $reps = <<REPS;
chr1 100 120 feature1
chr1 200 250 feature2
chr2 150 200 feature1
chr2 280 350 feature1
chr3 100 150 feature2
chr3 300 450 feature2
REPS
my %tests;
# Load the (smaller) test coordinates into memory, keyed by chromosome and start.
while (my $line = <DATA>) {
    $line =~ s/[\n\r]//g;
    my @array = split /\s+/, $line;
    $tests{$array[0]}{$array[1]}{'end'} = $array[2];
    $tests{$array[0]}{$array[1]}{'rep'} = $array[3];
}

# Scan the large reference data one line at a time and report overlaps.
open my $repIn, '<', \$reps;
while (<$repIn>) {
    my ($chr, $start, $end, $rep) = split ' ';
    next if !exists $tests{$chr};
    # Numeric sort so the early exit below is safe.
    for my $s (sort { $a <=> $b } keys %{$tests{$chr}}) {
        if ($start <= $tests{$chr}{$s}{'end'}) {
            last if $s >= $end;
            print "$chr $start $end $rep\n";
        }
    }
}
__DATA__
chr2 160 210
True laziness is hard work
Hi GrandFather,
You won't believe me, but I thought about this after I posted; I just haven't tested it yet! If the database idea proves problematic, I think this is the way to go.
Many thanks for your help and the code to help me out.
Rich
Re: Reducing memory footprint when doing a lookup of millions of coordinates
by moritz (Cardinal) on Feb 27, 2011 at 10:16 UTC
One improvement that should be straightforward to implement is to group everything by chromosome first.
So, for example, read only those records from the file that relate to chr1. Build your %reps hash from them. Then search for all those coordinates on chr1 that you find interesting.
Then empty the hash with %reps = () to save memory, and read all chr2 coordinates.
Other possible improvements: instead of using a hash for the coordinates, use two arrays (for begin and end), and then do a binary search in them.
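For illustration, a minimal sketch of that per-chromosome binary search (the @starts/@ends arrays and the upper_bound helper are made up for this example, not something from the thread; the data is the chr2 subset of GrandFather's sample above):
use strict;
use warnings;

# Features for one chromosome, as two parallel arrays sorted by start.
my @starts = ( 150, 280 );
my @ends   = ( 200, 350 );

# Binary search: index of the first feature whose start is > $qend.
sub upper_bound {
    my ( $starts, $qend ) = @_;
    my ( $lo, $hi ) = ( 0, scalar @$starts );
    while ( $lo < $hi ) {
        my $mid = int( ( $lo + $hi ) / 2 );
        if ( $starts->[$mid] <= $qend ) { $lo = $mid + 1 }
        else                            { $hi = $mid }
    }
    return $lo;
}

# Report every feature overlapping the query range 160..210.
my ( $qstart, $qend ) = ( 160, 210 );
my $limit = upper_bound( \@starts, $qend );
for my $i ( 0 .. $limit - 1 ) {
    # Overlap: starts before the query ends (guaranteed by the search)
    # and ends at or after the query start.
    print "$starts[$i] $ends[$i]\n" if $ends[$i] >= $qstart;
}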
Of course a relational database like SQLite or Postgres can do all these things for you if you build appropriate indexes over the columns.
Re: Reducing memory footprint when doing a lookup of millions of coordinates
by BrowserUk (Patriarch) on Feb 27, 2011 at 10:31 UTC
First, a question. Is it possible to have two (or more) features that start at the same position in the same chromosome, but end at different positions?
If so, your current data structure will only record the last one read from the file.
Assuming that's okay, then moving from using a HoHoHs to a HoHoAs:
#!/usr/bin/perl -slw
use strict;
use constant { END => 0, REP => 1 };
my %reps;
while (<>) {
    chomp;
    my @array = split;
    $reps{ $array[0] }{ $array[1] } = [ $array[2], $array[3] ];
}
my $start = 160;
my $end   = 210;
my $chr   = "chr2";
for my $s ( sort { $a <=> $b } keys %{ $reps{ $chr } } ) {
    if ( $start <= $reps{ $chr }{ $s }[ END ] ) {
        last if $s >= $end;
        print "$chr $s $reps{ $chr }{ $s }[ END ] $reps{ $chr }{ $s }[ REP ]\n";
    }
}
Will likely save you ~25% of your memory usage and run a little faster.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Cheers BrowserUK,
I'll definitely give that a go.
Out of interest, is there an advantage to using the constants END and REP rather than 0 and 1? I've not used that before.
Many thanks for your help
Rich
is there an advantage to using the constant for END and REP rather than 0 and 1?
Beyond a little extra clarity, no. I did it to make the two versions visibly comparable.
There might be some extra memory savings to be had if you could give a clearer idea of the numbers involved.
Ie. How many chromosomes? Approximate maximum lengths of both the chromosomes and the ranges?
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Reducing memory footprint when doing a lookup of millions of coordinates
by mellon85 (Monk) on Feb 27, 2011 at 10:24 UTC
Have you considered using a database instead of loading it all into memory? It may not be as fast as it is now (except for the loading time), but the memory footprint IMHO can be heavily reduced.
Hi mellon85 and moritz,
Thank you for your replies.
I definitely like the idea of the database. I think this will work very well in the context of the rest of the program I'm writing.
I'm thinking I could distribute the SQLite database file, preloaded with the feature data and indexed, with the software and query it when needed!
Many thanks,
Rich
The database is a really good idea. I doubt that SQLite keeps the whole database file in memory itself. You could even put the file on a RAM disk or similar as a speed bonus.
You can still use the same code (with minor modifications) to do the lookups, as sketched after this list:
- Get a unique list of all chr.
- For each chr, retrieve all entries (using retrieve_hashref).
- Use the existing code (but work on hashrefs).
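A rough sketch of what such a lookup might look like with DBI and SQLite (the database file name, the features table, and its column names are assumptions made for this example, not anything agreed in the thread):
use strict;
use warnings;
use DBI;

# Assumed schema: features(chr TEXT, startpos INTEGER, endpos INTEGER, rep TEXT)
# with an index on (chr, startpos).
my $dbh = DBI->connect( 'dbi:SQLite:dbname=features.db', '', '',
    { RaiseError => 1 } );

my $sth = $dbh->prepare(
    'SELECT chr, startpos, endpos, rep FROM features
      WHERE chr = ? AND startpos <= ? AND endpos >= ?'
);

# Everything on chr2 overlapping 160..210.
$sth->execute( 'chr2', 210, 160 );
while ( my $row = $sth->fetchrow_hashref ) {
    print "$row->{chr} $row->{startpos} $row->{endpos} $row->{rep}\n";
}
$dbh->disconnect;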
Re: Reducing memory footprint when doing a lookup of millions of coordinates
by JavaFan (Canon) on Feb 27, 2011 at 14:39 UTC
a) Is there a better way to do this?
b) Can the memory footprint be reduced any?
Two very different questions. To answer the second question first, an easy way to reduce the memory footprint is, for each query, to read the file line by line and report whether it overlaps. I bet you are now saying "but that's too slow". With many problems, there's a trade-off between memory usage and processing time. Reduce the memory usage, and the processing time goes up. Just asking for "reduce the memory usage" without saying anything about processing time may not get the answer you are looking for.
As for your first question, it depends. What is "better" in your opinion? *My* opinion of better is to reduce query time and invest in memory and preprocessing time: build a segment or interval tree and do queries against that. But that will increase your memory usage, so it's probably not better for you.
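As an illustration of the interval-tree route, here is a sketch using the CPAN module Set::IntervalTree (its insert/fetch interface as I recall it; check the module's documentation, in particular whether intervals are treated as half-open, which is assumed below):
use strict;
use warnings;
use Set::IntervalTree;    # CPAN module, interface assumed from its docs

my %trees;    # one interval tree per chromosome

# Build the trees from the reference features once.
while ( my $line = <DATA> ) {
    my ( $chr, $start, $end, $rep ) = split ' ', $line;
    $trees{$chr} ||= Set::IntervalTree->new;
    # Assuming half-open intervals, so add 1 to the inclusive end.
    $trees{$chr}->insert( "$start $end $rep", $start, $end + 1 );
}

# Query: everything on chr2 overlapping 160..210 (inclusive).
my $hits = $trees{chr2} ? $trees{chr2}->fetch( 160, 211 ) : [];
print "chr2 $_\n" for @$hits;

__DATA__
chr2 150 200 feature1
chr2 280 350 feature1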
Re: Reducing memory footprint when doing a lookup of millions of coordinates
by umasuresh (Hermit) on Feb 27, 2011 at 14:10 UTC
|
Hi Rich,
Take a look at my experience with developing an algorithm to deal with something similar Re: Algorithms.
HTH!
Uma
UPDATE :
In this task, I did exactly what moritz suggests: I grouped the search chromosome-wise and submitted 24 jobs to a cluster!
Re: Reducing memory footprint when doing a lookup of millions of coordinates
by nikosv (Deacon) on Feb 27, 2011 at 16:29 UTC
Maybe this is a case for Memoization.
In computing, memoization is an optimization technique used primarily to speed up computer programs by having function calls avoid repeating the calculation of results for previously processed inputs.
The Memoize module makes it easy to use by wrapping around your function.
It might increase the memory footprint but make the processing faster, so it is better to benchmark it, and if you do use it after all I would be interested in hearing the test results.
Memoization is not suitable for your project; I read your question hastily and misinterpreted it.
Be aware that memoisation will not help your problem at all.
Memoisation speeds up repetitive calculations by caching the results of those calculations to avoid recalculating them. It uses (often prodigious amounts of) memory to gain speed.
Since your stated goal is to reduce memory usage, and since you have no repetitive calculations, any attempt to use the linked module could not help your performance and would increase your memory footprint, probably to the point of out-of-memory failure.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Reducing memory footprint when doing a lookup of millions of coordinates
by tospo (Hermit) on Feb 28, 2011 at 09:50 UTC
There are quite a few database solutions for this problem. SQL itself isn't that great for range queries, but there are binning schemes that basically extend the idea of doing the query by chromosome and narrow the query down to a region. Check out Bio::DB::GFF for example. The underlying binning scheme is described in this 2002 paper by Lincoln Stein et al.
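As a rough illustration of the binning idea (a simplified single-level scheme, not the hierarchical one from the paper; the bin width and the helper names are made up for this sketch):
use strict;
use warnings;

my $BIN = 100_000;    # arbitrary bin width for this sketch
my %index;            # $index{$chr}{$bin} = [ [ start, end, rep ], ... ]

# Index each feature under every bin its range touches.
sub add_feature {
    my ( $chr, $start, $end, $rep ) = @_;
    for my $bin ( int( $start / $BIN ) .. int( $end / $BIN ) ) {
        push @{ $index{$chr}{$bin} }, [ $start, $end, $rep ];
    }
}

# A query only has to look at the bins its range touches,
# not the whole chromosome.
sub overlaps {
    my ( $chr, $qstart, $qend ) = @_;
    my ( @hits, %seen );
    for my $bin ( int( $qstart / $BIN ) .. int( $qend / $BIN ) ) {
        for my $f ( @{ $index{$chr}{$bin} || [] } ) {
            my ( $start, $end, $rep ) = @$f;
            next if $end < $qstart or $start > $qend;    # no overlap
            push @hits, $f unless $seen{"$start,$end,$rep"}++;
        }
    }
    return @hits;
}

add_feature( 'chr2', 150, 200, 'feature1' );
add_feature( 'chr2', 280, 350, 'feature1' );
print "@$_\n" for overlaps( 'chr2', 160, 210 );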
Postgres may not be as portable as SQLite, but Postgres has geometric data types and functions, and if you treat begin and end as a box of no height on the x axis, Postgres's box operators will work. For example, a record with Begin=100 and End=200 becomes the box ((100,0),(200,0)).
http://www.postgresql.org/docs/8.2/static/functions-geometry.html
Your sample data of 5301598 records took me an hour to load into a default installation of Postgres, but it gives very quick responses.
###SQL###
drop table chromosomes;
create table chromosomes(
score integer
,div numeric(6,1)
,del numeric(6,1)
,ins numeric(6,1)
,q_sequence text
--,begin1 integer
--,end1 integer
,pos box
,left_paren text
,repeat1 text
,repeat2 text
,class_family text
,begin2 text
,end2 text
,left_paren2 text
,ID integer
);
create index idx_chromosomes_1 on chromosomes(q_sequence);
create index idx_chromosomes_2 on chromosomes using gist(pos);
###PERL###
use strict;
use warnings;
use DBI;
my($dbh);
system('date');
$dbh = DBI->connect('dbi:Pg:dbname=test1','','',{AutoCommit=>0})
or die $dbh->errstr;
#populate($dbh);
select_test($dbh);
$dbh->disconnect;
system('date');
exit 0;
sub populate{
    my($dbh)=@_;
    my($sth,$sql,$cnt);
    $cnt=0;
    $sql="insert into chromosomes values (?,?,?,?,?,?,?,?,?,?,?,?,?,?)";
    $sth=$dbh->prepare($sql) or die $dbh->errstr;
    eval{
        open my $fh , "<", "hg19.fa.out" or die $!;
        while(my $line=<$fh>){
            $cnt++;
            next if $cnt <= 3;
            chomp $line;
            $line =~ s/^\s+//;
            my @vals=split(/\s+/, $line);
            #begin and end to box type
            $vals[5]="(($vals[5],0),($vals[6],0))";
            splice(@vals,6,1);
            $sth->execute(@vals);
        }
        close $fh;
        $sth->finish;
    };
    if($@){
        $dbh->rollback;
    }else {
        $dbh->commit;
    }
}
sub select_test{
    my($dbh)=@_;
    my($sth,$sql);
    #SQL for overlaps
    #$sql="select * from chromosomes where pos && box '((10000,0),(20000,0))'";
    #SQL for include
    $sql="select * from chromosomes where pos @ box '((10000,0),(20000,0))'";
    $sth=$dbh->prepare($sql) or die $dbh->errstr;
    $sth->execute or die $dbh->errstr;
    while(my $href= $sth->fetchrow_hashref){
        print $href->{pos} . "\n";
    }
    $sth->finish;
}
Interesting, I wasn't aware of these (I'm not using PostgreSQL very much).
Re: Reducing memory footprint when doing a lookup of millions of coordinates
by BrowserUk (Patriarch) on Mar 01, 2011 at 03:24 UTC
Re: Reducing memory footprint when doing a lookup of millions of coordinates
by repellent (Priest) on Mar 04, 2011 at 06:59 UTC
On my measly VM (256MB memory) running PostgreSQL 8.2 on FreeBSD 6.4, it took less than 8 minutes to load the entire file into a table. Then, given:
chromosome = chr2
starting = 21000
ending = 23200
returns the overlaps in less than half a second.
$ time perl -alne 'print join q{,} => @F[4,5,6,9] if $. > 3' hg19.fa.out | psql -qc "
CREATE TABLE tt (chromosome text, starting integer, ending integer, feature text);
COPY tt (chromosome, starting, ending, feature) FROM STDIN WITH CSV;
CREATE INDEX idx_tt_chromosome ON tt (chromosome);
"
perl -alne 'print join q{,} => @F[4,5,6,9] if $. > 3' hg19.fa.out  67.31s user 82.26s system 71% cpu 3:29.33 total
psql -qc  0.27s user 5.68s system 2% cpu 4:15.92 total
pgsql=> EXPLAIN ANALYZE
pgsql-> SELECT *
pgsql-> FROM
pgsql-> tt
pgsql-> WHERE
pgsql-> chromosome = 'chr2'
pgsql-> AND 21000 <= ending
pgsql-> AND starting <= 23200;
                                                            QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on tt  (cost=516.43..38178.20 rows=2945 width=72) (actual time=129.490..341.722 rows=2 loops=1)
   Recheck Cond: (chromosome = 'chr2'::text)
   Filter: ((21000 <= ending) AND (starting <= 23200))
   ->  Bitmap Index Scan on idx_tt_chromosome  (cost=0.00..515.69 rows=26508 width=0) (actual time=128.165..128.165 rows=414622 loops=1)
         Index Cond: (chromosome = 'chr2'::text)
 Total runtime: 341.955 ms
pgsql=> SELECT *
pgsql-> FROM
pgsql-> tt
pgsql-> WHERE
pgsql-> chromosome = 'chr2'
pgsql-> AND 21000 <= ending
pgsql-> AND starting <= 23200;
chromosome | starting | ending | feature
------------+----------+--------+---------
chr2 | 21258 | 21370 | MIRb
chr2 | 22095 | 23394 | L1PA14
(2 rows)