Re: Reducing memory footprint when doing a lookup of millions of coordinates
by GrandFather (Saint) on Feb 27, 2011 at 10:29 UTC
You could turn the problem inside out: load the test values into memory then scan the large reference file one line at a time to perform the matching:
#!/usr/bin/perl
use strict;
my $reps = <<REPS;
chr1 100 120 feature1
chr1 200 250 feature2
chr2 150 200 feature1
chr2 280 350 feature1
chr3 100 150 feature2
chr3 300 450 feature2
REPS
my %tests;
# Load the (smaller) test coordinates into memory, keyed by chromosome and start.
while (my $line = <DATA>) {
    $line =~ s/[\n\r]//g;
    my @array = split /\s+/, $line;
    $tests{$array[0]}{$array[1]}{'end'} = $array[2];
    $tests{$array[0]}{$array[1]}{'rep'} = $array[3];
}

# Scan the large reference data one line at a time and report overlaps.
open my $repIn, '<', \$reps;
while (<$repIn>) {
    my ($chr, $start, $end, $rep) = split ' ';
    next if !exists $tests{$chr};
    # Numeric sort so the early exit below is safe.
    for my $s (sort { $a <=> $b } keys %{$tests{$chr}}) {
        if ($start <= $tests{$chr}{$s}{'end'}) {
            last if $s >= $end;
            print "$chr $start $end $rep\n";
        }
    }
}
__DATA__
chr2 160 210
True laziness is hard work
Hi GrandFather,
You won't believe me, but I thought about this after I posted; I just haven't tested it yet! If the database idea proves problematic, I think this is the way to go.
Many thanks for your help and the code to help me out.
Rich
Re: Reducing memory footprint when doing a lookup of millions of coordinates
by moritz (Cardinal) on Feb 27, 2011 at 10:16 UTC
One improvement that should be straightforward to implement is to group everything by chromosome first.
So, for example, read only those records from the file that relate to chr1. Build your %reps hash from them. Then search for all those coordinates on chr1 that you find interesting.
Then empty the hash with %reps = () to save memory, and read all chr2 coordinates.
Other possible improvements: instead of using a hash for the coordinates, use two arrays (for begin and end), and then do a binary search in them.
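For illustration, a minimal sketch of that per-chromosome binary search (the @starts/@ends arrays and the upper_bound helper are made up for this example, not something from the thread; the data is the chr2 subset of GrandFather's sample above):
use strict;
use warnings;

# Features for one chromosome, as two parallel arrays sorted by start.
my @starts = ( 150, 280 );
my @ends   = ( 200, 350 );

# Binary search: index of the first feature whose start is > $qend.
sub upper_bound {
    my ( $starts, $qend ) = @_;
    my ( $lo, $hi ) = ( 0, scalar @$starts );
    while ( $lo < $hi ) {
        my $mid = int( ( $lo + $hi ) / 2 );
        if ( $starts->[$mid] <= $qend ) { $lo = $mid + 1 }
        else                            { $hi = $mid }
    }
    return $lo;
}

# Report every feature overlapping the query range 160..210.
my ( $qstart, $qend ) = ( 160, 210 );
my $limit = upper_bound( \@starts, $qend );
for my $i ( 0 .. $limit - 1 ) {
    # Overlap: starts before the query ends (guaranteed by the search)
    # and ends at or after the query start.
    print "$starts[$i] $ends[$i]\n" if $ends[$i] >= $qstart;
}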
Of course a relational database like SQLite or Postgres can do all these things for you if you build appropriate indexes over the columns.
Re: Reducing memory footprint when doing a lookup of millions of coordinates
by BrowserUk (Patriarch) on Feb 27, 2011 at 10:31 UTC
First, a question. Is it possible to have two (or more) features that start at the same position in the same chromosome, but end at different positions?
If so, your current data structure will only record the last one read from the file.
Assuming that's okay, then moving from using a HoHoHs to a HoHoAs:
#!/usr/bin/perl -slw
use strict;
use constant { END => 0, REP => 1 };
my %reps;
while (<>) {
    chomp;
    my @array = split;
    $reps{ $array[0] }{ $array[1] } = [ $array[2], $array[3] ];
}
my $start = 160;
my $end   = 210;
my $chr   = "chr2";
for my $s ( sort { $a <=> $b } keys %{ $reps{ $chr } } ) {
    if ( $start <= $reps{ $chr }{ $s }[ END ] ) {
        last if $s >= $end;
        print "$chr $s $reps{ $chr }{ $s }[ END ] $reps{ $chr }{ $s }[ REP ]\n";
    }
}
Will likely save you ~25% of your memory usage and run a little faster.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Cheers BrowserUK,
I'll definitely give that a go.
Out of interest, is there an advantage to using the constants END and REP rather than 0 and 1? I've not used that before.
Many thanks for your help
Rich
is there an advantage to using the constant for END and REP rather than 0 and 1?
Beyond a little extra clarity, no. I did it to make the two versions visibly comparable.
There might be some extra memory savings to be had if you could give a clearer idea of the numbers involved.
Ie. How many chromosomes? Approximate maximum lengths of both the chromosomes and the ranges?
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Reducing memory footprint when doing a lookup of millions of coordinates
by mellon85 (Monk) on Feb 27, 2011 at 10:24 UTC
Have you considered using a database instead of loading it all into memory? It may not be as fast as it is now (except for the loading time), but the memory footprint IMHO can be heavily reduced.
Hi mellon85 and moritz,
Thank you for your replies.
I definitely like the idea of the database. I think this will work very well in the context of the rest of the program I'm writing.
I'm thinking I could distribute the SQLite database file, preloaded with the feature data and indexed, with the software and query it when needed!
Many thanks,
Rich
The database is a really good idea. I doubt that SQLite keeps the whole database file in memory itself. You could even put the file on a RAM disk or similar as a speed bonus.
You can still use the same code (with minor modifications) to do the lookups, as sketched after this list:
- Get a unique list of all chr.
- For each chr, retrieve all entries (using retrieve_hashref).
- Use the existing code (but work on hashrefs).
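A rough sketch of what such a lookup might look like with DBI and SQLite (the database file name, the features table, and its column names are assumptions made for this example, not anything agreed in the thread):
use strict;
use warnings;
use DBI;

# Assumed schema: features(chr TEXT, startpos INTEGER, endpos INTEGER, rep TEXT)
# with an index on (chr, startpos).
my $dbh = DBI->connect( 'dbi:SQLite:dbname=features.db', '', '',
    { RaiseError => 1 } );

my $sth = $dbh->prepare(
    'SELECT chr, startpos, endpos, rep FROM features
      WHERE chr = ? AND startpos <= ? AND endpos >= ?'
);

# Everything on chr2 overlapping 160..210.
$sth->execute( 'chr2', 210, 160 );
while ( my $row = $sth->fetchrow_hashref ) {
    print "$row->{chr} $row->{startpos} $row->{endpos} $row->{rep}\n";
}
$dbh->disconnect;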
Re: Reducing memory footprint when doing a lookup of millions of coordinates
by JavaFan (Canon) on Feb 27, 2011 at 14:39 UTC
a) Is there a better way to do this?
b) Can the memory footprint be reduced any?
Two very different questions. To answer the second question first, an easy way to reduce the memory footprint is, for each query, to read the file line by line and report whether it overlaps. I bet you are now saying "but that's too slow". With many problems, there's a trade-off between memory usage and processing time. Reduce the memory usage, and the processing time goes up. Just asking for "reduce the memory usage" without saying anything about processing time may not get the answer you are looking for.
As for your first question, it depends. What is "better" in your opinion? *My* opinion of better is to reduce query time and invest in memory and preprocessing time: build a segment or interval tree and do queries against that. But that will increase your memory usage, so it's probably not better for you.
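As an illustration of the interval-tree route, here is a sketch using the CPAN module Set::IntervalTree (its insert/fetch interface as I recall it; check the module's documentation, in particular whether intervals are treated as half-open, which is assumed below):
use strict;
use warnings;
use Set::IntervalTree;    # CPAN module, interface assumed from its docs

my %trees;    # one interval tree per chromosome

# Build the trees from the reference features once.
while ( my $line = <DATA> ) {
    my ( $chr, $start, $end, $rep ) = split ' ', $line;
    $trees{$chr} ||= Set::IntervalTree->new;
    # Assuming half-open intervals, so add 1 to the inclusive end.
    $trees{$chr}->insert( "$start $end $rep", $start, $end + 1 );
}

# Query: everything on chr2 overlapping 160..210 (inclusive).
my $hits = $trees{chr2} ? $trees{chr2}->fetch( 160, 211 ) : [];
print "chr2 $_\n" for @$hits;

__DATA__
chr2 150 200 feature1
chr2 280 350 feature1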
Re: Reducing memory footprint when doing a lookup of millions of coordinates
by umasuresh (Hermit) on Feb 27, 2011 at 14:10 UTC
|
Hi Rich,
Take a look at my experience with developing an algorithm to deal with something similar Re: Algorithms.
HTH!
Uma
UPDATE :
In this task, I did exactly what moritz suggests: I grouped the search chromosome-wise and submitted 24 jobs to a cluster!
Re: Reducing memory footprint when doing a lookup of millions of coordinates
by nikosv (Deacon) on Feb 27, 2011 at 16:29 UTC
Maybe this is a case for Memoization.
In computing, memoization is an optimization technique used primarily to speed up computer programs by having function calls avoid repeating the calculation of results for previously processed inputs.
The Memoize module makes it easy to use by wrapping around your function.
It might increase the memory footprint but make the processing faster, so it is better to benchmark it, and if you do use it after all I would be interested in hearing the test results.
Memoization is not suitable for your project; I read your question hastily and misinterpreted it.
Be aware that memoisation will not help your problem at all.
Memoisation speeds up repetitive calculations by caching the results of those calculations to avoid recalculating them. It uses (often prodigious amounts of) memory to gain speed.
Since your stated goal is to reduce memory usage, and since you have no repetitive calculations, any attempt to use the linked module could not help your performance and would increase your memory footprint, probably to the point of out-of-memory failure.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Reducing memory footprint when doing a lookup of millions of coordinates
by tospo (Hermit) on Feb 28, 2011 at 09:50 UTC
There are quite a few database solutions for this problem. SQL itself isn't that great for range queries, but there are binning schemes that basically extend the idea of doing the query by chromosome and narrow the query down to a region. Check out Bio::DB::GFF for example. The underlying binning scheme is described in this 2002 paper by Lincoln Stein et al.
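As a rough illustration of the binning idea (a simplified single-level scheme, not the hierarchical one from the paper; the bin width and the helper names are made up for this sketch):
use strict;
use warnings;

my $BIN = 100_000;    # arbitrary bin width for this sketch
my %index;            # $index{$chr}{$bin} = [ [ start, end, rep ], ... ]

# Index each feature under every bin its range touches.
sub add_feature {
    my ( $chr, $start, $end, $rep ) = @_;
    for my $bin ( int( $start / $BIN ) .. int( $end / $BIN ) ) {
        push @{ $index{$chr}{$bin} }, [ $start, $end, $rep ];
    }
}

# A query only has to look at the bins its range touches,
# not the whole chromosome.
sub overlaps {
    my ( $chr, $qstart, $qend ) = @_;
    my ( @hits, %seen );
    for my $bin ( int( $qstart / $BIN ) .. int( $qend / $BIN ) ) {
        for my $f ( @{ $index{$chr}{$bin} || [] } ) {
            my ( $start, $end, $rep ) = @$f;
            next if $end < $qstart or $start > $qend;    # no overlap
            push @hits, $f unless $seen{"$start,$end,$rep"}++;
        }
    }
    return @hits;
}

add_feature( 'chr2', 150, 200, 'feature1' );
add_feature( 'chr2', 280, 350, 'feature1' );
print "@$_\n" for overlaps( 'chr2', 160, 210 );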
Postgres may not be as portable as SQLite, but Postgres has geometric data types and functions, and if you treat begin and end as a box of no height on the x axis, Postgres's box operators will work. For example, a record with Begin=100 and End=200 becomes the box ((100,0),(200,0)).
http://www.postgresql.org/docs/8.2/static/functions-geometry.html
Your sample data of 5301598 records took me an hour to load into a default installation of Postgres, but it gives very quick responses.
###SQL###
drop table chromosomes;
create table chromosomes(
score integer
,div numeric(6,1)
,del numeric(6,1)
,ins numeric(6,1)
,q_sequence text
--,begin1 integer
--,end1 integer
,pos box
,left_paren text
,repeat1 text
,repeat2 text
,class_family text
,begin2 text
,end2 text
,left_paren2 text
,ID integer
);
create index idx_chromosomes_1 on chromosomes(q_sequence);
create index idx_chromosomes_2 on chromosomes using gist(pos);
###PERL###
use strict;
use warnings;
use DBI;
my($dbh);
system('date');
$dbh = DBI->connect('dbi:Pg:dbname=test1','','',{AutoCommit=>0})
or die $dbh->errstr;
#populate($dbh);
select_test($dbh);
$dbh->disconnect;
system('date');
exit 0;
sub populate{
    my($dbh)=@_;
    my($sth,$sql,$cnt);
    $cnt=0;
    $sql="insert into chromosomes values (?,?,?,?,?,?,?,?,?,?,?,?,?,?)";
    $sth=$dbh->prepare($sql) or die $dbh->errstr;
    eval{
        open my $fh , "<", "hg19.fa.out" or die $!;
        while(my $line=<$fh>){
            $cnt++;
            next if $cnt <= 3;
            chomp $line;
            $line =~ s/^\s+//;
            my @vals=split(/\s+/, $line);
            #begin and end to box type
            $vals[5]="(($vals[5],0),($vals[6],0))";
            splice(@vals,6,1);
            $sth->execute(@vals);
        }
        close $fh;
        $sth->finish;
    };
    if($@){
        $dbh->rollback;
    }else {
        $dbh->commit;
    }
}
sub select_test{
    my($dbh)=@_;
    my($sth,$sql);
    #SQL for overlaps
    #$sql="select * from chromosomes where pos && box '((10000,0),(20000,0))'";
    #SQL for include
    $sql="select * from chromosomes where pos @ box '((10000,0),(20000,0))'";
    $sth=$dbh->prepare($sql) or die $dbh->errstr;
    $sth->execute or die $dbh->errstr;
    while(my $href= $sth->fetchrow_hashref){
        print $href->{pos} . "\n";
    }
    $sth->finish;
}
Interesting, I wasn't aware of these (I'm not using PostgreSQL very much).
Re: Reducing memory footprint when doing a lookup of millions of coordinates
by BrowserUk (Patriarch) on Mar 01, 2011 at 03:24 UTC
Re: Reducing memory footprint when doing a lookup of millions of coordinates
by repellent (Priest) on Mar 04, 2011 at 06:59 UTC
On my measly VM (256MB memory) running PostgreSQL 8.2 on FreeBSD 6.4, it took less than 8 minutes to load the entire file into a table. Then, given:
chromosome = chr2
starting = 21000
ending = 23200
returns the overlaps in less than half a second.
$ time perl -alne 'print join q{,} => @F[4,5,6,9] if $. > 3' hg19.fa.out | psql -qc "
CREATE TABLE tt (chromosome text, starting integer, ending integer, feature text);
COPY tt (chromosome, starting, ending, feature) FROM STDIN WITH CSV;
CREATE INDEX idx_tt_chromosome ON tt (chromosome);
"
perl -alne 'print join q{,} => @F[4,5,6,9] if $. > 3' hg19.fa.out  67.31s user 82.26s system 71% cpu 3:29.33 total
psql -qc  0.27s user 5.68s system 2% cpu 4:15.92 total
pgsql=> EXPLAIN ANALYZE
pgsql-> SELECT *
pgsql-> FROM
pgsql-> tt
pgsql-> WHERE
pgsql-> chromosome = 'chr2'
pgsql-> AND 21000 <= ending
pgsql-> AND starting <= 23200;
                                                            QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on tt  (cost=516.43..38178.20 rows=2945 width=72) (actual time=129.490..341.722 rows=2 loops=1)
   Recheck Cond: (chromosome = 'chr2'::text)
   Filter: ((21000 <= ending) AND (starting <= 23200))
   ->  Bitmap Index Scan on idx_tt_chromosome  (cost=0.00..515.69 rows=26508 width=0) (actual time=128.165..128.165 rows=414622 loops=1)
         Index Cond: (chromosome = 'chr2'::text)
 Total runtime: 341.955 ms
pgsql=> SELECT *
pgsql-> FROM
pgsql-> tt
pgsql-> WHERE
pgsql-> chromosome = 'chr2'
pgsql-> AND 21000 <= ending
pgsql-> AND starting <= 23200;
chromosome | starting | ending | feature
------------+----------+--------+---------
chr2 | 21258 | 21370 | MIRb
chr2 | 22095 | 23394 | L1PA14
(2 rows)