comment on

On my measly VM (256MB memory) running PostgreSQL 8.2 on FreeBSD 6.4, it took less than 8 minutes to load the entire file into a table. Then, given:

  chromosome = chr2
  starting   = 21000
  ending     = 23200
[download]

returns the overlaps in less than half a second.

$ time perl -alne 'print join q{,} => @F[4,5,6,9] if $. > 3' hg19.fa.o
+ut | psql -qc "
  CREATE TABLE tt (chromosome text, starting integer, ending integer, 
+feature text);
  COPY tt (chromosome, starting, ending, feature) FROM STDIN WITH CSV;
  CREATE INDEX idx_tt_chromosome ON tt (chromosome);
"
perl -alne 'print join q{,} => @F[4,5,6,9] if $. > 3' hg19.fa.out  67.
+31s user 82.26s system 71% cpu 3:29.33 total
psql -qc   0.27s user 5.68s system 2% cpu 4:15.92 total


pgsql=> EXPLAIN ANALYZE
pgsql-> SELECT *
pgsql-> FROM
pgsql->   tt                                            
pgsql-> WHERE
pgsql->   chromosome = 'chr2'
pgsql->   AND 21000 <= ending
pgsql->   AND starting <= 23200;
                                                               QUERY P
+LAN                                                                
----------------------------------------------------------------------
+-------------------------------------------------------------------
 Bitmap Heap Scan on tt  (cost=516.43..38178.20 rows=2945 width=72) (a
+ctual time=129.490..341.722 rows=2 loops=1)
   Recheck Cond: (chromosome = 'chr2'::text)
   Filter: ((21000 <= ending) AND (starting <= 23200))
   ->  Bitmap Index Scan on idx_tt_chromosome  (cost=0.00..515.69 rows
+=26508 width=0) (actual time=128.165..128.165 rows=414622 loops=1)
         Index Cond: (chromosome = 'chr2'::text)
 Total runtime: 341.955 ms


pgsql=> SELECT *
pgsql-> FROM
pgsql->   tt                                            
pgsql-> WHERE
pgsql->   chromosome = 'chr2'
pgsql->   AND 21000 <= ending
pgsql->   AND starting <= 23200;

 chromosome | starting | ending | feature 
------------+----------+--------+---------
 chr2       |    21258 |  21370 | MIRb
 chr2       |    22095 |  23394 | L1PA14
(2 rows)
[download]

In reply to Re: Reducing memory footprint when doing a lookup of millions of coordinates by repellent
in thread Reducing memory footprint when doing a lookup of millions of coordinates by richardwfrancis

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.