in reply to Efficient way to handle huge number of records?
On a slow desktop (which is also busy doing other database stuff), I get the below performance with PostgreSQL (9.2devel git/HEAD).
A 10,000,000-sequence FASTA file was made with BrowserUk's script (adapted - thanks). This yielded a 1.2G junk.dat file:
-rw-rw-r-- 1 aardvark aardvark 1.2G Dec 11 15:13 junk.dat

grep -c '^>' junk.dat
10000000

# one-off: create database table 'junk':
time < junk.dat psql -c "
    drop table if exists junk;
    create table junk (accession text, sequence text);
    copy junk from stdin csv delimiter E'\t';
"
real    1m35.215s

# one-off: create index on database table 'junk':
time echo "
    create index junk_idx on junk (accession);
    analyze junk;
" | psql
real    7m58.633s
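(BrowserUk's generator isn't reproduced here; the sketch below is only a guess at what the adapted version might look like, inferred from what the COPY above expects: one ">accession<TAB>sequence" line per record. The 40-letter accessions match the names shown further down; the sequence length and record count are assumptions.)

#!/usr/bin/perl
# Hypothetical stand-in for the adapted generator: writes one
# ">accession<TAB>sequence" line per record into junk.dat, the format
# the COPY ... delimiter E'\t' command above expects.
use strict;
use warnings;

my $n = shift // 10_000_000;        # record count - assumption
my @acgt = qw( a c g t );

open my $out, '>', 'junk.dat' or die "junk.dat: $!";
for ( 1 .. $n ) {
    my $accession = join '', map { ('a'..'z')[ rand 26 ] } 1 .. 40;   # 40 lowercase letters, as in the results below
    my $sequence  = join '', map { $acgt[ rand @acgt ]   } 1 .. 80;   # sequence length is a guess
    print $out ">$accession\t$sequence\n";
}
close $out;

Note that with this layout the leading '>' ends up inside the accession column, which is why the names in the retrieval results below still start with '>'.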
Now the interesting part: retrieval. (I don't know why others aren't posting retrieval results.)
I retrieved 10 random rows, and did that 4 times in succession to show the caching effect.
name (protein, or dna?)                     retrieval time
>ewwhsefmyqqadboyjannenpqvtntiqjjheuwdvuh   Time: 160.624 ms
>braniambsateowfjdmbgsmflbsosllhzureeponx   Time: 124.207 ms
>aulshznwumwcbqcgamrgpgysmkbkykjmdypqgglc   Time: 46.639 ms
>sadwirrivjzbezsrgcxpeavketadfvszwamxzwer   Time: 74.509 ms
>pqrjkynzzeromipzagvunnudrygxbdvokghnkmes   Time: 83.611 ms
>mfjqtvvvoibwubqyrshwtybpdyexxsrsyndyjbgc   Time: 69.637 ms
>dwiepreektknecinzqzbyjwsxcjblszxqhflhivd   Time: 37.553 ms
>ifuwdcrwwofedjeehdheryzcabhkpfnbnfqcddkk   Time: 94.236 ms
>thfhatnyothgvtmwwdsflyukanykawezrlvnkgyy   Time: 56.977 ms
>qspsniccydlcrbcinkmbtoryktkuysvhkpziamnu   Time: 50.896 ms

>ewwhsefmyqqadboyjannenpqvtntiqjjheuwdvuh   Time: 3.799 ms
>braniambsateowfjdmbgsmflbsosllhzureeponx   Time: 2.029 ms
>aulshznwumwcbqcgamrgpgysmkbkykjmdypqgglc   Time: 1.943 ms
>sadwirrivjzbezsrgcxpeavketadfvszwamxzwer   Time: 1.883 ms
>pqrjkynzzeromipzagvunnudrygxbdvokghnkmes   Time: 2.046 ms
>mfjqtvvvoibwubqyrshwtybpdyexxsrsyndyjbgc   Time: 2.529 ms
>dwiepreektknecinzqzbyjwsxcjblszxqhflhivd   Time: 1.863 ms
>ifuwdcrwwofedjeehdheryzcabhkpfnbnfqcddkk   Time: 1.952 ms
>thfhatnyothgvtmwwdsflyukanykawezrlvnkgyy   Time: 2.042 ms
>qspsniccydlcrbcinkmbtoryktkuysvhkpziamnu   Time: 2.011 ms

>ewwhsefmyqqadboyjannenpqvtntiqjjheuwdvuh   Time: 2.632 ms
>braniambsateowfjdmbgsmflbsosllhzureeponx   Time: 1.992 ms
>aulshznwumwcbqcgamrgpgysmkbkykjmdypqgglc   Time: 1.889 ms
>sadwirrivjzbezsrgcxpeavketadfvszwamxzwer   Time: 5.196 ms
>pqrjkynzzeromipzagvunnudrygxbdvokghnkmes   Time: 2.091 ms
>mfjqtvvvoibwubqyrshwtybpdyexxsrsyndyjbgc   Time: 2.177 ms
>dwiepreektknecinzqzbyjwsxcjblszxqhflhivd   Time: 1.877 ms
>ifuwdcrwwofedjeehdheryzcabhkpfnbnfqcddkk   Time: 2.092 ms
>thfhatnyothgvtmwwdsflyukanykawezrlvnkgyy   Time: 12.836 ms
>qspsniccydlcrbcinkmbtoryktkuysvhkpziamnu   Time: 1.867 ms

>ewwhsefmyqqadboyjannenpqvtntiqjjheuwdvuh   Time: 1.965 ms
>braniambsateowfjdmbgsmflbsosllhzureeponx   Time: 1.929 ms
>aulshznwumwcbqcgamrgpgysmkbkykjmdypqgglc   Time: 1.857 ms
>sadwirrivjzbezsrgcxpeavketadfvszwamxzwer   Time: 1.999 ms
>pqrjkynzzeromipzagvunnudrygxbdvokghnkmes   Time: 2.009 ms
>mfjqtvvvoibwubqyrshwtybpdyexxsrsyndyjbgc   Time: 1.981 ms
>dwiepreektknecinzqzbyjwsxcjblszxqhflhivd   Time: 1.865 ms
>ifuwdcrwwofedjeehdheryzcabhkpfnbnfqcddkk   Time: 1.880 ms
>thfhatnyothgvtmwwdsflyukanykawezrlvnkgyy   Time: 1.878 ms
>qspsniccydlcrbcinkmbtoryktkuysvhkpziamnu   Time: 2.158 ms
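(The "Time: ... ms" lines above look like psql's \timing output; each lookup is just an equality match on the indexed accession column. For anyone who'd rather drive the same test from Perl, here is a minimal DBI/DBD::Pg sketch -- the database name and the way test rows are picked are assumptions, not necessarily what was used above.)

#!/usr/bin/perl
# Rough Perl equivalent of the retrieval test: 10 random accessions,
# looked up 4 times in a row to show the caching effect.
use strict;
use warnings;
use DBI;
use Time::HiRes qw( gettimeofday tv_interval );

my $dbh = DBI->connect( 'dbi:Pg:dbname=test', '', '',          # dbname is an assumption
                        { RaiseError => 1, AutoCommit => 1 } );

# Prepared once, reused for every lookup; junk_idx makes this an index scan.
my $sth = $dbh->prepare(
    'select accession, sequence from junk where accession = ?' );

# Pick 10 random accessions to look up (a full scan, but done only once).
my $random = $dbh->selectcol_arrayref(
    'select accession from junk order by random() limit 10' );

for my $run ( 1 .. 4 ) {
    for my $acc ( @$random ) {
        my $t0 = [ gettimeofday ];
        $sth->execute( $acc );
        my ( $accession, $sequence ) = $sth->fetchrow_array;
        printf "%s  Time: %.3f ms\n", $accession, 1000 * tv_interval( $t0 );
    }
}

The pattern in the numbers is what you'd expect: the first pass has to pull index and heap pages from disk, and once those are cached each lookup settles at around 2 ms.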
That's not so bad, is it? (And remember, this desktop (2GB, AMD Athlon(tm) II 160u, 800 MHz -- yes, it's slow) is in the background doing searches (indexed, but also full table scans) of uniprot, which, as you perhaps know, has more than 18M records.)
(If you cannot get MySQL installed, all of the above may also be a bit over your head; nevertheless, you should consider learning database work -- it will often come in handy in bioinformatics. And since I've now done the work, I'll post it, for what it's worth ;) )
(I may update the timings when the machine becomes otherwise idle)
Replies are listed 'Best First'.

Re^2: Efficient way to handle huge number of records?
by BrowserUk (Patriarch) on Dec 11, 2011 at 22:26 UTC

Re^2: Efficient way to handle huge number of records?
by flexvault (Monsignor) on Dec 11, 2011 at 21:40 UTC