in reply to Efficient way to handle huge number of records?
On a slow desktop (which is also busy doing other database stuff), I get the below performance with PostgreSQL (9.2devel git/HEAD).
A 10,000,000-sequence FASTA file was made with BrowserUk's script (adapted - thanks). This yielded a 1.2G junk.dat file:
-rw-rw-r-- 1 aardvark aardvark 1.2G Dec 11 15:13 junk.dat

grep -c '^>' junk.dat
10000000

# one-off: create database table 'junk':
time < junk.dat psql -c "
    drop table if exists junk;
    create table junk (accession text, sequence text);
    copy junk from stdin csv delimiter E'\t';
"
real    1m35.215s

# one-off: create index on database table 'junk':
time echo "
    create index junk_idx on junk (accession);
    analyze junk;
" | psql
real    7m58.633s
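(BrowserUk's generator isn't reproduced here; the sketch below is only a guess at what the adapted version might look like, inferred from what the COPY above expects: one ">accession<TAB>sequence" line per record. The 40-letter accessions match the names shown further down; the sequence length and record count are assumptions.)

#!/usr/bin/perl
# Hypothetical stand-in for the adapted generator: writes one
# ">accession<TAB>sequence" line per record into junk.dat, the format
# the COPY ... delimiter E'\t' command above expects.
use strict;
use warnings;

my $n = shift // 10_000_000;        # record count - assumption
my @acgt = qw( a c g t );

open my $out, '>', 'junk.dat' or die "junk.dat: $!";
for ( 1 .. $n ) {
    my $accession = join '', map { ('a'..'z')[ rand 26 ] } 1 .. 40;   # 40 lowercase letters, as in the results below
    my $sequence  = join '', map { $acgt[ rand @acgt ]   } 1 .. 80;   # sequence length is a guess
    print $out ">$accession\t$sequence\n";
}
close $out;

Note that with this layout the leading '>' ends up inside the accession column, which is why the names in the retrieval results below still start with '>'.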
Now the interesting part: retrieval. (I don't know why others aren't posting retrieval results.)
I retrieved 10 random rows, and did that 4 times in succession to show the caching effect.
name (protein, or dna?)                     retrieval time
>ewwhsefmyqqadboyjannenpqvtntiqjjheuwdvuh   Time: 160.624 ms
>braniambsateowfjdmbgsmflbsosllhzureeponx   Time: 124.207 ms
>aulshznwumwcbqcgamrgpgysmkbkykjmdypqgglc   Time: 46.639 ms
>sadwirrivjzbezsrgcxpeavketadfvszwamxzwer   Time: 74.509 ms
>pqrjkynzzeromipzagvunnudrygxbdvokghnkmes   Time: 83.611 ms
>mfjqtvvvoibwubqyrshwtybpdyexxsrsyndyjbgc   Time: 69.637 ms
>dwiepreektknecinzqzbyjwsxcjblszxqhflhivd   Time: 37.553 ms
>ifuwdcrwwofedjeehdheryzcabhkpfnbnfqcddkk   Time: 94.236 ms
>thfhatnyothgvtmwwdsflyukanykawezrlvnkgyy   Time: 56.977 ms
>qspsniccydlcrbcinkmbtoryktkuysvhkpziamnu   Time: 50.896 ms

>ewwhsefmyqqadboyjannenpqvtntiqjjheuwdvuh   Time: 3.799 ms
>braniambsateowfjdmbgsmflbsosllhzureeponx   Time: 2.029 ms
>aulshznwumwcbqcgamrgpgysmkbkykjmdypqgglc   Time: 1.943 ms
>sadwirrivjzbezsrgcxpeavketadfvszwamxzwer   Time: 1.883 ms
>pqrjkynzzeromipzagvunnudrygxbdvokghnkmes   Time: 2.046 ms
>mfjqtvvvoibwubqyrshwtybpdyexxsrsyndyjbgc   Time: 2.529 ms
>dwiepreektknecinzqzbyjwsxcjblszxqhflhivd   Time: 1.863 ms
>ifuwdcrwwofedjeehdheryzcabhkpfnbnfqcddkk   Time: 1.952 ms
>thfhatnyothgvtmwwdsflyukanykawezrlvnkgyy   Time: 2.042 ms
>qspsniccydlcrbcinkmbtoryktkuysvhkpziamnu   Time: 2.011 ms

>ewwhsefmyqqadboyjannenpqvtntiqjjheuwdvuh   Time: 2.632 ms
>braniambsateowfjdmbgsmflbsosllhzureeponx   Time: 1.992 ms
>aulshznwumwcbqcgamrgpgysmkbkykjmdypqgglc   Time: 1.889 ms
>sadwirrivjzbezsrgcxpeavketadfvszwamxzwer   Time: 5.196 ms
>pqrjkynzzeromipzagvunnudrygxbdvokghnkmes   Time: 2.091 ms
>mfjqtvvvoibwubqyrshwtybpdyexxsrsyndyjbgc   Time: 2.177 ms
>dwiepreektknecinzqzbyjwsxcjblszxqhflhivd   Time: 1.877 ms
>ifuwdcrwwofedjeehdheryzcabhkpfnbnfqcddkk   Time: 2.092 ms
>thfhatnyothgvtmwwdsflyukanykawezrlvnkgyy   Time: 12.836 ms
>qspsniccydlcrbcinkmbtoryktkuysvhkpziamnu   Time: 1.867 ms

>ewwhsefmyqqadboyjannenpqvtntiqjjheuwdvuh   Time: 1.965 ms
>braniambsateowfjdmbgsmflbsosllhzureeponx   Time: 1.929 ms
>aulshznwumwcbqcgamrgpgysmkbkykjmdypqgglc   Time: 1.857 ms
>sadwirrivjzbezsrgcxpeavketadfvszwamxzwer   Time: 1.999 ms
>pqrjkynzzeromipzagvunnudrygxbdvokghnkmes   Time: 2.009 ms
>mfjqtvvvoibwubqyrshwtybpdyexxsrsyndyjbgc   Time: 1.981 ms
>dwiepreektknecinzqzbyjwsxcjblszxqhflhivd   Time: 1.865 ms
>ifuwdcrwwofedjeehdheryzcabhkpfnbnfqcddkk   Time: 1.880 ms
>thfhatnyothgvtmwwdsflyukanykawezrlvnkgyy   Time: 1.878 ms
>qspsniccydlcrbcinkmbtoryktkuysvhkpziamnu   Time: 2.158 ms
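(The "Time: ... ms" lines above look like psql's \timing output; each lookup is just an equality match on the indexed accession column. For anyone who'd rather drive the same test from Perl, here is a minimal DBI/DBD::Pg sketch -- the database name and the way test rows are picked are assumptions, not necessarily what was used above.)

#!/usr/bin/perl
# Rough Perl equivalent of the retrieval test: 10 random accessions,
# looked up 4 times in a row to show the caching effect.
use strict;
use warnings;
use DBI;
use Time::HiRes qw( gettimeofday tv_interval );

my $dbh = DBI->connect( 'dbi:Pg:dbname=test', '', '',          # dbname is an assumption
                        { RaiseError => 1, AutoCommit => 1 } );

# Prepared once, reused for every lookup; junk_idx makes this an index scan.
my $sth = $dbh->prepare(
    'select accession, sequence from junk where accession = ?' );

# Pick 10 random accessions to look up (a full scan, but done only once).
my $random = $dbh->selectcol_arrayref(
    'select accession from junk order by random() limit 10' );

for my $run ( 1 .. 4 ) {
    for my $acc ( @$random ) {
        my $t0 = [ gettimeofday ];
        $sth->execute( $acc );
        my ( $accession, $sequence ) = $sth->fetchrow_array;
        printf "%s  Time: %.3f ms\n", $accession, 1000 * tv_interval( $t0 );
    }
}

The pattern in the numbers is what you'd expect: the first pass has to pull index and heap pages from disk, and once those are cached each lookup settles at around 2 ms.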
That's not so bad, is it? (And remember, this desktop (2GB, AMD Athlon(tm) II 160u, 800 MHz -- yes, it's slow) is in the background doing searches (indexed, but also full table scans) of uniprot, which, as you perhaps know, has more than 18M records.)
(If you cannot get MySQL installed, all of the above may also be a bit over your head; nevertheless, you should consider learning database work -- it will often come in handy in bioinformatics. And since I've now done the work, I'll post it, for what it's worth ;) )
(I may update the timings when the machine becomes otherwise idle)
Replies are listed 'Best First'.

Re^2: Efficient way to handle huge number of records?
by BrowserUk (Patriarch) on Dec 11, 2011 at 22:26 UTC

Re^2: Efficient way to handle huge number of records?
by flexvault (Monsignor) on Dec 11, 2011 at 21:40 UTC