PerlMonks |
Re: Using indexing for faster lookup in large file
by erix (Prior) on Feb 28, 2015 at 02:25 UTC ( [id://1118150]=note )
I think only the first two numbers actually matter, because it seems clear from your example data that the second number is always the NCBI taxonomy ID (tax_id) [1]. The whole string after that tax_id follows from it.

So you'd first have to compile a list/table of unique taxonomy lines, with tax_id as the primary key (IIRC there are fewer than 2 million in the NCBI Taxonomy database; of course I don't know how many there will be in your file). Then make a second list/table with just the first and second number of each line. (I'd try it out, but with only the 200-line file that doesn't make much sense.) The two tables (always assuming you store them in an RDBMS) can then be joined on tax_id. Of course, if you don't expect memory problems, the same thing can be done with hashes as well.

Alternatively, you could make a table with your query numbers (what the hell are these numbers anyway?) together with line offsets (i.e., a variant of BrowserUK's solution).

As always, storing the values and offsets in a DBMS table will be slower to search than a hash, but it will be less dependent on having enough memory.

[1] NCBI Taxonomy page: http://www.ncbi.nlm.nih.gov/taxonomy (there is also an FTP link there, but the files provided are not in the form of your nice human-readable, hierarchy-enumerating taxonomy lines, so you'd have to compile such lines from that data; it seems easier to get them from your own database file).
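To make the two ideas above a bit more concrete, here is a rough, untested-on-your-data sketch of both the hash version of the two-table split and the offset index. It assumes each line is tab-separated as query number, tax_id, taxonomy string; that layout is a guess, since the real file isn't shown here, and all the sub names (`build_tables`, `lookup_lineage`, `build_offset_index`, `fetch_line`) are made up for illustration.

```perl
use strict;
use warnings;

# ASSUMPTION: every line is "query_id<TAB>tax_id<TAB>taxonomy string".
# Adjust the split/regex below to whatever the real file looks like.

# Idea 1: two hashes standing in for the two tables.
sub build_tables {
    my ($fh) = @_;
    my (%tax_of, %lineage_of);
    while (my $line = <$fh>) {
        chomp $line;
        my ($query_id, $tax_id, $lineage) = split /\t/, $line, 3;
        next unless defined $lineage;          # skip lines that don't fit the assumed layout
        $tax_of{$query_id}    = $tax_id;       # "table" 2: query number -> tax_id
        $lineage_of{$tax_id} //= $lineage;     # "table" 1: one entry per unique tax_id
    }
    return (\%tax_of, \%lineage_of);
}

# The hash equivalent of joining the two tables on tax_id.
sub lookup_lineage {
    my ($tax_of, $lineage_of, $query_id) = @_;
    my $tax_id = $tax_of->{$query_id};
    return unless defined $tax_id;
    return $lineage_of->{$tax_id};
}

# Idea 2: remember only the byte offset at which each query number's line starts.
sub build_offset_index {
    my ($fh) = @_;
    my %offset_of;
    while (1) {
        my $pos = tell $fh;                    # offset of the line about to be read
        defined(my $line = <$fh>) or last;
        my ($query_id) = $line =~ /^([^\t]+)\t/ or next;
        $offset_of{$query_id} = $pos;
    }
    return \%offset_of;
}

# Jump straight to a line using the stored offset and read it back.
sub fetch_line {
    my ($fh, $offset_of, $query_id) = @_;
    my $pos = $offset_of->{$query_id};
    return unless defined $pos;
    seek $fh, $pos, 0 or die "seek failed: $!";
    my $line = <$fh>;
    chomp $line;
    return $line;
}
```

Both builders take an already opened read filehandle. The first keeps every unique taxonomy string in memory but only once per tax_id; the second keeps just one number per query and goes back to the file for each lookup, so it needs far less memory at the cost of a seek per query. Either hash could be swapped for a database table if memory gets tight.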