Re: Using indexing for faster lookup in large file

I think only the first two numbers actually matter, because it seems clear from your example data that the second number is always the NCBI taxonomy ID (tax_id) [1]. The whole string after that tax_id follows from it, i.e. the same tax_id always comes with the same taxonomy string.

So you'd first have to compile a list/table of unique taxonomy lines (IIRC there are fewer than 2 million in the NCBI Taxonomy database; of course I don't know how many there will be in your file) with tax_id as the primary key. Then make a second list/table with just the first and second number of each line. (I'd try it out, but with only the 200-line file that doesn't make much sense.)

The two tables (always assuming you store them in an RDBMS) can then be joined on tax_id.
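
A minimal sketch of that two-table layout, in Python with SQLite for brevity (any RDBMS would do). The tab-separated layout (query number, tax_id, taxonomy string), the table names `taxonomy` and `hits`, and the sample lines are all my assumptions, not taken from your file:

```python
import sqlite3

# Made-up sample lines. Assumed layout: query number, tax_id, taxonomy string,
# separated by tabs. The real file may well be delimited differently.
SAMPLE = [
    "1001\t9606\tEukaryota; Chordata; Mammalia; Homo sapiens",
    "1002\t10090\tEukaryota; Chordata; Mammalia; Mus musculus",
    "1003\t9606\tEukaryota; Chordata; Mammalia; Homo sapiens",
]

def build_db(lines):
    """Load lines into two tables: unique taxonomy strings and (query, tax_id) pairs."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE taxonomy (tax_id INTEGER PRIMARY KEY, lineage TEXT)")
    con.execute("CREATE TABLE hits (query_id INTEGER, tax_id INTEGER)")
    con.execute("CREATE INDEX hits_query ON hits (query_id)")
    for line in lines:
        query_id, tax_id, lineage = line.rstrip("\n").split("\t", 2)
        # INSERT OR IGNORE keeps only the first taxonomy string seen per tax_id.
        con.execute("INSERT OR IGNORE INTO taxonomy VALUES (?, ?)",
                    (int(tax_id), lineage))
        con.execute("INSERT INTO hits VALUES (?, ?)", (int(query_id), int(tax_id)))
    con.commit()
    return con

def lookup(con, query_id):
    """Return the taxonomy strings for one query number via a join on tax_id."""
    rows = con.execute(
        "SELECT t.lineage FROM hits h JOIN taxonomy t ON t.tax_id = h.tax_id "
        "WHERE h.query_id = ?", (query_id,))
    return [row[0] for row in rows]

con = build_db(SAMPLE)
print(lookup(con, 1003))
```

The long taxonomy string is stored once per tax_id, so the big table holds only two integers per line.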

Of course, if you don't expect memory problems the same thing can be done in hashes as well.
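
The in-memory variant could look roughly like this (Python dicts standing in for Perl hashes). Again I'm assuming tab-separated lines of query number, tax_id and taxonomy string, and that each query number occurs only once; the names and sample lines are made up for illustration:

```python
# Made-up sample lines. Assumed layout: query number, tax_id, taxonomy string,
# separated by tabs.
SAMPLE_LINES = [
    "1001\t9606\tEukaryota; Chordata; Mammalia; Homo sapiens",
    "1002\t10090\tEukaryota; Chordata; Mammalia; Mus musculus",
    "1003\t9606\tEukaryota; Chordata; Mammalia; Homo sapiens",
]

def build_hashes(lines):
    """Return (tax_by_query, lineage_by_tax) built in one pass over the lines."""
    tax_by_query = {}
    lineage_by_tax = {}
    for line in lines:
        query_id, tax_id, lineage = line.rstrip("\n").split("\t", 2)
        # Assumes query numbers are unique; a repeated one overwrites the earlier entry.
        tax_by_query[query_id] = tax_id
        # Store each taxonomy string only once, keyed by tax_id.
        lineage_by_tax.setdefault(tax_id, lineage)
    return tax_by_query, lineage_by_tax

def lookup_lineage(tax_by_query, lineage_by_tax, query_id):
    """Two hash lookups: query number -> tax_id -> taxonomy string (None if unknown)."""
    tax_id = tax_by_query.get(query_id)
    return lineage_by_tax.get(tax_id) if tax_id is not None else None

tax_by_query, lineage_by_tax = build_hashes(SAMPLE_LINES)
print(lookup_lineage(tax_by_query, lineage_by_tax, "1002"))
```

Memory use is dominated by the first hash (one small entry per line); the second one only grows with the number of distinct tax_ids.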

Alternatively, you could make a table with your query numbers (what the hell are these numbers anyway?) together with line offsets (i.e., a variant of BrowserUK's solution). As always, storing the values and offsets in a DBMS table will be slower to search than looking them up in a hash, but it will be less dependent on having enough memory.
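
To show what I mean by offsets, here is a rough sketch of the general idea (not BrowserUK's actual code): record the byte offset at which each line starts, keyed by the query number, then seek straight to that offset at lookup time. The file layout (query number first, tab-separated), the function names and the sample data are assumptions of mine. The index here is a plain dict, but the same (query number, offset) pairs could just as well go into a database table:

```python
import os
import tempfile

# Made-up sample lines. Assumed layout: query number first, tab-separated.
SAMPLE_ROWS = [
    "1001\t9606\tEukaryota; Chordata; Mammalia; Homo sapiens",
    "1002\t10090\tEukaryota; Chordata; Mammalia; Mus musculus",
]

def build_offset_index(path):
    """Map each query number to the byte offset where its line starts."""
    index = {}
    # Binary mode so that tell() returns exact byte positions.
    with open(path, "rb") as fh:
        while True:
            offset = fh.tell()
            line = fh.readline()
            if not line:
                break
            query_id = line.split(b"\t", 1)[0].decode()
            index[query_id] = offset
    return index

def fetch_line(path, index, query_id):
    """Seek to the stored offset and read just that one line."""
    with open(path, "rb") as fh:
        fh.seek(index[query_id])
        return fh.readline().decode().rstrip("\n")

# Demo on a throwaway file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, newline="\n") as tmp:
    tmp.write("\n".join(SAMPLE_ROWS) + "\n")
    demo_path = tmp.name

offset_index = build_offset_index(demo_path)
fetched = fetch_line(demo_path, offset_index, "1002")
os.remove(demo_path)
print(fetched)
```

Only the index has to fit in memory (or in the table); the big file itself is read one line at a time.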

[1] NCBI Taxonomy page: http://www.ncbi.nlm.nih.gov/taxonomy (there is also an FTP link there, but the files provided are not in the form of your nice human-readable, hierarchy-enumerating taxonomy lines, so you'd have to compile such lines from that data; it seems easier to get them from your own database file.)


Re^2: Using indexing for faster lookup in large file by Your Mother (Archbishop) on Feb 28, 2015 at 06:24 UTC
I thought it might be from MeSH at first but after seeing more data, I suspect you’re right. The code I provided, I think, is better (with whatever tweaks the user/dev needs) for search than ~~RMDBS~~ RDBMS code and certainly faster.

