
Re: Using indexing for faster lookup in large file

by erix (Prior)
on Feb 28, 2015 at 02:25 UTC


in reply to Using indexing for faster lookup in large file

I think only the first two numbers actually matter: judging from your example data, the second number is always the NCBI taxonomy ID (tax_id) [1], and the whole string after the tax_id is determined by it.

So you'd first have to compile a list/table of unique taxonomy lines, with tax_id as the primary key (IIRC there are fewer than 2 million in the NCBI Taxonomy database; of course I don't know how many there will be in your file). Then make a second list/table with just the first and second number of each line. (I'd try it out, but with only the 200-line file that doesn't make much sense.)

The two tables (always assuming you store them in an RDBMS) can then be joined on tax_id.
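A minimal sketch of that two-table layout, using an in-memory SQLite database from Python (the same schema would carry over to DBI and any other RDBMS). The file layout is an assumption on my part: tab-separated lines of query number, tax_id, lineage string. The sample rows and the table names `taxonomy` and `queries` are made up for illustration.

```python
import sqlite3

# Hypothetical sample lines; the real layout is assumed to be
# query number <TAB> tax_id <TAB> lineage string.
SAMPLE = [
    "1001\t9606\tEukaryota; Metazoa; Chordata; Homo sapiens",
    "1002\t10090\tEukaryota; Metazoa; Chordata; Mus musculus",
    "1003\t9606\tEukaryota; Metazoa; Chordata; Homo sapiens",
]

def build_db(lines):
    """Load lines into two tables: unique lineages and query -> tax_id."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE taxonomy (tax_id INTEGER PRIMARY KEY, lineage TEXT)")
    db.execute("CREATE TABLE queries (query_id INTEGER PRIMARY KEY, tax_id INTEGER)")
    for line in lines:
        query_id, tax_id, lineage = line.rstrip("\n").split("\t", 2)
        # Each lineage is stored once; repeats of a tax_id are ignored.
        db.execute("INSERT OR IGNORE INTO taxonomy VALUES (?, ?)",
                   (int(tax_id), lineage))
        db.execute("INSERT INTO queries VALUES (?, ?)",
                   (int(query_id), int(tax_id)))
    db.commit()
    return db

def lookup(db, query_id):
    """Return (tax_id, lineage) for a query number, or None if unknown."""
    return db.execute(
        "SELECT q.tax_id, t.lineage "
        "FROM queries q JOIN taxonomy t ON t.tax_id = q.tax_id "
        "WHERE q.query_id = ?",
        (query_id,),
    ).fetchone()

db = build_db(SAMPLE)
print(lookup(db, 1003))
```

With real data you would write to a file-backed database instead of `:memory:` and wrap the inserts in one transaction, so the load is paid only once.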

Of course, if you don't expect memory problems the same thing can be done in hashes as well.
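For comparison, a rough in-memory version of the same idea, shown with Python dicts (the structure maps one-to-one onto two Perl hashes). It assumes the same tab-separated layout of query number, tax_id, lineage string; the sample rows and function names are hypothetical.

```python
# Hypothetical sample lines; the layout is assumed to be
# query number <TAB> tax_id <TAB> lineage string.
SAMPLE = [
    "1001\t9606\tEukaryota; Metazoa; Chordata; Homo sapiens",
    "1002\t10090\tEukaryota; Metazoa; Chordata; Mus musculus",
    "1003\t9606\tEukaryota; Metazoa; Chordata; Homo sapiens",
]

def build_tables(lines):
    """Return (tax_by_query, lineage_by_tax) built from the input lines."""
    tax_by_query = {}
    lineage_by_tax = {}
    for line in lines:
        query_id, tax_id, lineage = line.rstrip("\n").split("\t", 2)
        tax_by_query[query_id] = tax_id
        # Keep only the first lineage string seen for each tax_id.
        lineage_by_tax.setdefault(tax_id, lineage)
    return tax_by_query, lineage_by_tax

def lookup(tax_by_query, lineage_by_tax, query_id):
    """Two hash lookups replace the SQL join; None if the query is unknown."""
    tax_id = tax_by_query.get(query_id)
    if tax_id is None:
        return None
    return tax_id, lineage_by_tax[tax_id]

tax_by_query, lineage_by_tax = build_tables(SAMPLE)
print(lookup(tax_by_query, lineage_by_tax, "1003"))
```

Memory use then grows with the number of query numbers plus the number of distinct tax_ids, not with the total size of the file.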

Alternatively, you could make a table with your query numbers (what the hell are these numbers anyway?) together with line offsets (i.e., a variant of BrowserUK's solution). As always, storing the values and offsets in a dbms/table will be slower to search than searching them in a hash but it will be less dependent on having enough memory.
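The offset variant could look roughly like this. It is my own sketch of the general idea, not BrowserUK's actual code: it assumes the query number is the first tab-separated field on each line, and it keeps the index in a dict for brevity (the same key/offset pairs could go into a DBMS table instead). The sample rows, file name and function names are hypothetical.

```python
import os
import tempfile

# Hypothetical sample lines; the query number is assumed to be the first field.
SAMPLE = [
    "1001\t9606\tEukaryota; Metazoa; Chordata; Homo sapiens",
    "1002\t10090\tEukaryota; Metazoa; Chordata; Mus musculus",
    "1003\t9606\tEukaryota; Metazoa; Chordata; Homo sapiens",
]

def build_offset_index(path):
    """Map each query number to the byte offset where its line starts."""
    index = {}
    offset = 0
    with open(path, "rb") as fh:  # binary mode keeps offsets in bytes
        for line in fh:
            query_id = line.split(b"\t", 1)[0].decode()
            index[query_id] = offset
            offset += len(line)
    return index

def fetch_line(path, index, query_id):
    """Seek straight to the stored offset and read one line; None if unknown."""
    offset = index.get(query_id)
    if offset is None:
        return None
    with open(path, "rb") as fh:
        fh.seek(offset)
        return fh.readline().decode().rstrip("\n")

# Small demonstration on a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "lookup.tsv")
    with open(path, "w", newline="\n") as out:
        out.write("\n".join(SAMPLE) + "\n")
    index = build_offset_index(path)
    print(fetch_line(path, index, "1002"))
```

Only the query numbers and one integer per line are held in memory; the lineage strings stay on disk until a line is actually requested.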

[1] NCBI Taxonomy page: http://www.ncbi.nlm.nih.gov/taxonomy (there is also an FTP link there, but the files provided are not in the form of your nice human-readable, hierarchy-enumerating taxonomy lines, so you'd have to compile such lines from that data; it seems easier to get them from your own database file.)

Replies are listed 'Best First'.
Re^2: Using indexing for faster lookup in large file
by Your Mother (Archbishop) on Feb 28, 2015 at 06:24 UTC

    I thought it might be from MeSH at first but after seeing more data, I suspect you’re right. The code I provided, I think, is better for search (with whatever tweaks the user/dev needs) than RDBMS code, and certainly faster.
