Beefy Boxes and Bandwidth Generously Provided by pair Networks
The stupid question is the question not asked
 
PerlMonks  

Re: Using indexing for faster lookup in large file

by roboticus (Chancellor)
on Feb 27, 2015 at 21:04 UTC ( [id://1118125]=note: print w/replies, xml ) Need Help??


in reply to Using indexing for faster lookup in large file

anli_:

If the data sample is representative (in that there's a lot of repetition in the data), then you may be able to condense it into a much smaller data structure. If you post a larger sample (200-ish lines in readmore tags) I'll take a quick look and see if I can come up with something.

...roboticus

When your only tool is a hammer, all problems look like your thumb.

  • Comment on Re: Using indexing for faster lookup in large file

Replies are listed 'Best First'.
Re^2: Using indexing for faster lookup in large file
by anli_ (Novice) on Feb 27, 2015 at 23:12 UTC
    Hi, and thanks for taking the time.
    I've included 200 random lines below

      anli_:

      Considering that the bulk of your data appears to be the text representation of various paths through a taxonomy tree, I thought that you might be able to fit it all into memory (taking advantage of all the redundancy) if you built a pair of trees and connected them together at the leaves. For example, if your data looked like this:

      1;2;8;root;xyzzy;cat 1;2;5;root;xyzzy;dog 1;9;root;bird

      Then we could build an index tree (on top) and the taxonomy tree (below), tying them together (shown by vertical lines) like this:

      1 / \ / \ 2 \ --^- \ / \ \ 8 5 9 | | | cat dog bird \ / / xyzzy / \ / root

      If your tree is relatively shallow but broad, you should be able to save a considerable amount of space.

      Here's some code that builds the two trees and looks up some of the numeric keys to display the data. Let me know if it does the trick for you, I'd be interested in knowing how much memory it takes to hold your database.

      The taxonomy tree has parent links in it to let you get from the leaf to the root of the tree, and the traceback function will walk the tree back to the root for you.

      Update: I added a couple comments to the code to clarify it a little.

      ...roboticus

      When your only tool is a hammer, all problems look like your thumb.

        Hi. And thanks for taking the time.

        I think this is a nice approach, because like you say there is a lot of redundancy in the data.

        I'll try to have a look at this more later on, but as for now I went with the Lucy approach.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://1118125]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others imbibing at the Monastery: (8)
As of 2024-04-18 17:03 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found