What would be the best way (defining best as the ratio between efficiency / simplicity of implementation) of doing this?

Is making a DB with the file or writing C subroutines to access the file the only way of improving this task?

The binary-search approach explained elsewhere in the thread seems reasonable, although I would hesitate to go that route myself. If even loading an index swamps your machine, then we are talking about serious quantities of data.

On that criterion alone, I would load it into a relational database and be done with it. At least then you don't have to keep the file sorted, which is one less thing to worry about when adding new keys.

Without presuming to know what you need this for, I can't help thinking that if it were my baby, sooner or later people would ask me questions like "how many keys have been used between 215000 and 220000?" In that case the binary search algorithm doesn't buy you anything, but you get all that for free in a database.

Be that as it may, if you really don't want to go the DB route, you can always apply a divide-and-conquer strategy to your file: take the first two (or last, or three, or multi-level) digits of the key, and write the key/value pairs out to a corresponding filename.
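A minimal sketch of that splitting step, assuming the source file holds one tab-separated key/value pair per line and that buckets are named after the first two characters of the key. The names split_into_buckets and $bucket_dir, and the .txt suffix, are made up for illustration.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: split a tab-separated key/value file into one
# bucket file per key prefix (the first $digits characters of the key).
sub split_into_buckets {
    my ($source, $bucket_dir, $digits) = @_;
    $digits ||= 2;

    my %out;    # prefix => open filehandle, one per bucket
    open my $in, '<', $source or die "Cannot read $source: $!";
    while (my $line = <$in>) {
        chomp $line;
        my ($key, $val) = split /\t/, $line, 2;
        next unless defined $val;    # skip lines without a tab
        my $prefix = substr $key, 0, $digits;
        unless ($out{$prefix}) {
            my $path = File::Spec->catfile($bucket_dir, "$prefix.txt");
            open $out{$prefix}, '>>', $path or die "Cannot write $path: $!";
        }
        print { $out{$prefix} } "$key\t$val\n";
    }
    close $in;
    close $_ or die "Close failed: $!" for values %out;
    return scalar keys %out;    # number of bucket files written to
}
```

With two-digit prefixes this keeps at most 100 filehandles open at once; a longer prefix gives smaller buckets but may run into the per-process open file limit.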

When you need to look up a key, figure out what file it would be located in, then slurp the entire file and do a pattern match with

my ($val) = ($_ =~ /\b$key\t(\d+)\b/);

That is, don't even bother reading the file line by line; that's too slow. Again, this is a clumsy approach: if your lookups are all over the search space, you'll spend all your time sucking up files. Then you start thinking about a least-recently-used cache of files... nah, just put it in a database and be done with it.
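The whole lookup might be sketched like this, assuming bucket files named after the first two characters of the key with a .txt suffix, tab-separated lines, and numeric values. The name lookup_in_bucket is hypothetical. This variant anchors the match at line boundaries with /m and quotes the key with \Q...\E, rather than relying on \b.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: find the value for $key by slurping the single
# bucket file its prefix maps to and matching once against the whole text.
sub lookup_in_bucket {
    my ($bucket_dir, $key, $digits) = @_;
    $digits ||= 2;

    my $name = substr($key, 0, $digits) . '.txt';
    my $path = File::Spec->catfile($bucket_dir, $name);
    open my $in, '<', $path or return;    # no bucket file, so no such key
    my $text = do { local $/; <$in> };    # undef $/ reads the file in one go
    close $in;
    return unless defined $text;

    # /m lets ^ and $ match at every line; \Q...\E escapes the key.
    my ($val) = $text =~ /^\Q$key\E\t(\d+)$/m;
    return $val;    # undef when the key is absent
}
```

A call such as lookup_in_bucket('buckets', '215001') reads only the one bucket file for prefix 21, so the cost per lookup is bounded by the size of the largest bucket.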

• another intruder with the mooring in the heart of the Perl


In reply to Re: fast lookups in files by grinder
in thread fast lookups in files by citromatik
