two things I would try:

1) compression. If you have any experience with bit coding, try to shrink things down a bit. I guess the best way to do this (the only way i've tried) is with inline C. This is very often a way to make data sets much more manageable, though it requires some low level tinkering. Many simple compression algorithms like variable-byte and rice coding are pretty easy to implement (if I can do it, then anyone can!).

2) I dont think binary sort is a good solution here. True there are O(log(n)) lookups which seems good, but if each of these is a disk read, then this is very slow, especially if the distribution of these look ups is very slow and can't benefit from caching. Rather, I would do the following: before your operation, go through the list, using tell every so often, say, after the leading digit in the first column changes. Some way that you have a manageable number of rows. Now create a hash that stores the starting # of bytes offset into the file, and the number of bytes in length in a hash, you can use storable to save this. When looking up a key-value pair, query this hash, then slurp up the associated chunk of the file. (only one disk op) Since this chunk is still sorted you can use binary search, or since it's in memory, a linear scan may be fine too.

In reply to Re: fast lookups in files by downer
in thread fast lookups in files by citromatik

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.