An interesting question. Here are some ideas that differ slightly from the excellent ones already given.

Depending on the total number of accesses relative to the total number of lines (Dave's question, which you didn't really answer), I would build either a full index (ikegami's idea) or a shallow index.

With 500 GB and an average line of, say, 1000 bytes, you still get a huge number of lines: 500M (5*10^8). A full index would then be 500M * 4 bytes = 2 GB. A shallow index with, say, one entry every 50 lines would occupy only 2*10^9/50 = 4*10^7 bytes, i.e. 40 MB, a reasonable size. (Note that a 4-byte offset cannot address positions past 4 GB, so for a 500 GB file the entries would really need 8 bytes, which doubles both figures.) To go to line n you would seek to position int(n / 50) * 4 in the index, read the offset, seek to that offset in the data file, and then skip k = n % 50 EOL markers (which is essentially the Tie::File logic). A shallow index is worthwhile when the number of accesses is much smaller than the total number of lines in the main file.
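The shallow-index lookup can be sketched roughly as follows. This is a minimal Python illustration, not a tuned implementation: the function names `build_shallow_index` and `read_line` are made up for the example, lines are assumed to end in "\n", line numbers are 0-based, and the entries are 8-byte offsets rather than the 4 bytes used in the estimate, since a 4-byte offset cannot address a 500 GB file.

```python
import struct

STRIDE = 50                   # one index entry every 50 lines, as in the estimate
ENTRY = struct.Struct("<Q")   # assumption: 8-byte little-endian unsigned offset


def build_shallow_index(data_path, index_path, stride=STRIDE):
    """Write the byte offset of lines 0, stride, 2*stride, ... to index_path."""
    with open(data_path, "rb") as data, open(index_path, "wb") as index:
        offset = 0
        for lineno, line in enumerate(data):
            if lineno % stride == 0:
                index.write(ENTRY.pack(offset))
            offset += len(line)


def read_line(data_path, index_path, n, stride=STRIDE):
    """Return line n (0-based) as bytes, or None if n is past the end."""
    # Step 1: fetch the offset of the nearest indexed line at or before n.
    with open(index_path, "rb") as index:
        index.seek((n // stride) * ENTRY.size)
        raw = index.read(ENTRY.size)
    if len(raw) < ENTRY.size:
        return None
    (offset,) = ENTRY.unpack(raw)

    # Step 2: jump there and skip the remaining n % stride lines.
    with open(data_path, "rb") as data:
        data.seek(offset)
        for _ in range(n % stride):
            if not data.readline():
                return None
        line = data.readline()
        return line if line else None
```

Each lookup costs one small read in the index plus, on average, about 25 sequential line reads in the data file, which is the trade-off against the larger full index.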

Another idea is to use two (or more) processes. One would be, say, a daemon listening on a given port, in charge of computing the actual position from the shallow index file, handling the randomness, and eventually returning a few lines. A simple protocol could be: send an index and receive lines, or send a number of lines and receive that many. If you can arrange to have the same big data file on different partitions with different disk controllers, you could afford one process per disk, say. The second (and main) process would be in charge of the analysis only. You could also implement session recording this way, and round-robin caching could be a nice optimization.
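To make the daemon idea a bit more concrete, here is a rough Python sketch of such a line server. Everything in it is an assumption made for illustration: the request verbs `LINE n` and `RANDOM k`, the names `LineStore`, `handle_request` and `serve`, and the port number. For brevity `LineStore` keeps a full in-memory list of offsets; for a really big file a lookup through the shallow index file would take its place.

```python
import random
import socketserver


class LineStore:
    """Line offsets for one data file (full in-memory list, for brevity)."""

    def __init__(self, path):
        self.path = path
        self.offsets = []
        offset = 0
        with open(path, "rb") as f:
            for line in f:
                self.offsets.append(offset)
                offset += len(line)

    def fetch(self, n):
        with open(self.path, "rb") as f:
            f.seek(self.offsets[n])
            return f.readline()


def handle_request(store, request, rng=random):
    """Answer one request line and return the bytes to send back.

    Hypothetical protocol: b"LINE n" returns line n (0-based),
    b"RANDOM k" returns k lines chosen at random by the server.
    """
    parts = request.split()
    if len(parts) != 2 or not parts[1].isdigit():
        return b"ERR bad request\n"
    verb, arg = parts[0].upper(), int(parts[1])
    if verb == b"LINE":
        if arg >= len(store.offsets):
            return b"ERR out of range\n"
        return store.fetch(arg)
    if verb == b"RANDOM":
        if not store.offsets:
            return b"ERR empty file\n"
        picks = [rng.randrange(len(store.offsets)) for _ in range(arg)]
        return b"".join(store.fetch(n) for n in picks)
    return b"ERR unknown command\n"


class Handler(socketserver.StreamRequestHandler):
    def handle(self):
        # One request per line; the connection stays open until the client closes it.
        for request in self.rfile:
            self.wfile.write(handle_request(self.server.store, request))


def serve(path, host="127.0.0.1", port=9999):
    """Run the daemon; the analysis process connects as an ordinary TCP client."""
    with socketserver.TCPServer((host, port), Handler) as server:
        server.store = LineStore(path)
        server.serve_forever()
```

Keeping the protocol logic in `handle_request`, separate from the socket code, means the same function could be run in one process per disk, and recording a session amounts to logging the requests it receives.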

cheers --stephan

In reply to Re: Help performing "random" access on very very large file by sgt
in thread Help performing "random" access on very very large file by downer
