I see two possibilities for optimizing the speed of your program by reducing the number of file accesses it makes:

  1. Load your index into memory instead of reading it from disk every time. At the moment you do one seek call and one read call on your index file for every line you read; you can reduce that to a single read and a single unpack overall, at the price of some memory. Also, you don't really need to store the offset of every line, only the offset of every 1024th line, or, if your program advances through the file sequentially anyway, only the offset at which you want to continue.
  2. Instead of seeking in your data file for every line, seek once to your start point and then read the 1024 lines from there. This saves you another call to seek for every line read.
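
Both points can be sketched roughly as follows. This is only an illustration, not your actual code: I am assuming the index file contains one packed 32-bit big-endian offset (the "N" template) per line of the data file, and the names load_index and read_batch are made up for this example. Adjust the unpack template to whatever your index really contains.

```perl
use strict;
use warnings;

# Assumption: the index holds one 'N' (32-bit big-endian) offset per data line.
# Reads the whole index with a single read and decodes it with a single unpack.
sub load_index {
    my ($index_file) = @_;
    open my $fh, '<:raw', $index_file
        or die "Can't open '$index_file': $!";
    local $/;                          # slurp mode
    my $packed = <$fh>;
    close $fh;
    return [ unpack 'N*', $packed ];   # array ref of byte offsets
}

# Seeks once to the first wanted line, then reads sequentially.
# $first_line is zero-based; returns fewer lines if the file ends early.
sub read_batch {
    my ($data_fh, $offsets, $first_line, $count) = @_;
    return () if $first_line > $#$offsets;
    seek $data_fh, $offsets->[$first_line], 0
        or die "Can't seek: $!";
    my @lines;
    while (@lines < $count) {
        my $line = <$data_fh>;
        last unless defined $line;
        push @lines, $line;
    }
    return @lines;
}
```

You would call load_index once at startup and then read_batch($data_fh, $offsets, $n * 1024, 1024) for the n-th batch. If you only ever ask for batch boundaries, keeping every 1024th entry of the offsets array is enough.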

In addition to these two points, you might want to consider whether you actually need exactly 1024 lines per batch or whether "roughly" 1024 lines per batch is OK. In the latter case you can simply read the first (say) 10_000 lines and use their average length to split the file into batches of roughly 1024 lines each. Whenever the computed start of a batch lands in the middle of a line, you move it back towards the beginning of the file until it sits on a line start, and you do the same with the end position of the batch. This saves you from reading through the lines just to count them, but it might or might not be an overall speed gain, since you will need to read the whole file line by line at least once anyway.
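
A minimal sketch of the approximate-batch idea, under a few assumptions of mine: lines end in "\n", the file handle is opened in raw mode so offsets are byte offsets, and the helper names (average_line_length, align_to_line_start, batch_bounds) are hypothetical. A batch can come out empty if a single line is longer than the estimated batch size.

```perl
use strict;
use warnings;

# Average length in bytes of the first $sample lines (0 for an empty file).
sub average_line_length {
    my ($fh, $sample) = @_;
    seek $fh, 0, 0 or die "Can't seek: $!";
    my ($bytes, $lines) = (0, 0);
    while ($lines < $sample) {
        my $line = <$fh>;
        last unless defined $line;
        $bytes += length $line;
        $lines++;
    }
    return $lines ? $bytes / $lines : 0;
}

# Largest line start that is <= $pos, found by scanning backwards in chunks.
sub align_to_line_start {
    my ($fh, $pos) = @_;
    my $chunk = 4096;
    while ($pos > 0) {
        my $from = $pos > $chunk ? $pos - $chunk : 0;
        seek $fh, $from, 0 or die "Can't seek: $!";
        defined(read $fh, my $buf, $pos - $from) or die "Can't read: $!";
        my $nl = rindex $buf, "\n";
        return $from + $nl + 1 if $nl >= 0;   # byte after the last newline
        $pos = $from;                         # no newline yet, look further back
    }
    return 0;
}

# Byte range [start, end) of batch number $batch_no (zero-based),
# or an empty list once the batch would start past the end of the file.
sub batch_bounds {
    my ($fh, $batch_no, $lines_per_batch, $avg) = @_;
    my $size = -s $fh;
    my $bytes_per_batch = int($avg * $lines_per_batch) || 1;
    my $start = $batch_no * $bytes_per_batch;
    return if $start >= $size;
    my $end = $start + $bytes_per_batch;
    $start = align_to_line_start($fh, $start);
    $end   = $end >= $size ? $size : align_to_line_start($fh, $end);
    return ($start, $end);
}
```

Because start and end are aligned with the same rule, the end of one batch is the start of the next, so the batches neither overlap nor leave gaps. To process a batch you seek to its start and read lines until tell reaches its end.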


In reply to Re: Accessing files at certain line number by Corion
in thread Accessing files at certain line number by Utrecht
