in reply to Accessing files at certain line number

I see two possibilities for optimizing the speed of your program by reducing the number of file accesses it makes:

  1. Load your index into memory instead of reading it from disk every time. At the moment you make one seek call and one read call on the index file for every line you read; you can cut that down to a single read and a single unpack for the whole run, at the price of some memory. You also don't need to store the offset of every line, only the offset of every 1024th line or, if your program advances through the file sequentially anyway, only the offset at which you want to continue.
  2. Instead of seeking in your data file for every line, seek once to the start of the batch and then read 1024 lines from that point. This saves one more seek call per line read.
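
To make the two points concrete, here is a minimal sketch in Python. I don't know your actual index layout, so it assumes the index file is a flat array of little-endian unsigned 64-bit byte offsets, one per line; the function names `load_index`, `batch_starts` and `read_batch` are made up for illustration.

```python
import struct
from itertools import islice

BATCH = 1024  # lines per batch, as in the question


def load_index(index_path):
    """Read the whole index with one read and decode it with one unpack.

    Assumption: the index is a flat array of little-endian unsigned
    64-bit byte offsets, one entry per line of the data file.
    """
    with open(index_path, "rb") as f:
        raw = f.read()
    count = len(raw) // 8
    return struct.unpack(f"<{count}Q", raw[: count * 8])


def batch_starts(offsets, batch=BATCH):
    """Keep only the offset of every `batch`-th line (lines 0, 1024, ...)."""
    return offsets[::batch]


def read_batch(data_file, start_offset, batch=BATCH):
    """Seek once, then read up to `batch` consecutive lines.

    `data_file` must be opened in binary mode so that offsets are bytes.
    """
    data_file.seek(start_offset)
    return list(islice(data_file, batch))
```

With this, the per-line seek and read on the index disappear entirely, and the data file sees one seek per batch instead of one per line. If memory is tight, only the result of `batch_starts` needs to stay around.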

In addition to these two points, you might want to consider whether you actually need exactly 1024 lines per batch or whether "roughly" 1024 lines per batch is good enough. In the latter case you can simply read the first (say) 10_000 lines and use their average length to split the file into batches of roughly 1024 lines. Whenever the start of a batch lands in the middle of a line, move it back toward the beginning of the file until it sits on a line boundary, and do the same for the end of the batch. This saves you reading through the lines just to count them, but it may or may not be an overall speed gain, since you will need to read the whole file line by line at least once anyway.
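
The "roughly 1024 lines" idea could look something like the following Python sketch. The helper names `snap_to_line_start` and `approximate_batches` are hypothetical, and it assumes lines end in `\n` and that the first lines are representative of the rest of the file.

```python
import os


def snap_to_line_start(f, pos, chunk=4096):
    """Move `pos` backward to the first byte of the line containing it."""
    while pos > 0:
        lo = max(0, pos - chunk)
        f.seek(lo)
        block = f.read(pos - lo)
        nl = block.rfind(b"\n")
        if nl != -1:
            return lo + nl + 1  # byte right after the previous newline
        pos = lo  # no newline in this chunk, keep scanning backward
    return 0


def approximate_batches(path, lines_per_batch=1024, sample_lines=10_000):
    """Return (start, end) byte ranges holding roughly `lines_per_batch` lines."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        # Estimate the average line length from the first lines only.
        sampled_bytes = sampled = 0
        for line in f:
            sampled_bytes += len(line)
            sampled += 1
            if sampled >= sample_lines:
                break
        if sampled == 0:
            return []  # empty file
        step = max(1, round(sampled_bytes / sampled * lines_per_batch))

        starts = [0]
        guess = step
        while guess < size:
            start = snap_to_line_start(f, guess)
            if start > starts[-1]:  # skip guesses that fall in the same line
                starts.append(start)
            guess += step
    # The end of each batch is the start of the next one.
    return list(zip(starts, starts[1:] + [size]))
```

Each batch can then be processed by seeking to `start` and reading `end - start` bytes. Batches whose lines are much longer or shorter than the sampled average will hold noticeably fewer or more than 1024 lines, which is the price of skipping the counting pass.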