If you can afford one complete read of the file (and the I/O it costs) before you start using it, you do not actually need a complete index. A sparse index that records the offset of every 10th (or 100th) line tells you approximately where to start reading, so you only have to skip a small number of unwanted lines before the one you are actually looking for. This can be a perfectly viable solution.
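A minimal sketch of that idea in Python (the function names and the granularity of 100 are mine, purely illustrative):

```python
def build_sparse_index(path, every=100):
    """One pass over the file, remembering the byte offset
    of every 100th line (0-based line numbers)."""
    index = {}                          # line number -> byte offset
    offset = 0
    with open(path, "rb") as f:
        for lineno, line in enumerate(f):
            if lineno % every == 0:
                index[lineno] = offset
            offset += len(line)
    return index

def get_line(path, index, target, every=100):
    """Seek to the nearest indexed line at or before `target`,
    then skip at most `every - 1` unwanted lines."""
    start = (target // every) * every
    with open(path, "rb") as f:
        f.seek(index[start])
        for _ in range(target - start):
            f.readline()
        return f.readline().decode()
```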
If each line carries its own line number (so that reading any line tells you unambiguously where you are), you can use a binary-search approach to find your line in about log2(n) reads without any index (see the sketch after the list). You should:
- Get the total size of the file
- Seek to roughly the middle of the file and read the first complete line starting after that point
- If its line number is smaller than the one you are looking for, repeat the process on the second half, seeking the midpoint between the middle of the file and the end;
- if it's larger, repeat on the first half, seeking the midpoint between the start of the file and the middle;
- if it matches, return the line and exit.
Take care that the loop terminates even when the line you are looking for does not exist in the file; otherwise you can loop indefinitely (for instance, stop once the search window no longer shrinks, or set a maximum number of steps before bailing out).
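Here is one way this could look in Python, a sketch assuming every line starts with its own ascending line number followed by a space (e.g. `10038 some payload`); the names and the `max_steps` cap are illustrative:

```python
import os

def find_line(path, target, max_steps=64):
    """Binary-search a file whose lines begin with their own
    ascending line number. Returns the line, or None if absent."""
    with open(path, "rb") as f:
        first = f.readline()                 # the loop below never reads the
        if first and int(first.split()[0]) == target:
            return first.decode()            # first line, so check it up front
        lo, hi = 0, os.path.getsize(path)
        for _ in range(max_steps):           # bail out instead of looping forever
            if lo >= hi:
                return None                  # window is empty: the line isn't there
            mid = (lo + hi) // 2
            f.seek(mid)
            f.readline()                     # skip the (probably partial) line
            line = f.readline()
            if not line:                     # ran past the end: try the first half
                hi = mid
                continue
            number = int(line.split()[0])
            if number == target:
                return line.decode()
            elif number < target:
                lo = f.tell()                # target lies further down the file
            else:
                hi = mid                     # target lies further up
        return None
```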
If the lines in the file are expected to be more or less even in length, you could also consider a prediction method to estimate where the target line sits (see the sketch after this list). It goes like this:
- We read a line in the middle of the file
- We now know its line number (say, 10038) and that it starts at offset 120046
- So we expect the average line to be about 120046 / 10038 ≈ 11.9 chars long
- If we are looking for line 9500, we start reading at offset 9500 * 11.9 ≈ 113050, and then we adjust our estimate based on what we actually find there
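A rough sketch of that, again assuming self-numbered lines; the backtrack margin of ten average lines, and bailing out on overshoot instead of retrying, are arbitrary choices of mine:

```python
import os

def predicted_find(path, target):
    """Sample one line near the middle to estimate the average line
    length, jump a little before the target's predicted offset,
    then scan forward to it."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        f.seek(size // 2)
        f.readline()                        # resync to the next line start
        offset = f.tell()
        number = int(f.readline().split()[0])
        avg = offset / number               # e.g. 120046 / 10038 ~ 11.9

        # Land slightly early on purpose, so the target is still ahead.
        guess = max(0, int(target * avg - 10 * avg))
        f.seek(guess)
        if guess:
            f.readline()                    # resync again
        for line in f:
            n = int(line.split()[0])
            if n == target:
                return line.decode()
            if n > target:                  # overshot: the estimate was off;
                return None                 # a real version would back up and retry
        return None
```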
Of course, all of these options have to be weighed against your actual use case.