in reply to Help performing "random" access on very very large file

I would go through the file once in order to create an auxiliary data structure holding the byte offsets to the individual lines. As that data structure will probably still be too large to fit in memory (depending on the average line length), you'll want to store that in a file — but this time with fixed record size, so you don't have the same problem as with the original file...
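
A minimal sketch of that indexing pass (the file names data.txt and data.idx are made up, and pack 'Q' assumes a 64-bit Perl):

    use strict;
    use warnings;

    my $data_file  = 'data.txt';   # the huge line-oriented file
    my $index_file = 'data.idx';   # fixed-size records: one byte offset per line

    open my $in,  '<', $data_file  or die "Can't read $data_file: $!";
    open my $idx, '>', $index_file or die "Can't write $index_file: $!";
    binmode $in;
    binmode $idx;

    # Before reading each line, record where it starts as an 8-byte record.
    while (1) {
        my $offset = tell $in;
        defined(my $line = <$in>) or last;
        print {$idx} pack 'Q', $offset;
    }

    close $idx or die "close: $!";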

You can then access a random line like this: multiply the line number by the record size, seek to that position in the auxiliary file, read the byte offset stored there, and finally seek to that position in the original file.
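
And the lookup side, under the same assumptions (0-based line numbers, 8 bytes per index record):

    use strict;
    use warnings;

    my $data_file   = 'data.txt';
    my $index_file  = 'data.idx';
    my $line_number = shift @ARGV;   # 0-based
    my $rec_size    = 8;             # one packed 'Q' per line

    # Step 1: line number * record size gives the position of the offset
    # in the auxiliary file.
    open my $idx, '<', $index_file or die "Can't read $index_file: $!";
    binmode $idx;
    seek $idx, $line_number * $rec_size, 0 or die "seek: $!";
    read $idx, my $packed, $rec_size
        or die "No index record for line $line_number\n";
    my $offset = unpack 'Q', $packed;

    # Step 2: seek to that byte offset in the original file and read the line.
    open my $in, '<', $data_file or die "Can't read $data_file: $!";
    binmode $in;
    seek $in, $offset, 0 or die "seek: $!";
    defined(my $line = <$in>) or die "Offset $offset is past end of file\n";
    print $line;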


Re^2: Help performing "random" access on very very large file
by downer (Monk) on Jul 16, 2007 at 14:44 UTC
    I believe I could possibly do both: split my one big file into, say, 1000 smaller files, and then find the byte offsets and build an index for each. That way I could also spread the files across several disks, so the random accesses are shared between them. When I write some code, I'll post it up! One problem is just getting an exact line count for the big file; wc -l is taking forever! For now I just looked at the size of the file and the number of bytes in a small sample of it to get an estimate of the line count.
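
    A rough sketch of that estimate (the file name and the sample size are made up):

        use strict;
        use warnings;

        my $data_file = 'data.txt';
        my $sample    = 10_000;               # lines to sample from the front

        open my $in, '<', $data_file or die "Can't read $data_file: $!";
        my $n = 0;
        while (defined(my $line = <$in>)) {
            last if ++$n >= $sample;
        }
        die "file appears to be empty\n" unless $n;

        my $avg = tell($in) / $n;             # average bytes per line in the sample
        printf "roughly %.0f lines (%.1f bytes/line average)\n",
            (-s $data_file) / $avg, $avg;
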
      I'm not sure you really need an exact line count for this... Just put the first 1024 lines into the first file, the next 1024 into the second file, and so on, ending up with however many files that turns out to be, and use the disks round-robin to distribute them as evenly as possible. (The number of lines per file is arbitrary, of course, but I'm guessing 1024 would give a manageable number of files, and powers of 2 have the nice property of letting you just do $line_number >> 10 to determine which file to use instead of requiring the CPU to do an actual division.)
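
      A minimal sketch of that arithmetic (the chunk-file naming and the four-disk layout are just assumptions for illustration):

          # 1024 lines per chunk file, chunks spread round-robin over 4 disks
          my $line_number   = 1_234_567;                # 0-based
          my $chunk         = $line_number >> 10;       # int($line_number / 1024)
          my $line_in_chunk = $line_number & 1023;      # $line_number % 1024
          my $disk          = $chunk % 4;               # which disk the chunk went to
          my $path          = "/disk$disk/chunk.$chunk";
          printf "line %d is line %d of %s\n", $line_number, $line_in_chunk, $path;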