downer has asked for the wisdom of the Perl Monks concerning the following question:

I am doing some calculations based on the graph generated by hyperlinks on the web. I have one huge file (~500GB), in which the nth line corresponds to all nodes in the graph adjacent to the nth node; the file is a list of adjacency lists.

For my calculations I need to access lines in this file at random, since this data is obviously way too much for any more sophisticated in-memory data structure. The question is: what is the fastest way to access lines in such a huge file at random?

The way I see it, I have a few options:

  1. linear search
  2. try my luck with Tie::File
  3. find the longest line, pad each line so all lines are the same length, and use byte offsets to access lines with seek (or something)
  4. find number of bytes on each line, maintain auxiliary data structure (BerkeleyDB?) storing byte to line offset

Ordinarily I would try to do this in C++, but since discovering Perl I have a high level of faith in its capabilities. This is my first post; I welcome your advice!


Replies are listed 'Best First'.
Re: Help performing "random" access on very very large file
by almut (Canon) on Jul 16, 2007 at 14:38 UTC

    I would go through the file once in order to create an auxiliary data structure holding the byte offsets to the individual lines. As that data structure will probably still be too large to fit in memory (depending on the average line length), you'll want to store that in a file — but this time with fixed record size, so you don't have the same problem as with the original file...

    You can then access a random line like this: multiply the line number by the record size, seek to that position in the auxiliary file, read the byte offset found there, and finally seek to that position in the original file.
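
    For illustration, a minimal sketch of that lookup, assuming the auxiliary file has already been written with one 8-byte offset per line (pack 'Q', so a Perl built with 64-bit support) and that the filenames are just placeholders:

        use strict;
        use warnings;

        # hypothetical filenames, for illustration only
        open my $idx,  '<', 'graph.idx' or die "graph.idx: $!";
        open my $data, '<', 'graph.txt' or die "graph.txt: $!";
        binmode $idx;

        sub line_at {
            my ($lineno) = @_;                    # 0-based line number
            seek $idx, $lineno * 8, 0 or die $!;  # fixed 8-byte records
            read $idx, my $packed, 8 or die "short read";
            my $offset = unpack 'Q', $packed;     # byte offset of that line
            seek $data, $offset, 0 or die $!;
            return scalar readline $data;         # the adjacency list itself
        }

        print line_at(12_345);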

      I believe I could possibly do both: split my one big file into, say, 1000 smaller files, and then find the byte offsets and create an auxiliary data structure for each. That way I could also spread the files across several disks, allowing the random accesses to be shared. When I write some code, I'll post it up! One problem is just getting an exact line count for the big file; wc -l is taking forever! For now I've estimated the line count from the total file size and the byte counts of a small sample of lines.
        I'm not sure you really need to know an exact line count to do this... Just put the first 1024 lines into the first file, the next 1024 into the second, and so on, ending up with however many files that gives you, distributed round-robin across the disks as evenly as possible. (The number of lines per file is arbitrary, of course, but I'm guessing 1024 would give a manageable number of files, and powers of 2 have the nice property that $line_number >> 10 determines the file to use without requiring the CPU to do an actual division.)
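
        A minimal sketch of that arithmetic (the chunk filenames and the number of disks are made up for illustration):

            use strict;
            use warnings;

            my @disks = map { "/disk$_" } 0 .. 3;   # assumed mount points

            sub locate {
                my ($lineno) = @_;                           # 0-based line number
                my $file_no   = $lineno >> 10;               # int($lineno / 1024)
                my $line_in_f = $lineno & 0x3FF;             # $lineno % 1024
                my $disk      = $disks[ $file_no % @disks ]; # round-robin
                return ( "$disk/chunk$file_no.txt", $line_in_f );
            }

            my ($file, $line_within_file) = locate(5_000_000);
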
Re: Help performing "random" access on very very large file
by ikegami (Patriarch) on Jul 16, 2007 at 14:57 UTC

    An option you didn't list is: Create an index. ( Oops, I guess that's an implementation of option 4. ) Basically, that's what Tie::File does, but Tie::File builds the index in memory whereas you'll be building it in a file.

    There are two advantages: Tie::File would easily require >1GB of memory in your situation, and if the index resides on disk, you build it once and use it multiple times, as long as the data file doesn't change.

    Read through the huge file writing the starting locations of every line into another file (in fixed-width binary, such as pack('N2', $high, $low) or pack('Q', $addr)).

    Then, for each random line number, you can seek to the corresponding (8-byte-aligned) position in the index file to learn where to seek in the data file.

    Note: Perl IO is supposedly quite slow — was this fixed? — so you could write a tiny C program to create the index file since that's pure IO and just as simple to write in C as in Perl.

    Note: This would require your Perl, its seek and its tell to handle 64-bit numbers.
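
    A sketch of building such an index in Perl, assuming a Perl whose tell and pack 'Q' handle 64-bit values (the filenames are placeholders); the tiny C program mentioned above would do the same thing:

        use strict;
        use warnings;

        open my $in,  '<', 'graph.txt' or die $!;
        open my $out, '>', 'graph.idx' or die $!;
        binmode $out;

        while (1) {
            my $pos = tell $in;                # start offset of the next line
            defined( my $line = <$in> ) or last;
            print $out pack 'Q', $pos;         # one fixed 8-byte record per line
        }
        close $out or die $!;

    A lookup then seeks to $line_number * 8 in the index file, unpacks the offset, and seeks there in the data file.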

Re: Help performing "random" access on very very large file
by rpanman (Scribe) on Jul 16, 2007 at 14:28 UTC
    How about splitting the file down into a number of smaller files, each with a set number of lines in it? You could end up with 1000 files of 500 lines each (for example). Then you just need a mechanism for breaking a line number (in the 500GB file) into a filename and a line number within that file; quicker access all round.

    The best way to eat an elephant is one bit at a time...
Re: Help performing "random" access on very very large file
by ctilmes (Vicar) on Jul 16, 2007 at 16:26 UTC
      They seem to have some good advice for indexing. I could probably just index every ~50 lines instead of every line, and then load that block into memory.

        One very effective optimisation for reducing the size of the index is to store just lengths, rather than full absolute offsets, for most of the lines.

        Say, for example, that you know all the lines in your 500GB file are no more than 255 characters long and, for calculation purposes, average 100 characters. If you build a full, absolute index using the pack 'd' template (required for offsets beyond 4GB), your index itself is going to be 500GB / 100 * 8 ~= 40GB!

        However, if you store the absolute index of every (say) 1000th line, and 999 x 1-byte lengths for the intervening lines, the index size becomes: 500 GB / (100*1000) * 1007 ~= 5GB.

        Each record within your index is then a fixed 1007 bytes, and the process of locating any individual line becomes:

          my( $recNo, $offset ) = ( int( $lineNo / 1000 ), ( $lineNo % 1000 ) - 1 );
          seek INDEX, $recNo * 1007, 0;           # each index record is 1007 bytes
          read INDEX, $idxBuf, 1007;
          my $absOffset = unpack 'd', $idxBuf;    # absolute offset of the block's first line
          my( $addOffset, $length ) = unpack "x[d] %32C$offset C", $idxBuf;
          seek BIGFILE, $absOffset + $addOffset, 0;
          read BIGFILE, $line, $length;

        By using the 'checksum' unpack template (%n-bits) to perform the summing of the intermediate lengths, the calculation of the composite offset is really very fast. This puts any line of your huge file just two reads away.

        Note: There is nothing special about the number 1000 I've used in this example except that it is convenient for calculations. As we are calculating a 32-bit offset (%32C), and representing lengths with 8-bit unsigned chars, the theoretical maximum absolute-offset spacing is 2**32 / 2**8 == 2**24 == 16M lines.

        A quick test suggests that it takes Perl ~ 60 milliseconds to calculate the sum of 16 MB chars. That's far less time than having to read even 1 extra data record.

        However, there is little extra saving to be had by further decreasing the frequency of the absolute offsets.

        • Using an absolute offset every 1000 lines gives an index size of: 5.035 GB

          The checksum time for 999 bytes: 112 microseconds.

        • Using an absolute offset every 64k lines gives an index size of: 5.00053405761719 GB

          The checksum time for 64k bytes: 320 microseconds.

        • Using an absolute offset every 16M lines gives an index size of: 5.00000208616257 GB

          The checksum time for 16M bytes: 60 milliseconds.

        So, you see there is little extra index space saving by decreasing the frequency of absolute offsets, but there is a considerable calculation penalty in doing so. You also have to factor in the extra time taken to read the larger index records from the index file.
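
        For completeness, a hedged sketch of how the 1007-byte index records described above might be built: one record per 1000-line block, holding the 8-byte 'd' offset of the block's first line followed by the 1-byte lengths (newline included) of its first 999 lines. The filenames are made up, lines are assumed to fit in 255 bytes, and line numbers are taken as 1-based to match the lookup arithmetic above (which, as written, doesn't cover the last line of each block):

            use strict;
            use warnings;

            open my $big, '<', 'big.txt' or die $!;
            open my $idx, '>', 'big.idx' or die $!;
            binmode $idx;

            my @len;            # lengths (newline included) of lines in the current block
            my $blockStart = 0; # byte offset of the current block's first line

            while (1) {
                my $pos  = tell $big;
                my $line = <$big>;
                if ( !defined $line or @len == 1000 ) {
                    if (@len) {
                        push @len, (0) x ( 1000 - @len );   # pad a short final block
                        print $idx pack 'd C999', $blockStart, @len[ 0 .. 998 ];
                    }
                    last unless defined $line;
                    @len        = ();
                    $blockStart = $pos;
                }
                push @len, length $line;
            }
            close $idx or die $!;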


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Help performing "random" access on very very large file
by dave_the_m (Monsignor) on Jul 16, 2007 at 14:23 UTC
    Well, some obvious questions are: how many times will you use the file (i.e. will you reuse the same file in later runs)? For each use, what percentage of the lines will you need to randomly seek to? How many times will you access each line?

    Dave.

      I will be making a lot of accesses into the file (~5 million?). I know that some lines will be accessed more often than others, but as far as I can tell now the accesses will be, at least initially, uniform. Lines will be looked up more than once, so some smart caching would be beneficial, but until I start making the calculations it's hard to tell which lines will be accessed more. I do have a few big disks, so I can make a few copies of the file and parallelize the task.
Re: Help performing "random" access on very very large file
by dsheroh (Monsignor) on Jul 16, 2007 at 14:50 UTC
    Of the options you listed, padding all lines to a fixed length or making an index of offsets would almost certainly be the fastest. Since you said you have plenty of disk to burn, padding would probably be the faster of the two: you'd only do one seek per line, versus a seek to find the offset in your index file/db and then a second seek in the actual data file. It comes at the cost of increased preprocessing time, though, since you'd need two passes over the data file (one to find the longest line, one to pad and rewrite every line), and the second pass rewrites the whole thing. It sounds like you'll be using the file enough for the gains in lookup time to outweigh the extra preprocessing time.
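
    A minimal sketch of the padded-file lookup, assuming every line has already been space-padded to a common record length (the record length and filename are invented for illustration):

        use strict;
        use warnings;

        my $RECLEN = 1024;                      # longest line + newline, by assumption
        open my $fh, '<', 'padded.txt' or die $!;

        sub line_at {
            my ($lineno) = @_;                  # 0-based line number
            seek $fh, $lineno * $RECLEN, 0 or die $!;
            read $fh, my $rec, $RECLEN or die "short read";
            $rec =~ s/ *\n?\z//;                # strip the padding and newline
            return $rec;
        }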

    And then there's the unlisted option 5: Stuff it into an actual database. They're designed to look up arbitrary records quickly, although that again gets into the question of whether you'll be reusing the same dataset often enough to be worth the time taken to insert and index all the data.

Re: Help performing "random" access on very very large file
by sgt (Deacon) on Jul 16, 2007 at 15:55 UTC

    An interesting question. Some slightly different ideas from the excellent ones already given.

    Depending on the total number of accesses relative to the total number of lines (Dave's question, which you didn't really answer), I would build either a full index (ikegami's idea) or a shallow index.

    With 500GB and an average line of, say, 1000 bytes, you still get a huge number of lines: 500M (5*10^8). A full index at 8 bytes per offset (you need more than 32 bits to address positions in a 500GB file) would be a 500M * 8 bytes = 4GB file. A shallow index with, say, one entry every 50 lines would occupy only 4GB / 50 = 80MB, a reasonable size. To go to line n you would seek to position int(n / 50) * 8 in the index, read the offset, seek there in the big file, and then skip forward past (n mod 50) EOL markers (which is essentially the Tie::File logic). A shallow index is interesting when the number of accesses is much smaller than the total number of lines in the main file.
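
    A sketch of that shallow lookup, assuming one 8-byte offset per 50 lines and a 64-bit Perl (the filenames are invented):

        use strict;
        use warnings;

        open my $idx,  '<', 'graph.sidx' or die $!;
        open my $data, '<', 'graph.txt'  or die $!;
        binmode $idx;

        sub line_at {
            my ($n) = @_;                           # 0-based line number
            my ( $entry, $skip ) = ( int( $n / 50 ), $n % 50 );
            seek $idx, $entry * 8, 0 or die $!;
            read $idx, my $buf, 8 or die "short read";
            seek $data, unpack( 'Q', $buf ), 0 or die $!;
            readline $data for 1 .. $skip;          # walk past the intervening EOLs
            return scalar readline $data;
        }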

    One other idea is to have a couple of processes (or more). One would be, say, a daemon listening on a given port, in charge of resolving actual offsets from the shallow index file, handling the randomness, and handing back a few lines. A simple protocol could be: send a line number and receive the lines, or send a count of lines and receive them. If you can arrange to have the same big data file on different partitions with different disk controllers, you could afford one such process per disk. The second (and main) process would be in charge of the analysis only. You could also implement recording of a session this way, and round-robin caching could be a nice optimization.

    cheers --stephan
Re: Help performing "random" access on very very large file
by shmem (Chancellor) on Jul 16, 2007 at 14:43 UTC
    I guess you don't have enough memory for using Tie::File on a 500GB file, since it slurps the entire file... uhm well, I have to acknowledge being misinformed here... thanks ikegami and blue_cowdawg.

    I'd indeed go with 4): BerkeleyDB (DB_BTREE), with the line number as key and the byte offset as value. That would allow you to do delta seeks back and forth. Splitting the one big file into a reasonable number of smaller chunks (depending on the number of file descriptors available) might be helpful, as rpanman already noted.
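
    A hedged sketch of that idea using DB_File's B-tree interface to Berkeley DB (the filenames are made up; the build is a one-off pass over the big file):

        use strict;
        use warnings;
        use Fcntl;
        use DB_File;

        # line number => byte offset, kept in a B-tree on disk
        tie my %offset, 'DB_File', 'offsets.db', O_RDWR | O_CREAT, 0666, $DB_BTREE
            or die "Cannot tie offsets.db: $!";

        # build once...
        open my $in, '<', 'graph.txt' or die $!;
        my $n = 0;
        while (1) {
            my $pos = tell $in;
            defined( my $line = <$in> ) or last;
            $offset{ $n++ } = $pos;
        }

        # ...then seek straight to any line later
        open my $data, '<', 'graph.txt' or die $!;
        seek $data, $offset{12_345}, 0 or die $!;
        my $adjacency = <$data>;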

    update: obvious correction

    --shmem

    _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                  /\_¯/(q    /
    ----------------------------  \__(m.====·.(_("always off the crowd"))."·
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
          I guess you don't have enough memory for using Tie::File on a 500GB file, since it slurps the entire file.

      Uhhhhmmmmm... no it don't! From the Tie::File man page:

          The file is not loaded into memory, so this will work even for gigantic files.
      I even remember looking at the source code for this module once to validate that statement. I used to think that the module loaded the whole file into memory but it doesn't.

      However, there is a lot of overhead associated with Tie::File for certain operations. For instance, I had some code once that looked something like:

      # considering @ry was the tie'd array:
      my $last_line = $#ry + 1;
      and found that if I visited that line multiple times it slowed my code way down. Just FWIW...


      Peter L. Berghold -- Unix Professional
      Peter -at- Berghold -dot- Net; AOL IM redcowdawg Yahoo IM: blue_cowdawg

      I guess you don't have enough memory for using Tie::File on a 500GB file, since it slurps the entire file.

      If you're only reading from the file (as seems to be the case here), Tie::File doesn't slurp the file. (I don't know how it handles writes.) It does keep a cache of lines in memory, but the size of the cache is configurable.

      What it does do is keep in memory the offset of every line up to the last one accessed. scalar(@tied) and $#tied count as accessing the last line, so if you do something like $tied[ rand @tied ], Tie::File will read through the entire file to build an (in-memory) index of every line in the file. That index could easily take many GBs.
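
      For reference, the cache size is set through Tie::File's memory option when tying (the numbers here are arbitrary):

          use strict;
          use warnings;
          use Fcntl 'O_RDONLY';
          use Tie::File;

          tie my @lines, 'Tie::File', 'graph.txt',
              mode   => O_RDONLY,
              memory => 20_000_000      # cache at most ~20MB of lines
              or die "Cannot tie graph.txt: $!";

          my $adj = $lines[42];   # accessing index 42 builds the offset table
                                  # for lines 0..42 only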

Re: Help performing "random" access on very very large file
by Moron (Curate) on Jul 16, 2007 at 15:48 UTC
    Where the information exceeds what can be stored in memory, the most usual solution is some kind of DBMS. Relational databases are only semi-feasible for this kind of data structure, in the sense that you could store, say, 100,000 lines in a BLOB and index those; an approach I have heard talked about but never seen tried.

    The most common custom DBMS structure is indeed flat-file-based as has been mentioned.

    But the most useful Perl implementations of non-relational database architecture are probably going to be the DBM family of modules. In particular, DBM::Deep (which has an underlying flat-file storage) looks perfect for your requirement.
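
    A minimal sketch of that, using line numbers as keys (the filenames are placeholders, and the load is a one-off pass over the big file):

        use strict;
        use warnings;
        use DBM::Deep;

        my $db = DBM::Deep->new( 'adjacency.db' );   # flat-file backed hash on disk

        # one-off load: key = line number, value = that node's adjacency list
        open my $in, '<', 'graph.txt' or die $!;
        my $n = 0;
        while ( my $line = <$in> ) {
            chomp $line;
            $db->{ $n++ } = $line;
        }

        # random access later
        my $neighbours = $db->{ 12_345_678 };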

    __________________________________________________________________________________

    ^M Free your mind!
