Re: Fast way to read from file
by Roger (Parson) on Nov 21, 2003 at 12:14 UTC
There is a fast way to seek to the desired line and read a given number of lines. It involves a pre-step. The basic idea is as follows:
Pre-step (once-only initialization)
Build an array of line offsets, @offset, as you loop through the file on the first pass, so that $offset[120] is the byte offset of the 121st line in the file (zero-based array index).
my $curpos = 0;
my @offset;
$offset[0] = 0;               # the 1st line always starts at position 0
while (<$fh>) {
    $offset[++$curpos] = tell $fh;
}
Fast-read (repeatedly called)
Seek to the byte offset of the desired start line with a single seek -
seek $fh, $offset[$cur], 0;
- then read the next $row - $cur + 1 lines, as in the sketch below.
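Putting the two steps together, a sketch of the fast-read step, assuming the @offset array from the pre-step and zero-based line numbers $cur .. $row (read_lines is just a made-up name):
sub read_lines {
    my ($fh, $cur, $row) = @_;
    seek $fh, $offset[$cur], 0;                 # one seek to the start line
    return map { scalar <$fh> } $cur .. $row;   # $row - $cur + 1 readlines
}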
Yep, I considered that, but I was worried about the index size on huge files. To be fast I'd need to record the offset from the beginning for every line, and that might grow pretty big. Although I don't know how much memory it takes to store large values (e.g. a million-cell array with large values). What I'm hoping for here is a way to "seek newlines" :).
I think if you want the fastest seek-to-line on the fly without massive memory overhead, you could go the XS approach. Write a C function somewhat like this -
#include <stdio.h>

/* Seek the stream to the start of line lineno (1-based). */
int seek_line(FILE *f, long lineno)
{
    char buffer[65536];                 /* 64k buffer */
    long found = 1;                     /* lines seen so far */
    size_t n, i;

    fseek(f, 0, SEEK_SET);
    if (lineno <= 1) return 0;

    /* read in 64k of text at a time and count the \n's;
       if not enough lines yet, read the next 64k block and repeat */
    while ((n = fread(buffer, 1, sizeof buffer, f)) > 0) {
        for (i = 0; i < n; i++) {
            if (buffer[i] == '\n' && ++found == lineno) {
                /* found the desired line: fseek to its absolute
                   position in the file and return */
                fseek(f, (long)(i + 1) - (long)n, SEEK_CUR);
                return 0;
            }
        }
    }
    return -1;                          /* not enough lines in the file */
}
Expose this function in Perl, so that you can do
seek_line($fh, $line_no);
# and then read as normal
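If XS is a step too far, a sketch of the same block-scan in pure Perl, using read and index (seek_line and the 64k buffer size are just illustrative):
sub seek_line {
    my ($fh, $lineno) = @_;              # $lineno is 1-based
    seek $fh, 0, 0;
    return 1 if $lineno <= 1;
    my $seen = 1;
    while (read $fh, my $buf, 65536) {   # 64k of text at a time
        my $pos = -1;
        while (($pos = index $buf, "\n", $pos + 1) >= 0) {
            if (++$seen == $lineno) {
                # seek back to just past this newline and return
                seek $fh, $pos + 1 - length($buf), 1;
                return 1;
            }
        }
    }
    return 0;                            # fewer than $lineno lines
}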
If you can spend the time (and the I/O) for a complete read before you start to use the file, you do not actually need to create a complete index. You could have an index covering every 10th (or 100th) line, so that you know approximately where you are reading and only have to read a small number of useless lines before the one you are actually looking for. This might be a viable solution.
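A sketch of that sparse-index idea, assuming a granularity of 100 lines, zero-based line numbers, and made-up names:
my $N = 100;
my @sparse = (0);                     # offset of line 0
while (<$fh>) {
    push @sparse, tell $fh if $. % $N == 0;
}

sub read_line_sparse {
    my ($fh, $line) = @_;
    seek $fh, $sparse[ int($line / $N) ], 0;   # jump to the nearest anchor
    <$fh> for 1 .. $line % $N;                 # skip at most N-1 useless lines
    return scalar <$fh>;
}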
If each of your lines carries a line number that can be read unambiguously, you could use a binary partition method to find your line in about log2(n) passes without using an index. You should:
- Get the total size of the file
- Read the first complete line that starts after the midpoint of the file
- If its number is above the line you are looking for, repeat the process on the first half of the file;
- if it is below, repeat on the second half;
- if it is the one, return it and exit
You have to take care that if the line you are looking for does not exist in the file, you do not loop indefinitely (maybe set a maximum number of steps before bailing out).
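A sketch of that binary partition, assuming each line starts with its own ascending number, e.g. "10038 some text" (find_line is a made-up name):
sub find_line {
    my ($fh, $want) = @_;
    my ($lo, $hi) = (0, -s $fh);
    my $steps = 64;                        # bail out rather than loop forever
    while ($lo < $hi && $steps-- > 0) {
        my $mid = int( ($lo + $hi) / 2 );
        seek $fh, $mid, 0;
        <$fh> if $mid > 0;                 # discard the partial line
        my $line = <$fh>;
        if (!defined $line) {              # ran off the end; look lower
            $hi = $mid;
            next;
        }
        my ($num) = $line =~ /^(\d+)/ or return undef;
        return $line if $num == $want;
        if ($num < $want) { $lo = $mid + 1 }
        else              { $hi = $mid     }
    }
    return undef;                          # not found within the step limit
}
Note the usual bisection caveat: a probe that lands exactly on the start of the wanted line discards it as a "partial" line, so a real implementation needs a little more care around the boundaries.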
If the length of the lines within the file is expected to be more or less even, you could also think about a prediction method to get to the expected point. It goes like this:
- We read a line in the middle of the file
- We now know its line number (say, 10038) and that it is located at offset 120046
- So we expect the average line to be about 11.9 chars long
- If we are looking for line 9500, we start seeking at offset 9500 * 11.9 = 113050, and then we adjust our estimate based on what we actually find
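In code, the prediction step might look like this rough sketch (guess_offset is a made-up helper that just makes the first jump; the caller refines from there; lines are assumed to start with their own number):
sub guess_offset {
    my ($fh, $want) = @_;
    seek $fh, int( (-s $fh) / 2 ), 0;    # jump to the middle of the file
    <$fh>;                               # discard the partial line
    my $pos   = tell $fh;                # start of a complete line
    my $probe = <$fh>;
    my ($num) = ($probe || '') =~ /^(\d+)/ or return 0;
    my $avg = $pos / $num;               # e.g. 120046 / 10038, about 11.9
    return int( $want * $avg );          # e.g. 9500 * 11.9 = 113050
}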
Of course, all of these approaches have to be weighed against the actual case at hand.
Roger, I see we think alike. ;-) I guess the approach you take would depend on whether you want to spend the indexing time up front or on the fly.
Re: Fast way to read from file
by Abigail-II (Bishop) on Nov 21, 2003 at 11:58 UTC
If speed is important, don't even bother with a tied object. The tie mechanism is slow.
But your set_line routine utterly confuses me. Seeking goes to *byte* positions - it has no idea what a line means.
Abigail
"But your set_line routine utterly confuses me. Seeking goes to *byte* positions - it has no idea what a line means."
Well, since I know the line I'm currently on, and I know the line I want is earlier, I rewind the filehandle. Seek does that nicely (and fast, unless I'm mistaken).
However, I still read with readline(), and that basically causes I/O for every line. Since I/O is slow and I don't really need the lines between 0 -> wanted (or current -> wanted), I think there should be a way to move faster between newlines *shrug*.
Re: Fast way to read from file
by BrowserUk (Patriarch) on Nov 21, 2003 at 13:23 UTC
I agree with Roger that creating an index to the lines is the fastest way to randomly access the lines in a file.
Unfortunately, unless the lines are fairly long, say >30 chars, an array of line positions takes just as much memory as storing the whole file in an array. And if the file is really large, then you're almost as likely to run out of memory just storing the line offsets as you are storing the lines themselves.
An alternative is to store the offsets as binary values in a single scalar using pack. This requires only 4MB to store the offsets for a million-line file, versus around 60MB to store the same information as an array.
A simple sub using unpack, substr & seek can then be used to read lines by lineno quickly and efficiently.
open IN, '<test1000000.dat' or die $!;
$offsets = pack 'V', 0;                     # line 1 starts at offset 0
$offsets .= pack 'V', tell IN while <IN>;
print length $offsets;                      # prints 4000008

sub readline_n {
    my( $fh, $line ) = @_;
    seek $fh, unpack( 'V', substr( $offsets, --$line * 4, 4 ) ), 0;
    return scalar <$fh>;
}

print readline_n( \*IN, 500000 );           # prints "500000"
Or replace the substr/unpack combination with vec:
vec($offsets, $., 32) = tell IN while <IN>;
...
seek $fh, vec($offsets, $line, 32), 0;
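For completeness, a sketch of the whole thing done with vec (one-based line numbers, as in readline_n above; index 0 is never assigned, so it conveniently stays 0, the start of line 1):
vec($offsets, $., 32) = tell IN while <IN>;   # index n holds the start of line n+1

sub readline_n_vec {
    my( $fh, $line ) = @_;
    seek $fh, vec( $offsets, $line - 1, 32 ), 0;
    return scalar <$fh>;
}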
Good point. I should have thought of that; it's a nice simplification.
It's a shame vec doesn't handle 24-bit integers, else we could cut the memory requirement by another 25% and still handle 16-million-line files, which is probably sufficient for most purposes :)
Maybe I'm missing something, but when I try this code I get the following results on an AIX box using timex:
real 10.10
user 8.02
sys 1.00
Whereas, if I just do something "simple" like:
open IN, '<test.dat' or die $!;
while (<IN>) {
    if ($. == 500000) { print; last }   # $. is the current line number
}
I get:
real 1.70
user 1.44
sys 0.06
???
From my understanding of the OP's post, his requirement was to be able to randomly access the file by line number.
It will obviously be slower than a simple sequential read of the file, as it has to do one such read in order to build the index. The benefit only comes when subsequent reads re-read individual records in random order.
Re: Fast way to read from file
by bschmer (Friar) on Nov 21, 2003 at 12:30 UTC
It seems to me that you are taking the fastest approach possible, given the assumption that the rows in your file are of varying lengths. If the rows were all the same length, then it would be very easy (and quick) to get to the point in the file that you want to read.
It seems that you are performing multiple reads from this large file, otherwise the performance wouldn't really be an issue. Is there any way that you can batch up your reads so that you can leverage the scan through the file? For example, reading rows 1, 2, 4 and 16 by grabbing rows 1, 2 and 4 on the way to getting 16 will be faster than reading each of the rows individually.
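A sketch of that batching idea (read_rows is a hypothetical helper that collects all the wanted rows in one sequential pass):
sub read_rows {
    my ($fh, @rows) = @_;                # one-based line numbers
    my %want = map { $_ => 1 } @rows;
    my (%got, $n);
    seek $fh, 0, 0;
    while (defined( my $line = <$fh> )) {
        next unless $want{++$n};
        $got{$n} = $line;
        last if keys %got == keys %want; # stop as soon as we have them all
    }
    return @got{@rows};                  # in the order they were asked for
}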
If you have the memory and are doing enough row accesses, another approach would be to save the offset of each line the first time through that part of the file, so that the next time you retrieve a row you've already seen, you can jump right to it. I'm thinking of something like:
my @offsets = (0);    # $offsets[$n] holds the byte offset where line $n starts

sub Set_line {
    my ($fh, $cur_r, $row_r) = @_;   # filehandle, refs to current and wanted line numbers
    if (defined $offsets[$$row_r]) {
        # We know where the row is, so just go there.
        seek $fh, $offsets[$$row_r], 0;
        $$cur_r = $$row_r;
        return;
    }
    if ($$cur_r > $$row_r or $$cur_r > $#offsets) {
        # We don't know where it is, so we should start at the end of the
        # area that we've indexed.
        seek $fh, $offsets[-1], 0;
        $$cur_r = $#offsets;
    }
    # Read forward, recording each new offset as we go.
    while ($$cur_r < $$row_r) {
        defined readline($fh) or last;
        $offsets[++$$cur_r] = tell $fh;
    }
    return;
}
Or something like that - kind of on-the-fly indexing. This code isn't tested, but the idea is there. It should give you some performance improvement at the cost of some memory. Or you could tie the offsets array to a file to build a persistent index for future use.
Re: Fast way to read from file
by duff (Parson) on Nov 21, 2003 at 16:00 UTC
One optimization might be to keep a cache of line positions. You could do it as others have suggested, by caching all of the line offsets; or, if you're not interested in reading all of the lines but just a few in localized areas, cache the offsets of only the lines you actually read. Then if you've read line 57 and need to read line 63, you can start reading from the cached position of line 57. Of course, there will be some overhead in looking up these cached positions, but it may well be cheaper than reading all of the lines over again, or than storing offsets for every line. It all depends on your application.
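A sketch of that localized caching (hypothetical names; the lookup here is a simple scan of the cached keys, which is fine while the cache stays small):
my %offset_of = ( 1 => 0 );   # line number => byte offset of its start

sub read_line_cached {
    my ($fh, $want) = @_;
    # Start from the nearest cached line at or before the one we want.
    my ($near) = sort { $b <=> $a } grep { $_ <= $want } keys %offset_of;
    seek $fh, $offset_of{$near}, 0;
    for my $n ($near .. $want - 1) {
        my $skip = <$fh>;
        return undef unless defined $skip;    # file too short
        $offset_of{ $n + 1 } = tell $fh;      # cache as we go
    }
    return scalar <$fh>;
}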
Re: Fast way to read from file
by b10m (Vicar) on Nov 21, 2003 at 11:58 UTC
"Atm. I'm reading info on Tie::File (which should allow read/write, but I'm not interested in writing to that file)."
Then it would seem to me that the tie is not necessary.
A quick-n-dirty way would be something like:
my $i = 0;       # counter that will increment with each line
my $q = 33;      # the line you want
my $line;        # this variable will hold your line in the end

open FILE, "<$file" or die $!;
while (<FILE>) {
    if ($q == $i) { $line = $_; last }   # stop once we have it
    $i++;
}
close FILE;
But I'm pretty sure there is a "nicer" way to handle this. And I'm quite sure I'm missing the true question here :(
Re: Fast way to read from file
by Taulmarill (Deacon) on Nov 21, 2003 at 13:01 UTC
Hm, don't know if it helps, but I use this nice one-liner:
perl -e'open(FL,"primes.txt");$. == 10000 && ($line = $_) while <FL>;print("$line\n");'
It takes < 1 sec. to find the 10,000th line and uses very little memory.
Re: Fast way to read from file
by jonadab (Parson) on Nov 21, 2003 at 14:55 UTC
The solutions others have pointed out are the best you can do if you are reading a Unix-type stream file. (If you are on Unix or Windows, this is the normal type of file.) Some platforms, however, have other types of files (and yes, other types of text files) that are stored differently at the filesystem level, in such a way that seeking to a specific line is much less expensive. The index solution is an attempt to fake this, but it's less efficient than the real thing. I don't know how well Perl supports those filetypes, though, and in any case it doesn't help you if you're on Unix or Windows. But if you're developing something that's going to be deployed as a "solution" and you haven't solidified what platform you're going to run it on yet, this is something to consider.
There are also other ways to fake it, besides the index method. If you can pin down the maximum line length, for example, and if you have control over the thing that creates these files, you could use a fixed "record" (line) length, which makes seeking to a specific line very fast (O(1) time). It also makes your file take up more total space, up to twice as much or more, depending on the ratio between typical and maximum line length; but depending on what you're doing, it may be a good deal cheaper to get a bigger disk than a faster one, so this could be a good tradeoff.
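For instance, with (assumed) 80-byte space-padded records, the seek is just arithmetic - a sketch with made-up names:
my $RECLEN = 80;                   # fixed record length, newline included

sub read_record {
    my ($fh, $n) = @_;             # $n is zero-based
    seek $fh, $n * $RECLEN, 0;     # O(1): straight to the record
    read $fh, my $rec, $RECLEN;
    $rec =~ s/ *\n?$//;            # strip the padding and newline
    return $rec;
}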