Re: Fast way to read from file
by Roger (Parson) on Nov 21, 2003 at 12:14 UTC
There is a fast way to seek to the desired line and read a given number of lines. It involves a pre-step. The basic idea is as follows:
Pre-step (once-only initialization)
Build an array of line offsets, @offset, as you loop through the file on the first pass, so that $offset[120] is the byte offset of the 121st line in the file (zero-based array index).
my $curpos = 0;
my @offset;
$offset[0] = 0;               # the 1st line always starts at position 0
while (<$fh>) {
    $offset[++$curpos] = tell $fh;
}
Fast-read (repeatedly called)
Seek to the byte offset of the desired start line with a single seek -
seek $fh, $offset[$cur], 0;
- then read the next $row - $cur + 1 lines, as in the sketch below.
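Putting the two steps together, a sketch of the fast-read step, assuming the @offset array from the pre-step and zero-based line numbers $cur .. $row (read_lines is just a made-up name):
sub read_lines {
    my ($fh, $cur, $row) = @_;
    seek $fh, $offset[$cur], 0;                 # one seek to the start line
    return map { scalar <$fh> } $cur .. $row;   # $row - $cur + 1 readlines
}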
Yep, I considered that, but I was worried about the index size on huge files. To be fast I'd need to record the offset from the beginning for every line, and that might grow pretty big. Although I don't know how much memory it takes to store large values (e.g. a million-cell array with large values). What I'm hoping for here is a way to "seek newlines" :).
I think if you want the fastest seek-to-line on the fly without massive memory overhead, you could go the XS approach. Write a C function somewhat like this -
#include <stdio.h>

/* Seek the stream to the start of line lineno (1-based). */
int seek_line(FILE *f, long lineno)
{
    char buffer[65536];                 /* 64k buffer */
    long found = 1;                     /* lines seen so far */
    size_t n, i;

    fseek(f, 0, SEEK_SET);
    if (lineno <= 1) return 0;

    /* read in 64k of text at a time and count the \n's;
       if not enough lines yet, read the next 64k block and repeat */
    while ((n = fread(buffer, 1, sizeof buffer, f)) > 0) {
        for (i = 0; i < n; i++) {
            if (buffer[i] == '\n' && ++found == lineno) {
                /* found the desired line: fseek to its absolute
                   position in the file and return */
                fseek(f, (long)(i + 1) - (long)n, SEEK_CUR);
                return 0;
            }
        }
    }
    return -1;                          /* not enough lines in the file */
}
Expose this function in Perl, so that you can do
seek_line($fh, $line_no);
# and then read as normal
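If XS is a step too far, a sketch of the same block-scan in pure Perl, using read and index (seek_line and the 64k buffer size are just illustrative):
sub seek_line {
    my ($fh, $lineno) = @_;              # $lineno is 1-based
    seek $fh, 0, 0;
    return 1 if $lineno <= 1;
    my $seen = 1;
    while (read $fh, my $buf, 65536) {   # 64k of text at a time
        my $pos = -1;
        while (($pos = index $buf, "\n", $pos + 1) >= 0) {
            if (++$seen == $lineno) {
                # seek back to just past this newline and return
                seek $fh, $pos + 1 - length($buf), 1;
                return 1;
            }
        }
    }
    return 0;                            # fewer than $lineno lines
}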
If you can spend the time (and the I/O) for a complete read before you start to use the file, you do not actually need to create a complete index. You could have an index covering every 10th (or 100th) line, so that you know approximately where you are reading and only have to read a small number of useless lines before the one you are actually looking for. This might be a viable solution.
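A sketch of that sparse-index idea, assuming a granularity of 100 lines, zero-based line numbers, and made-up names:
my $N = 100;
my @sparse = (0);                     # offset of line 0
while (<$fh>) {
    push @sparse, tell $fh if $. % $N == 0;
}

sub read_line_sparse {
    my ($fh, $line) = @_;
    seek $fh, $sparse[ int($line / $N) ], 0;   # jump to the nearest anchor
    <$fh> for 1 .. $line % $N;                 # skip at most N-1 useless lines
    return scalar <$fh>;
}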
If each of your lines carries a line number that can be read unambiguously, you could use a binary partition method to find your line in about log2(n) passes without using an index. You should:
- Get the total size of the file
- Read the first complete line that starts after the midpoint of the file
- If its number is above the line you are looking for, repeat the process on the first half of the file;
- if it is below, repeat on the second half;
- if it is the one, return it and exit
You have to take care that if the line you are looking for does not exist in the file, you do not loop indefinitely (maybe set a maximum number of steps before bailing out).
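A sketch of that binary partition, assuming each line starts with its own ascending number, e.g. "10038 some text" (find_line is a made-up name):
sub find_line {
    my ($fh, $want) = @_;
    my ($lo, $hi) = (0, -s $fh);
    my $steps = 64;                        # bail out rather than loop forever
    while ($lo < $hi && $steps-- > 0) {
        my $mid = int( ($lo + $hi) / 2 );
        seek $fh, $mid, 0;
        <$fh> if $mid > 0;                 # discard the partial line
        my $line = <$fh>;
        if (!defined $line) {              # ran off the end; look lower
            $hi = $mid;
            next;
        }
        my ($num) = $line =~ /^(\d+)/ or return undef;
        return $line if $num == $want;
        if ($num < $want) { $lo = $mid + 1 }
        else              { $hi = $mid     }
    }
    return undef;                          # not found within the step limit
}
Note the usual bisection caveat: a probe that lands exactly on the start of the wanted line discards it as a "partial" line, so a real implementation needs a little more care around the boundaries.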
If the length of the lines within the file is expected to be more or less even, you could also think about a prediction method to get to the expected point. It goes like this:
- We read a line in the middle of the file
- We now know its line number (say, 10038) and that it is located at offset 120046
- So we expect the average line to be about 11.9 chars long
- If we are looking for line 9500, we start seeking at offset 9500 * 11.9 = 113050, and then we adjust our estimate based on what we actually find
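In code, the prediction step might look like this rough sketch (guess_offset is a made-up helper that just makes the first jump; the caller refines from there; lines are assumed to start with their own number):
sub guess_offset {
    my ($fh, $want) = @_;
    seek $fh, int( (-s $fh) / 2 ), 0;    # jump to the middle of the file
    <$fh>;                               # discard the partial line
    my $pos   = tell $fh;                # start of a complete line
    my $probe = <$fh>;
    my ($num) = ($probe || '') =~ /^(\d+)/ or return 0;
    my $avg = $pos / $num;               # e.g. 120046 / 10038, about 11.9
    return int( $want * $avg );          # e.g. 9500 * 11.9 = 113050
}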
Of course, all of these approaches have to be weighed against the actual case at hand.
Roger, I see we think alike. ;-) I guess the approach you take would depend on whether you want to spend the indexing time up front or on the fly.
Re: Fast way to read from file
by Abigail-II (Bishop) on Nov 21, 2003 at 11:58 UTC
If speed is important, don't even bother with a tied object. The tie mechanism is slow.
But your set_line routine utterly confuses me. Seeking goes to *byte* positions - it has no idea what a line means.
Abigail
"But your set_line routine utterly confuses me. Seeking goes to *byte* positions - it has no idea what a line means."
Well, since I know the line I'm currently on, and I know the line I want is earlier, I rewind the filehandle. Seek does that nicely (and fast, unless I'm mistaken).
However, I still read with readline(), and that basically causes I/O for every line. Since I/O is slow and I don't really need the lines between 0 -> wanted (or current -> wanted), I think there should be a way to move faster between newlines *shrug*.
Re: Fast way to read from file
by BrowserUk (Patriarch) on Nov 21, 2003 at 13:23 UTC
I agree with Roger that creating an index to the lines is the fastest way to randomly access the lines in a file.
Unfortunately, unless the lines are fairly long, say >30 chars, an array of line positions takes just as much memory as storing the whole file in an array. And if the file is really large, then you're almost as likely to run out of memory just storing the line offsets as you are storing the lines themselves.
An alternative is to store the offsets as binary values in a single scalar using pack. This requires only 4MB to store the offsets for a million-line file, versus around 60MB to store the same information as an array.
A simple sub using unpack, substr & seek can then be used to read lines by lineno quickly and efficiently.
open IN, '<test1000000.dat' or die $!;
$offsets = pack 'V', 0;                     # line 1 starts at offset 0
$offsets .= pack 'V', tell IN while <IN>;
print length $offsets;                      # prints 4000008

sub readline_n {
    my( $fh, $line ) = @_;
    seek $fh, unpack( 'V', substr( $offsets, --$line * 4, 4 ) ), 0;
    return scalar <$fh>;
}

print readline_n( \*IN, 500000 );           # prints "500000"
Or replace the substr/unpack combination with vec:
vec($offsets, $., 32) = tell IN while <IN>;
...
seek $fh, vec($offsets, $line, 32), 0;
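For completeness, a sketch of the whole thing done with vec (one-based line numbers, as in readline_n above; index 0 is never assigned, so it conveniently stays 0, the start of line 1):
vec($offsets, $., 32) = tell IN while <IN>;   # index n holds the start of line n+1

sub readline_n_vec {
    my( $fh, $line ) = @_;
    seek $fh, vec( $offsets, $line - 1, 32 ), 0;
    return scalar <$fh>;
}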
Good point. I should have thought of that; it's a nice simplification.
It's a shame vec doesn't handle 24-bit integers, else we could cut the memory requirement by another 25% and still handle 16-million-line files, which is probably sufficient for most purposes :)
Maybe I'm missing something, but when I try this code I get the following results on an AIX box using timex:
real 10.10
user 8.02
sys 1.00
Whereas, if I just do something "simple" like:
open IN, '<test.dat' or die $!;
while (<IN>) {
    if ($. == 500000) { print; last }   # $. is the current line number
}
I get:
real 1.70
user 1.44
sys 0.06
???
From my understanding of the OP's post, his requirement was to be able to randomly access the file by line number.
It will obviously be slower than a simple sequential read of the file, as it has to do one such read in order to build the index. The benefit only comes when subsequent reads re-read individual records in random order.
Re: Fast way to read from file
by bschmer (Friar) on Nov 21, 2003 at 12:30 UTC
It seems to me that you are taking the fastest approach possible, given the assumption that the rows in your file are of varying lengths. If the rows were all the same length, then it would be very easy (and quick) to get to the point in the file that you want to read.
It seems that you are performing multiple reads from this large file, otherwise the performance wouldn't really be an issue. Is there any way that you can batch up your reads so that you can leverage the scan through the file? For example, reading rows 1, 2, 4 and 16 by grabbing rows 1, 2 and 4 on the way to getting 16 will be faster than reading each of the rows individually.
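A sketch of that batching idea (read_rows is a hypothetical helper that collects all the wanted rows in one sequential pass):
sub read_rows {
    my ($fh, @rows) = @_;                # one-based line numbers
    my %want = map { $_ => 1 } @rows;
    my (%got, $n);
    seek $fh, 0, 0;
    while (defined( my $line = <$fh> )) {
        next unless $want{++$n};
        $got{$n} = $line;
        last if keys %got == keys %want; # stop as soon as we have them all
    }
    return @got{@rows};                  # in the order they were asked for
}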
If you have the memory and are doing enough row accesses, another approach would be to save the offset of each line the first time through that part of the file, so that the next time you retrieve a row you've already seen, you can jump right to it. I'm thinking of something like:
my @offsets = (0);    # $offsets[$n] holds the byte offset where line $n starts

sub Set_line {
    my ($fh, $cur_r, $row_r) = @_;   # filehandle, refs to current and wanted line numbers
    if (defined $offsets[$$row_r]) {
        # We know where the row is, so just go there.
        seek $fh, $offsets[$$row_r], 0;
        $$cur_r = $$row_r;
        return;
    }
    if ($$cur_r > $$row_r or $$cur_r > $#offsets) {
        # We don't know where it is, so we should start at the end of the
        # area that we've indexed.
        seek $fh, $offsets[-1], 0;
        $$cur_r = $#offsets;
    }
    # Read forward, recording each new offset as we go.
    while ($$cur_r < $$row_r) {
        defined readline($fh) or last;
        $offsets[++$$cur_r] = tell $fh;
    }
    return;
}
Or something like that - kind of on-the-fly indexing. This code isn't tested, but the idea is there. It should give you some performance improvement at the cost of some memory. Or you could tie the offsets array to a file to build a persistent index for future use.
Re: Fast way to read from file
by duff (Parson) on Nov 21, 2003 at 16:00 UTC
One optimization might be to keep a cache of line positions. You could do it as others have suggested, by caching all of the line offsets; or, if you're not interested in reading all of the lines but just a few in localized areas, cache the offsets of only the lines you actually read. Then if you've read line 57 and need to read line 63, you can start reading from the cached position of line 57. Of course, there will be some overhead in looking up these cached positions, but it may well be cheaper than reading all of the lines over again, or than storing offsets for every line. It all depends on your application.
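A sketch of that localized caching (hypothetical names; the lookup here is a simple scan of the cached keys, which is fine while the cache stays small):
my %offset_of = ( 1 => 0 );   # line number => byte offset of its start

sub read_line_cached {
    my ($fh, $want) = @_;
    # Start from the nearest cached line at or before the one we want.
    my ($near) = sort { $b <=> $a } grep { $_ <= $want } keys %offset_of;
    seek $fh, $offset_of{$near}, 0;
    for my $n ($near .. $want - 1) {
        my $skip = <$fh>;
        return undef unless defined $skip;    # file too short
        $offset_of{ $n + 1 } = tell $fh;      # cache as we go
    }
    return scalar <$fh>;
}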
Re: Fast way to read from file
by b10m (Vicar) on Nov 21, 2003 at 11:58 UTC
"Atm. I'm reading info on Tie::File (which should allow read/write, but I'm not interested in writing to that file)."
Then it would seem to me that the tie is not necessary.
A quick-n-dirty way would be something like:
my $i = 0;       # counter that will increment with each line
my $q = 33;      # the line you want
my $line;        # this variable will hold your line in the end

open FILE, "<$file" or die $!;
while (<FILE>) {
    if ($q == $i) { $line = $_; last }   # stop once we have it
    $i++;
}
close FILE;
But I'm pretty sure there is a "nicer" way to handle this. And I'm quite sure I'm missing the true question here :(
Re: Fast way to read from file
by Taulmarill (Deacon) on Nov 21, 2003 at 13:01 UTC
Hm, don't know if it helps, but I use this nice one-liner:
perl -e'open(FL,"primes.txt");$. == 10000 && ($line = $_) while <FL>;print("$line\n");'
It takes < 1 sec. to find the 10,000th line and uses very little memory.
Re: Fast way to read from file
by jonadab (Parson) on Nov 21, 2003 at 14:55 UTC
The solutions others have pointed out are the best you can do if you are reading a Unix-type stream file. (If you are on Unix or Windows, this is the normal type of file.) Some platforms, however, have other types of files (and yes, other types of text files) that are stored differently at the filesystem level, in such a way that seeking to a specific line is much less expensive. The index solution is an attempt to fake this, but it's less efficient than the real thing. I don't know how well Perl supports those filetypes, though, and in any case it doesn't help you if you're on Unix or Windows. But if you're developing something that's going to be deployed as a "solution" and you haven't solidified what platform you're going to run it on yet, this is something to consider.
There are also other ways to fake it, besides the index method. If you can pin down the maximum line length, for example, and if you have control over the thing that creates these files, you could use a fixed "record" (line) length, which makes seeking to a specific line very fast (O(1) time). It also makes your file take up more total space, up to twice as much or more, depending on the ratio between typical and maximum line length; but depending on what you're doing, it may be a good deal cheaper to get a bigger disk than a faster one, so this could be a good tradeoff.
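For instance, with (assumed) 80-byte space-padded records, the seek is just arithmetic - a sketch with made-up names:
my $RECLEN = 80;                   # fixed record length, newline included

sub read_record {
    my ($fh, $n) = @_;             # $n is zero-based
    seek $fh, $n * $RECLEN, 0;     # O(1): straight to the record
    read $fh, my $rec, $RECLEN;
    $rec =~ s/ *\n?$//;            # strip the padding and newline
    return $rec;
}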