in reply to Fast way to read from file

There is a fast way to seek to desired line and read given number of lines. It involves a pre-step. The basic idea is as follows:

Pre-step (once only initialization step)
Build an array of line offsets, @offset, as you loop through the file on the first pass. So $offset[120] is the byte offset of the 121st line in the file (zero based array index).
my $curpos = 0; my @offset; $offset[0] = 0; # 1st line always start at pos 0 while (<$fh>) { $offset[++$curpos] = tell $fh; }
Fast-read (repeatedly called)
Seek to the byte offset for the desired start line, with a single seek -
seek $fh, $offset[$cur], 0;


Read the next $row - $cur + 1 number of lines

Replies are listed 'Best First'.
Re: Re: Fast way to read from file
by Hena (Friar) on Nov 21, 2003 at 12:22 UTC
    Yep. I considered that. But was worried on huge files index size. In order to be fast i'd need to record the offset from beginning for every line and that might grow to be pretty big. Although i don't know how much it takes memory to store large values (eg. million cell array, with large values). What i'm hoping here is a way to "seek newlines" :).
      I think if you want the fastest seek to line on the fly without massive memory overhead, you could go the XS approach. Write a C function somewhat like this -
      seek_line(FILE *f, long lineno) { char buffer[65536]; # 64k buffer fseek(f, 0, SEEK_SET); // read in 64 k of text at a time fread(buffer, 65536, 1, f); // then count number of \n's // if found the desired number of lines, then // record the position found, fseek to that // position in file, and then return // if not enough lines, read next block of 64k text // and repeat until you have the desired number of // lines. fseek to the absolute position in file // for the desired line, and then return }
      Expose this function in Perl, so that you can do
      seek_line($fh, $line_no); # and then read as normal
        Except it's almost certainly going to be slower. Why? Because perl is optimized for fast I/O access. It has all sorts of smarts to let you access your data as fast as possible. I wouldn't be surprised if b10m's simple line-reading approach is the quickest naive approach since it takes full advantage of perl's I/O capabilities.
        HTH

        _________
        broquaint

      If you can spend the time (and the I/O) for a complete read before you start to use the file, you do not actually need to create a complete index. You could have an index that indexes every 10 (or 100) lines, so that you approximately know where you are reading, so you only have to read a small number of useless lines before the one you are actually looking for. This might be a viable solution.

      If your lines have a line number that's univocally understandable, you could use a binary partition method to find your line in about log2(n) passes without using an index. You should:

      • Get the total size of the file
      • Read the first complete line that's about the half of the file
      • If it's over the line you are looking for, repeat the process seeking the middle point between the middle of the file and the end of the file;
      • if it's below, repeat seeking the middle point between 0 and the middle of the file
      • if it's it, you return it and exit
      You have to take care that if the line you are looking for does not exist in the file, you cannot loop indefinitely (maybe set a maximum number of steps before bailing out).

      If the line length of the lines within the files is expected to be more or less even, you could also think about a prediction method to get the expected point. It goes like this:

      • We read a line in the middle of the file
      • We now know it's line number (say, 10038) and that it's located at offset 120046
      • We expect the average line to be 11.9 chars long
      • If we are looking for line 9500, we start seeking at offset 9500*11.9 = 113050 and the we adjust our estimate based on what we actually find
      Of course all those things have to be thought of seeing the actual case.
Re: Re: Fast way to read from file
by bschmer (Friar) on Nov 21, 2003 at 12:34 UTC
    Roger, I see we think alike. ;-) I guess the approach you take would depend on if you want to spend the indexing time up front or on the fly.