in reply to Fast way to read from file

I agree with Roger that creating an index to the lines is the fastest way to randomly access the lines in a file.

Unfortunately, unless the lines are fairly long, say more than 30 characters, an array of line positions takes just as much memory as storing the whole file in an array. And if the file is really large, then you're almost as likely to run out of memory storing the line offsets as you are storing the lines themselves.

An alternative is to store the offsets as binary values in a single scalar using pack. This requires only 4MB to store the offsets for a million-line file, versus roughly 60MB to hold the same information in an array.
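
As a rough sanity check of those numbers (this snippet isn't from the original post; it assumes the CPAN module Devel::Size is installed):

use Devel::Size qw(total_size);

my @offsets_array  = ( 0 ) x 1_000_000;              # one scalar per offset
my $offsets_packed = pack 'V*', ( 0 ) x 1_000_000;   # 4 bytes per offset

printf "array: %d bytes  packed: %d bytes\n",
    total_size( \@offsets_array ), total_size( \$offsets_packed );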

A simple sub using unpack, substr & seek can then be used to read lines by line number quickly and efficiently.

open IN, '<test1000000.dat' or die $!;

$offsets = pack 'V', 0;
$offsets .= pack 'V', tell IN while <IN>;

print length $offsets;    # 4000008

sub readline_n {
    my( $fh, $line ) = @_;
    seek $fh, unpack( 'V', substr( $offsets, --$line * 4, 4 ) ), 0;
    scalar <$fh>;
}

print readline_n( \*IN, 500000 );    # prints the 500,000th line: 500000

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail

Re: Re: Fast way to read from file
by Anonymous Monk on Nov 21, 2003 at 14:49 UTC
    Or replace the substr and unpack with vec:
    vec( $offsets, $., 32 ) = tell IN while <IN>;
    ...
    seek $fh, vec( $offsets, $line, 32 ), 0;

      Good point. I should have thought of that, it's a nice simplification.

      It's a shame vec doesn't handle 24-bit integers, else we could cut the memory requirement by another 25% and still handle 16-million-line files, which is probably sufficient for most purposes :)
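
      For what it's worth, here's a hypothetical sketch (not from this thread) of handling 24-bit offsets by hand with pack and substr, since vec only supports power-of-two bit widths; it assumes every offset fits in 24 bits:

      my $offsets24 = '';
      # append the low 3 bytes of a 32-bit little-endian value
      sub store24 { $offsets24 .= substr( pack( 'V', $_[0] ), 0, 3 ) }
      # pad the 3 bytes back out to 4 before unpacking
      sub fetch24 { unpack 'V', substr( $offsets24, $_[0] * 3, 3 ) . "\0" }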



Re: Re: Fast way to read from file
by Anonymous Monk on Nov 21, 2003 at 18:03 UTC
    Maybe I'm missing something, but when I try this code I get the following results on an AIX box using timex:

    real 10.10
    user 8.02
    sys 1.00

    Whereas if I just do something "simple" like:

    open IN, '<test.dat' or die $!;
    while (<IN>) {
        print and last if $. == 500000;   # $. is Perl's built-in line counter
    }

    I get:

    real 1.70
    user 1.44
    sys 0.06

    ???

      From my understanding of the OP's post, his requirement was that he be able to randomly access the file by line number.

      It will obviously be slower than a single sequential read of the file, as it has to make one such pass in order to build the index. The benefit only shows once the index is subsequently used to re-read individual records in random order.
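
      As a rough illustration (assuming the $offsets index and the readline_n() sub from the parent post), the one-off sequential pass is then amortised over many cheap random-order reads:

      my @wanted = map { 1 + int rand 1_000_000 } 1 .. 10_000;   # random line numbers
      print readline_n( \*IN, $_ ) for @wanted;                  # each read is just a seek + readline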

