
I agree with Roger that creating an index to the lines is the fastest way to randomly access the lines in a file.

Unfortunately, unless the lines are fairly long, say >30 chars, an array of line positions takes just as much memory as storing the whole file in an array. And if the file is really large, then you're almost as likely to run out of memory storing the line offsets as you are storing the lines themselves.

An alternative is to store the offsets as binary values in a single scalar using pack. This requires only 4MB to hold the offsets for a million-line file, versus the 60MB needed to store the same information as an array.
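To see where the 4MB figure comes from: pack's 'V' template encodes each value as a fixed 4-byte unsigned little-endian integer, so N offsets occupy 4*N bytes of one string. Here is a minimal sketch of the round trip; the offset values in @starts are made up purely for illustration.

```perl
use strict;
use warnings;

# Hypothetical line-start offsets, purely for illustration.
my @starts = ( 0, 12, 40, 41, 1000 );

# Pack each offset as a 4-byte unsigned little-endian integer ('V')
# and append it to a single scalar.
my $packed = '';
$packed .= pack 'V', $_ for @starts;

print length( $packed ), "\n";    # 4 bytes per offset: 5 * 4 = 20

# Fetch the third offset (index 2) back out of the string.
my $third = unpack 'V', substr( $packed, 2 * 4, 4 );
print "$third\n";                 # 40
```

Note that 'V' is only 32 bits wide, so it can only represent offsets below 4GB. On a perl built with 64-bit integer support, 'Q' (8 bytes per offset) should work for larger files, at twice the index size.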

A simple sub using unpack, substr & seek can then be used to read lines by lineno quickly and efficiently.

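use strict;
use warnings;

open my $in, '<', 'test1000000.dat' or die $!;

# Build the index: one 4-byte little-endian offset ('V') per line start.
# Offset 0 is line 1; after each read, tell gives the start of the next line.
my $offsets = pack 'V', 0;
$offsets .= pack 'V', tell( $in ) while <$in>;

print length $offsets;    # prints 4000008

# Return line $line (1-based) by seeking to its stored offset.
sub readline_n {
    my( $fh, $line ) = @_;
    seek $fh, unpack( 'V', substr( $offsets, --$line * 4, 4 ) ), 0;
    scalar <$fh>;
}

print readline_n( $in, 500000 );    # prints 500000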
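
If you want to try this without a million-line file to hand, here is a self-contained sketch of the same technique. It is not the code above verbatim: the temp file, the line_at name and the range check are my own additions for illustration, and the index is passed in as an argument rather than read from a global.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small throwaway file whose line N contains the text "line N".
my ( $tmp, $path ) = tempfile( UNLINK => 1 );
print {$tmp} "line $_\n" for 1 .. 1000;
close $tmp or die $!;

open my $in, '<', $path or die $!;

# Build the packed index of line-start offsets.
my $index = pack 'V', 0;
$index .= pack 'V', tell( $in ) while <$in>;

# Return line $n (1-based), or nothing if $n is out of range.
# line_at is a hypothetical name, not part of any module.
sub line_at {
    my ( $fh, $idx, $n ) = @_;
    # The last entry in the index is the end-of-file position, not a line.
    my $lines = length( $idx ) / 4 - 1;
    return if $n < 1 || $n > $lines;
    seek $fh, unpack( 'V', substr( $idx, ( $n - 1 ) * 4, 4 ) ), 0 or die $!;
    return scalar <$fh>;
}

print line_at( $in, $index, 500 );    # line 500
print line_at( $in, $index, 1 );      # line 1
print defined( line_at( $in, $index, 1001 ) ) ? "found\n" : "out of range\n";
```

The range check matters because the build loop also records the position after the final line, so the index always holds one more entry than the file has lines.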

Examine what is said, not who speaks.

"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
Hooray!
Wanted!
