I thought of doing something similar with an array, but decided the memory cost would be too high. That was only a guess, but even your much more compact idea of a packed string will occupy 40MB for a ten-million-line data file, at four bytes per offset. That's not necessarily prohibitive, but it could be on a busy machine.
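As a rough check of that figure, here is a minimal sketch that packs a small sample of offsets and extrapolates. The sample size and variable names are only illustrative. The width of 'i' is whatever the local C compiler uses for an int, which is usually 4 bytes but is not guaranteed.

```perl
use strict;
use warnings;

# Pack a small sample of offsets the same way the index would be built.
my $sample = 1_000;
my $packed = join '', map { pack 'i', $_ } 1 .. $sample;

# Bytes per offset on this platform (normally 4 for 'i').
my $width = length($packed) / $sample;

# Extrapolate to the ten-million-line file discussed above.
my $lines = 10_000_000;
printf "%d bytes per offset, about %d MB for %d lines\n",
    $width, $width * $lines / 1_000_000, $lines;
```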
Here's a short bit to write the packed offsets to a file:
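#!/usr/bin/perl
use strict;
use warnings;

open my $out, '>:raw', '/path/to/data.offsets'
    or die $!;
open my $in, '<', '/path/to/data.dat'
    or die $!;
local ($_, $\);            # keep $_ private; make sure print appends nothing
my $offset = 0;            # start of the line about to be read
while (<$in>) {
    # 'i' is a native signed int: assumed 4 bytes, so offsets must stay below 2GB
    print $out pack 'i', $offset;
    $offset = tell $in;    # start of the next line
}
close $in  or warn $!;
close $out or warn $!;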
That file can be read and used like this:
my $index = do {
    local $/;              # slurp mode
    open my $idx, '<:raw', '/path/to/data.offsets'
        or die $!;
    <$idx>                 # $idx is closed automatically when the block exits
};
open my $dat, '<', '/path/to/data.dat'
    or die $!;
# pick a random 4-byte record; int() keeps the position on a record boundary
my ($offset) = unpack 'i',
    substr $index, 4 * int(rand(length($index) / 4)), 4;
seek $dat, $offset, 0 or warn $!;
print scalar <$dat> or warn $!;
close $dat or warn $!;
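If holding the whole index in memory is the worry, one option is not to slurp it at all: seek to record n in the offsets file and read only those four bytes. The following is a sketch under a few assumptions, not a drop-in replacement. It uses 'N' (unsigned 32-bit, big-endian) instead of 'i' so that every record is exactly 4 bytes on any platform, and the temporary files, the sample data, and the fetch_line helper are all made up for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Build a small data file (hypothetical sample data).
my ($dat_fh, $dat_name) = tempfile(UNLINK => 1);
binmode $dat_fh;
print {$dat_fh} "line $_\n" for 1 .. 5;
close $dat_fh or die $!;

# Build its index: one 4-byte record per line, holding the line's start offset.
my ($idx_fh, $idx_name) = tempfile(UNLINK => 1);
binmode $idx_fh;
open my $in, '<:raw', $dat_name or die $!;
my $pos = 0;
while (<$in>) {
    print {$idx_fh} pack 'N', $pos;   # 'N' is always exactly 4 bytes
    $pos = tell $in;
}
close $in     or die $!;
close $idx_fh or die $!;

# Return line $n (0-based), reading only 4 bytes of the index.
sub fetch_line {
    my ($dat_path, $idx_path, $n) = @_;
    open my $idx, '<:raw', $idx_path or die $!;
    open my $dat, '<:raw', $dat_path or die $!;
    seek $idx, 4 * $n, 0 or die $!;
    my $got = read $idx, my $packed, 4;
    die "no such line: $n" unless defined $got and $got == 4;
    my $offset = unpack 'N', $packed;
    seek $dat, $offset, 0 or die $!;
    return scalar <$dat>;
}

# Number of records is the index size divided by the record width.
my $count = (-s $idx_name) / 4;
print fetch_line($dat_name, $idx_name, int rand $count);
```

This trades the 40MB string for one extra seek and a four-byte read per lookup, which is probably the better deal when lookups are occasional rather than in a tight loop.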