in reply to Approach to efficiently choose a random line from a large file

Updated: Changed 'minimum' to 'maximum' as appropriate.

By picking a random byte position and then using the line that contains that byte as your pick, the bias for or against each line is proportional to its length versus the average length of the lines in the file.

For a fixed-record-length file, all lines have an even chance. If your file contains all the ASCIIized numbers from 1 to 10_000_000, then the longer, higher values have a greater chance of being picked than the shorter, lower numbers, regardless of where they are in the file. The lines in many variable-length record datafiles are roughly the same length within some tolerance, so it's not a very bad method of picking if you don't require statistically random picks.
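To put a number on that bias, here is a small sketch of my own (not from the original post): each line's pick probability is simply its byte count divided by the file's total byte count.

use strict;
use warnings;

# Two lines from the 1 .. 10_000_000 example: "1\n" is 2 bytes and
# "10000000\n" is 9 bytes, so the longer line is picked roughly 4.5
# times as often under byte-position picking.
my @lines = ( "1\n", "10000000\n" );

my $total = 0;
$total += length for @lines;

for my $line ( @lines ) {
    ( my $label = $line ) =~ s/\n//;
    printf "%-8s : %4.1f%%\n", $label, 100 * length( $line ) / $total;
}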

If your file contains variable-width characters, then lines containing larger numbers of 'big' characters gain a further bias over those containing only small ones. seek/tell operate at the byte level, as doing otherwise would make them intolerably slow.
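A quick demonstration of my own (using the core Encode module, not part of the original post): a line of multi-byte characters spans more byte positions than a line with the same number of single-byte characters, so it collects proportionally more of the byte-position "tickets".

use strict;
use warnings;
use Encode qw( encode );

# Four 'e's versus four e-acutes: same character count, but the
# accented line occupies 9 bytes to the plain line's 5 once encoded
# as UTF-8, so byte-position picking favours it.
for my $line ( "eeee\n", "\x{e9}\x{e9}\x{e9}\x{e9}\n" ) {
    printf "%d chars, %d bytes\n",
        length( $line ), length( encode 'UTF-8', $line );
}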

If you can live with those biases, then the problem becomes how to read the line containing the randomly chosen byte efficiently. For the picking to be reasonably fair, your lines need to be roughly the same length. If you can make a reasonable guess as to the maximum length of any line, then you can minimize the amount of seeking and reading you need to do by:

  1. Pick a random byte.
  2. Subtract your maximum guess.
  3. seek to that position.
  4. Read a line from that position.
  5. Read more lines until you've read past the chosen position.

You now have the line containing the randomly chosen byte.

Utilising the knowledge that seek will silently do nothing if the absolute position sought is negative, you can implement that (where $GLEN is your guess at the maximum length) as:

open my $fh, '<', $file or die "$file : $!";

# Pick a random byte position within the file.
my $p = int( 1 + rand -s $fh );

# Step back by the guessed maximum line length; if the target
# position is negative, the seek silently fails and we stay at
# the start of the file.
seek $fh, $p - $GLEN, 0;

# Read the (possibly partial) line we landed in, then keep reading
# until we've consumed the line containing byte $p.
chomp( my $pick = <$fh> );
chomp( $pick = <$fh> ) while tell( $fh ) < $p;

The more accurate your guess, the less reading required. If the guess is conservatively long, then you will read a few extra lines before finding the required one. Your guess must never be shorter!
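If you'd rather trade one extra pass over the file for a guaranteed-safe guess, a hypothetical helper along these lines (my sketch, not part of the original post) finds the true maximum line length in a single scan:

# One pass over the file; the newline is deliberately counted,
# since the seek in the picking code reckons in raw bytes.
sub maxLineLength {
    my( $file ) = @_;
    open my $fh, '<', $file or die "$file : $!";
    my $max = 0;
    while( my $line = <$fh> ) {
        $max = length $line if length $line > $max;
    }
    return $max;
}

You could then set $GLEN from its return value rather than guessing.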

The following does a crude 'count the lines picked' analysis:

#! perl -slw
use strict;

our $GLEN  ||= 10;
our $TRIES ||= 1000;

sub randLine {
    open my $fh, '<', $_[ 0 ] or die "$_[ 0 ] : $!";
    my $p = int( 1 + rand -s $fh );
    seek $fh, $p - $GLEN, 0;
    chomp( my $pick = <$fh> );
    chomp( $pick = <$fh> ) while tell( $fh ) < $p;
    return $pick;
}

my %stats;
for ( 1 .. $TRIES ) {
    $stats{ randLine( $ARGV[ 0 ] ) }++;
}
print "$_ : $stats{ $_ }" for sort keys %stats;
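Judging by the output below, the junk test file holds the ten lines 'line 1' through 'line 10'; something like this one-liner (my reconstruction, the original doesn't show it) would create such a file:

perl -le "print qq{line $_} for 1 .. 10" > junk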

The following sample runs show that over-estimating the maximum length makes no difference to the statistical outcome, and highlight the bias towards longer lines (though over-estimates obviously make it less efficient):

P:\test>414916 -GLEN=10 -TRIES=100000 junk
line 1 : 9964
line 10 : 11168
line 2 : 9862
line 3 : 9999
line 4 : 9889
line 5 : 9853
line 6 : 9749
line 7 : 9963
line 8 : 9751
line 9 : 9802

P:\test>414916 -GLEN=100 -TRIES=100000 junk
line 1 : 9910
line 10 : 11195
line 2 : 9826
line 3 : 9830
line 4 : 9851
line 5 : 9772
line 6 : 9965
line 7 : 9854
line 8 : 9991
line 9 : 9806

And this shows that you get bad results if you underestimate the maximum length:

P:\test>414916 -GLEN=7 -TRIES=100000 junk
ine 1 : 1239
ine 10 : 1199
ine 2 : 1273
ine 3 : 1265
ine 4 : 1233
ine 5 : 1223
ine 6 : 1166
ine 7 : 1245
ine 8 : 1268
ine 9 : 1251
line 1 : 8594
line 10 : 8575
line 2 : 8630
line 3 : 8647
line 4 : 8534
line 5 : 8666
line 6 : 8768
line 7 : 8714
line 8 : 8581
line 9 : 8683
ne 10 : 1246


Re^2: Approach to efficiently choose a random line from a large file
by sgifford (Prior) on Dec 12, 2004 at 08:36 UTC
    By picking a random byte position and then using the line that contains that byte as your pick, the bias for or against each line is proportional to its length versus the average length of the lines in the file.
    An easy way to visualize this is to imagine line 1 with 1 character, and line 2 with 99 characters. Ignoring line endings, you'll get the first line 1% of the time, and the second line 99% of the time.