in reply to Approach to efficiently choose a random line from a large file

Updated: Changed 'minimum' to 'maximum' as appropriate.

By picking a random byte position and then using the line that contains that byte as your pick, the bias for or against each line is proportional to its length versus the average length of the lines in the file.

For a fixed-record-length file, all lines have an even chance. If your file contains all the ASCIIized numbers from 1 to 10_000_000, then the longer, higher values have a greater chance of being picked than the shorter, lower numbers, regardless of where they are in the file. The lines in many variable-length record datafiles are roughly the same length within some tolerance, so it's not a very bad method of picking if you don't require statistically random picks.
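To put a number on that bias, here is a small sketch of my own (not from the original post): each line's pick probability is simply its byte count divided by the file's total byte count.

use strict;
use warnings;

# Two lines from the 1 .. 10_000_000 example: "1\n" is 2 bytes and
# "10000000\n" is 9 bytes, so the longer line is picked roughly 4.5
# times as often under byte-position picking.
my @lines = ( "1\n", "10000000\n" );

my $total = 0;
$total += length for @lines;

for my $line ( @lines ) {
    ( my $label = $line ) =~ s/\n//;
    printf "%-8s : %4.1f%%\n", $label, 100 * length( $line ) / $total;
}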

If your file contains variable-width characters, then lines containing larger numbers of 'big' characters gain a further bias over those containing only small ones. seek/tell operate at the byte level, as doing otherwise would make them intolerably slow.
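A quick demonstration of my own (using the core Encode module, not part of the original post): a line of multi-byte characters spans more byte positions than a line with the same number of single-byte characters, so it collects proportionally more of the byte-position "tickets".

use strict;
use warnings;
use Encode qw( encode );

# Four 'e's versus four e-acutes: same character count, but the
# accented line occupies 9 bytes to the plain line's 5 once encoded
# as UTF-8, so byte-position picking favours it.
for my $line ( "eeee\n", "\x{e9}\x{e9}\x{e9}\x{e9}\n" ) {
    printf "%d chars, %d bytes\n",
        length( $line ), length( encode 'UTF-8', $line );
}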

If you can live with those biases, then the problem becomes how to read the line containing the randomly chosen byte efficiently. For the picking to be reasonably fair, your lines need to be roughly the same length. If you can make a reasonable guess as to the maximum length of any line, then you can minimize the amount of seeking and reading you need to do by:

  1. Pick a random byte.
  2. Subtract your maximum guess.
  3. seek to that position.
  4. Read a line from that position.
  5. Read more lines until you've read past the chosen position.

You now have the line containing the randomly chosen byte.

Utilising the knowledge that seek will silently do nothing if the absolute position sought is negative, you can implement that (where $GLEN is your guess at the maximum length) as:

open my $fh, '<', $file or die "$file : $!";

# Pick a random byte position within the file.
my $p = int( 1 + rand -s $fh );

# Step back by the guessed maximum line length; if the target
# position is negative, the seek silently fails and we stay at
# the start of the file.
seek $fh, $p - $GLEN, 0;

# Read the (possibly partial) line we landed in, then keep reading
# until we've consumed the line containing byte $p.
chomp( my $pick = <$fh> );
chomp( $pick = <$fh> ) while tell( $fh ) < $p;

The more accurate your guess, the less reading required. If the guess is conservatively long, then you will read a few extra lines before finding the required one. Your guess must never be shorter!
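If you'd rather trade one extra pass over the file for a guaranteed-safe guess, a hypothetical helper along these lines (my sketch, not part of the original post) finds the true maximum line length in a single scan:

# One pass over the file; the newline is deliberately counted,
# since the seek in the picking code reckons in raw bytes.
sub maxLineLength {
    my( $file ) = @_;
    open my $fh, '<', $file or die "$file : $!";
    my $max = 0;
    while( my $line = <$fh> ) {
        $max = length $line if length $line > $max;
    }
    return $max;
}

You could then set $GLEN from its return value rather than guessing.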

The following does a crude 'count the lines picked' analysis:

#! perl -slw
use strict;

our $GLEN  ||= 10;
our $TRIES ||= 1000;

sub randLine {
    open my $fh, '<', $_[ 0 ] or die "$_[ 0 ] : $!";
    my $p = int( 1 + rand -s $fh );
    seek $fh, $p - $GLEN, 0;
    chomp( my $pick = <$fh> );
    chomp( $pick = <$fh> ) while tell( $fh ) < $p;
    return $pick;
}

my %stats;
for ( 1 .. $TRIES ) {
    $stats{ randLine( $ARGV[ 0 ] ) }++;
}
print "$_ : $stats{ $_ }" for sort keys %stats;
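Judging by the output below, the junk test file holds the ten lines 'line 1' through 'line 10'; something like this one-liner (my reconstruction, the original doesn't show it) would create such a file:

perl -le "print qq{line $_} for 1 .. 10" > junk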

The following sample runs show that over-estimating the maximum length makes no difference to the statistical outcome, and highlight the bias towards longer lines (though over-estimates obviously make it less efficient):

P:\test>414916 -GLEN=10 -TRIES=100000 junk
line 1 : 9964
line 10 : 11168
line 2 : 9862
line 3 : 9999
line 4 : 9889
line 5 : 9853
line 6 : 9749
line 7 : 9963
line 8 : 9751
line 9 : 9802

P:\test>414916 -GLEN=100 -TRIES=100000 junk
line 1 : 9910
line 10 : 11195
line 2 : 9826
line 3 : 9830
line 4 : 9851
line 5 : 9772
line 6 : 9965
line 7 : 9854
line 8 : 9991
line 9 : 9806

And this shows that you get bad results if you underestimate the maximum length:

P:\test>414916 -GLEN=7 -TRIES=100000 junk
ine 1 : 1239
ine 10 : 1199
ine 2 : 1273
ine 3 : 1265
ine 4 : 1233
ine 5 : 1223
ine 6 : 1166
ine 7 : 1245
ine 8 : 1268
ine 9 : 1251
line 1 : 8594
line 10 : 8575
line 2 : 8630
line 3 : 8647
line 4 : 8534
line 5 : 8666
line 6 : 8768
line 7 : 8714
line 8 : 8581
line 9 : 8683
ne 10 : 1246


Re^2: Approach to efficiently choose a random line from a large file
by sgifford (Prior) on Dec 12, 2004 at 08:36 UTC
    By picking a random byte position and then using the line that contains that byte as your pick, the bias for or against each line is proportional to its length versus the average length of the lines in the file.
    An easy way to visualize this is to imagine line 1 with 1 character, and line 2 with 99 characters. Ignoring line endings, you'll get the first line 1% of the time, and the second line 99% of the time.