Updated: Changed 'minimum' to 'maximum' as appropriate.

By picking a random byte position and then using the line that contains that byte as your pick, the bias for or against each line is proportional to it's length versus the average length of the lines in the file.

For a fixed record length file they all have an even chance. If your file contains all the asciized numbers from 1 to 10_000_000, then high values have a greater chance than the lower numbers, regardless of where they are in the file. The lines in many variable-length record datafiles are roughly the same length within some tolorance, so it's not a very bad method of picking if you don't require statistically random picks.

If your file contains variable width chars, then those containing larger numbers of 'big' chars get another bias over those containing only small ones. seek/tell operate at the byte level as to do otherwise would make them intolorably slow.

If you can live with those biases, then the problem becomes how to read the line containing the randomly chosen byte efficiently. For the picking to be reasonably fair, your lines need to be roughly the same length. If you can make a reasonable guess as to the maximum length of any line, then you can minimize the amount of seeking and reading you need to do by:

  1. Pick a random byte.
  2. Subtract your maximum guess.
  3. seek to that position.
  4. Read a line from that position.
  5. Read more lines until you've read past the chosen position.

You now have the line containing the randomly chosen byte.

Utilising the knowledge that seek will silently do nothing if the absolute position sought is negative, you can implement that (where $GLEN is you guess at maximum length), as:

open my $fh, '<', $file or die "$file : $!"; my $p = int( 1 + rand -s $fh ); seek $fh, $p - $GLEN, 0; chomp( my $pick = <$fh> ); chomp( $pick = <$fh> ) while tell( $fh ) < $p;

The more accurate your guess, the less reading required. If the guess is conservatively long, then you will read a few extra lines before finding the required one. Your guess must never be shorter!

The following does a crude 'count the lines picked' analysis:

#! perl -slw use strict; our $GLEN ||= 10; our $TRIES ||= 1000; sub randLine { open my $fh, '<', $_[ 0 ] or die "$_[ 0 ] : $!"; my $p = int( 1 + rand -s $fh ); seek $fh, $p - $GLEN, 0; chomp( my $pick = <$fh> ); chomp( $pick = <$fh> ) while tell( $fh ) < $p; return $pick; } my %stats; for ( 1 .. $TRIES ) { $stats{ randLine( $ARGV[ 0 ] ) }++ ; } print "$_ : $stats{ $_ }" for sort keys %stats;

The following sample runs show that over-estimating the maximum length makes no difference to the statistical outcome, and highlights the bias towards longer lines (though over-estimates obviously make it less efficient):

P:\test>414916 -GLEN=10 -TRIES=100000 junk line 1 : 9964 line 10 : 11168 line 2 : 9862 line 3 : 9999 line 4 : 9889 line 5 : 9853 line 6 : 9749 line 7 : 9963 line 8 : 9751 line 9 : 9802 P:\test>414916 -GLEN=100 -TRIES=100000 junk line 1 : 9910 line 10 : 11195 line 2 : 9826 line 3 : 9830 line 4 : 9851 line 5 : 9772 line 6 : 9965 line 7 : 9854 line 8 : 9991 line 9 : 9806

And this show that you get bad results if you underestimate the maximum length:

P:\test>414916 -GLEN=7 -TRIES=100000 junk ine 1 : 1239 ine 10 : 1199 ine 2 : 1273 ine 3 : 1265 ine 4 : 1233 ine 5 : 1223 ine 6 : 1166 ine 7 : 1245 ine 8 : 1268 ine 9 : 1251 line 1 : 8594 line 10 : 8575 line 2 : 8630 line 3 : 8647 line 4 : 8534 line 5 : 8666 line 6 : 8768 line 7 : 8714 line 8 : 8581 line 9 : 8683 ne 10 : 1246

Examine what is said, not who speaks.        The end of an era!
"But you should never overestimate the ingenuity of the sceptics to come up with a counter-argument." -Myles Allen
"Think for yourself!" - Abigail        "Time is a poor substitute for thought"--theorbtwo         "Efficiency is intelligent laziness." -David Dunham
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

In reply to Re: Approach to efficiently choose a random line from a large file by BrowserUk
in thread Approach to efficiently choose a random line from a large file by Your Mother

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.