Re: Approach to efficiently choose a random line from a large file

Kuth, in The Art of Computer Programming (Vol II p144 3rd ed.) gives a simple algorithm for selection lines from a file when the total number of lines isn't known ahead of time. How do I pick a random line from a file? seems to be a variation of that idea. It might be worth looking at again.

If you know the number of lines in the file ahead of time, you could try something like this

    my $t = 0;   # nr of lines already considered
    my $N = 10;  # nr of lines in file

    do{
        my $line = <>;
    }until(($N-$t++)*rand < 1);

    print $line;
[download]

This assumes you are selecting only one line (there is a simple mod if you want more than one, say m, lines) . You step through the file one line at a time and decide whether to select the current line. If not, you move on to the next one and so on. On average, you will only have to read (N+1)n/(n+1) lines before selecting one. If your files contain 10 million records, you would, on average only read about 5 million lines before selecting one. How do I pick a random line from a file? requires a complete pass through all lines each time. Depending on your setup or requirements, the savings might be worth it.

Update: included the incrementor for $t

PJ
use strict; use warnings; use diagnostics;

Comment on Re: Approach to efficiently choose a random line from a large file Download Code