It's definitely a tradeoff of memory vs. CPU. For the actual work problem that inspired this, I used an offset index that could be reused under mod_perl. Still, this solution might be useful if your file has so many lines that even the index is too large, or if you're simply looking for Another Way To Do It.
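For comparison, here's a rough sketch of that offset-index approach (not the actual code I used, just an illustration): make one pass to record where each line starts, then seek straight to randomly chosen lines on later requests.

use strict;
use warnings;
use List::Util qw(shuffle);

my $file = shift @ARGV or die "usage: $0 file [sample_size]\n";
my $size = shift @ARGV || 3;

# pass 1: record the byte offset where each line starts
open my $fh, '<', $file or die "can't open '$file': $!";
my @offset = (0);
push @offset, tell $fh while <$fh>;
pop @offset;    # final entry is EOF, not the start of a line

die "file has fewer than $size lines\n" if @offset < $size;

# later (e.g. per request under mod_perl): seek straight to $size distinct lines
for my $i ( ( shuffle 0 .. $#offset )[ 0 .. $size - 1 ] ) {
    seek $fh, $offset[$i], 0 or die "seek failed: $!";
    print scalar <$fh>;
}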
Basically, it's the same idea as the Cookbook one. For each line you pass, the probability that you pick it up and assign it to a slot is the sample size divided by the number of lines seen so far. If you do pick it up, you need to decide which element you currently hold to throw out, so you pick a random offset in the range of the sample size.
I think this works out mathematically (check, anyone?), and it only needs special-case handling for the first N lines, all of which you have to keep. I thought of just reading in N lines to begin with, but that loses some randomness in the ordering, which would have to be restored with a shuffle at the end. I decided to go with some extra handling in the algorithm instead, but it adds overhead, so I'm still considering the alternative.
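For anyone who wants to check the math, here's the usual reservoir-sampling argument as I understand it. Suppose that after line k (k >= N) every line seen so far sits in the sample with probability N/k. Line k+1 gets picked with probability N/(k+1), and when it is picked it evicts any given slot with probability 1/N, so a line already in the sample survives the step with probability

    1 - (N/(k+1)) * (1/N) = k/(k+1)

which makes its overall probability (N/k) * (k/(k+1)) = N/(k+1), the same as the new line's. By induction, every line ends up in the final sample with probability N divided by the total number of lines.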
use strict;

my $size  = shift @ARGV || 3;    # sample size
my @sample;
my $taken = 0;                   # for making sure we get the first $size lines

while ( my $line = <> ) {
    chomp $line;

    # keep this line with probability $size / lines-seen-so-far
    # (always true for the first $size lines, since $size/$. >= 1)
    if ( rand(1) < ( $size / $. ) ) {
        my $position;
        do {
            $position = int rand($size);
        } while ( $taken < $size && defined $sample[$position] );
        $sample[$position] = $line;
        $taken++;
    }
}

for ( my $i = 0; $i < @sample; $i++ ) {
    print "[$i] '$sample[$i]'\n";
}
print "\n";
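And here's a rough sketch of the alternative I mentioned, just as an illustration: prime the sample with the first $size lines (which removes the $taken bookkeeping), then shuffle once at the end to restore the randomness of the ordering.

use strict;
use List::Util qw(shuffle);

my $size = shift @ARGV || 3;    # sample size
my @sample;

while ( my $line = <> ) {
    chomp $line;
    if ( @sample < $size ) {
        # just keep the first $size lines as-is
        push @sample, $line;
    }
    elsif ( rand(1) < ( $size / $. ) ) {
        # same pick-and-replace step as above
        $sample[ int rand $size ] = $line;
    }
}

# the first $size lines went in file order, so shuffle once at the end
@sample = shuffle @sample;

for my $i ( 0 .. $#sample ) {
    print "[$i] '$sample[$i]'\n";
}
print "\n";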