Yes, it's indeed called a "random sample". You've got the right keyword, so a search would have turned up this excellent past thread: improving the efficiency of a script.

(Short answer, because my reply there isn't clear: if you need a sample of k records, take an array holding k records and initialize it with the first k records of your file. Then read the rest of the file sequentially, and for each record, if its (~~zero-based~~ one-based) index in the file is n, roll a die with n sides; if it lands on one of the first k sides, replace the array element numbered by that side with the record.)

Update: sorry, the procedure above was wrong as first posted: you have to take a die whose number of sides is the one-based index of the record in the file.

To make this clearer, here's some code. Records are one per line, and the first command-line argument is the number of samples you need. I assumed throughout this node that you want samples without repetition and that the order of the samples doesn't matter. (Before you ask, yes, I do know about $. and even use it sometimes.)

perl -we 'my $k = int(shift); my @a; my $l = 0; while (<>) { if ($l++ < $k) { push @a, $_ } else { if ((my $j = rand($l)) < $k) { $a[$j] = $_; } } } print @a;' 3 filename
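
For readers who find the one-liner dense, here is the same selection rule spelled out as a sub. This is only a sketch: the name reservoir_sample and its (k, filehandle) interface are made up for illustration, and the demo reads from an in-memory file rather than from a file named on the command line.

```perl
use strict;
use warnings;

# reservoir_sample($k, $fh): return an array ref holding up to $k lines
# read from $fh. Name and interface are hypothetical, for illustration.
sub reservoir_sample {
    my ($k, $fh) = @_;
    my @sample;
    my $n = 0;                       # one-based index of the current record
    while (my $line = <$fh>) {
        $n++;
        if ($n <= $k) {
            push @sample, $line;     # the first $k records fill the array
        }
        else {
            my $j = int rand $n;     # a die with $n sides, numbered 0 .. $n-1
            $sample[$j] = $line if $j < $k;
        }
    }
    return \@sample;
}

# Demo on an in-memory file holding six one-line records.
my $text = join '', map { "$_\n" } qw(one two three four five six);
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
print @{ reservoir_sample(3, $fh) };
close $fh;
```

To read the files named on the command line as the one-liner does, passing \*ARGV as the handle should behave like <>.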

Update: It's easy to make an error in these kinds of things, so you have to test them. The run below shows that you get all 20 possible choices of 3 out of 6 with approximately equal frequency (33333 runs over 20 choices means about 1667 each if the choice is uniform), so we can hope it's a truly uniform random choice.

$ cat a
one
two
three
four
five
six
$ (for x in {1..33333}; do perl -we 'my $k = int(shift); my @a; my $l = 0; while (<>) { if ($l++ < $k) { push @a, $_ } else { if ((my $j = rand($l)) < $k) { $a[$j] = $_; } } } print @a;' 3 a | sort | tr \\n \  ; echo; done) | sort | uniq -c | sort -rn
   1747 five four three 
   1736 five one six 
   1735 five three two 
   1725 four three two 
   1707 one six three 
   1695 five four two 
   1685 five six three 
   1684 five six two 
   1678 one three two 
   1666 five four six 
   1663 four six two 
   1663 four one six 
   1663 five four one 
   1645 four one two 
   1640 four six three 
   1637 five one three 
   1616 five one two 
   1592 six three two 
   1578 one six two 
   1578 four one three 
$
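
Starting perl 33333 times is slow, so the same frequency check can also be done inside a single process. The following is a sketch under that assumption, not the test that was actually run above: sample_list is a hypothetical helper that applies the same rule to a list instead of a file, and the counts it prints will differ from run to run.

```perl
use strict;
use warnings;

# Same selection rule as the one-liner, applied to a list of records.
# The sub name is made up for this example.
sub sample_list {
    my ($k, @records) = @_;
    my @sample;
    my $n = 0;                       # one-based index of the current record
    for my $rec (@records) {
        $n++;
        if ($n <= $k) {
            push @sample, $rec;
        }
        else {
            my $j = int rand $n;
            $sample[$j] = $rec if $j < $k;
        }
    }
    return @sample;
}

my @records = qw(one two three four five six);
my $trials  = 33333;                 # same number of runs as the shell loop
my %count;
for (1 .. $trials) {
    my @s = sample_list(3, @records);
    $count{ join ' ', sort @s }++;   # sort so that order does not matter
}

# Print the tallies, most frequent first, like "sort | uniq -c | sort -rn".
printf "%7d %s\n", $count{$_}, $_
    for sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
printf "%d distinct samples, about %.0f expected for each\n",
    scalar(keys %count), $trials / 20;
```

Eyeballing the counts only catches gross errors; a chi-squared test on the tallies would be the stricter check.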

In reply to Re: Random sampling a variable length file. by ambrus
in thread Random sampling a variable record-length file. by BrowserUk
