Yes, it's called "random sample" indeed, you've got the right keyword so you just have had to search and you'd have found this excellent past thread: improving the efficiency of a script.

(Short answer, because my reply there isn't clear: if you need a sample of k records, take an array holding k records, initialize it with the first k records of your file; then reading the rest of the file sequentially, and for each record, if its (zero-based one-based) index in the file is n, roll a dice of n sides, and if it lands on one of the firs k sides, replace the element of that index in your array with that record.)

Update: sorry, above procedure is wrong, you've got to take a dice whose number of sides is the one-based index of the record in the file.

To make this clearer, here's some code. Records are one per line, first command line argument is number of samples you need. I assumed throughout this node that you want samples without repetition and that the order of samples don't matter. (Before you ask, yes, I do know about $. and even use it sometimes.)

perl -we 'my $k = int(shift); my @a; my $l = 0; while (<>) { if ($l++ +< $k) { push @a, $_ } else { if ((my $j = rand($l)) < $k) { $a[$j] = +$_; } } } print @a;' 3 filename

Update: It's easy to make an error in these kinds of things, so you have to test them. Below shows that you get all 20 possible choices of 3 out of 6 with approximately equal frequency, so we can hope it's a truly uniform random choice.

$ cat a one two three four five six $ (for x in {1..33333}; do perl -we 'my $k = int(shift); my @a; my $l += 0; while (<>) { if ($l++ < $k) { push @a, $_ } else { if ((my $j = +rand($l)) < $k) { $a[$j] = $_; } } } print @a;' 3 a | sort | tr \\n \ + ; echo; done) | sort | uniq -c | sort -rn 1747 five four three 1736 five one six 1735 five three two 1725 four three two 1707 one six three 1695 five four two 1685 five six three 1684 five six two 1678 one three two 1666 five four six 1663 four six two 1663 four one six 1663 five four one 1645 four one two 1640 four six three 1637 five one three 1616 five one two 1592 six three two 1578 one six two 1578 four one three $

In reply to Re: Random sampling a variable length file. by ambrus
in thread Random sampling a variable record-length file. by BrowserUk

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.