in reply to Random sampling a variable record-length file.
Yes, it is indeed called a "random sample". You've got the right keyword, so a search would have turned up this excellent past thread: improving the efficiency of a script.
(Short answer, because my reply there isn't clear: if you need a sample of k records, take an array holding k records and initialize it with the first k records of your file. Then read the rest of the file sequentially, and for each record, if its one-based index in the file is n, roll a die with n sides; if it lands on one of the first k sides, replace the array element at that side's position with the record.)
Update: sorry, I originally said the zero-based index here, which is wrong: the die must have as many sides as the one-based index of the record in the file.
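To see why the die needs n sides for the record with one-based index n, here is a short induction sketch. It only shows that each record ends up in the array with probability k/n after n records have been read; the notation (R_n for the array contents after n records) is introduced here for illustration.

```latex
\begin{align*}
\text{Base } (n = k):\quad & P(i \in R_k) = 1 = \tfrac{k}{k}
  \quad \text{for every } i \le k. \\
\text{New record } n > k:\quad & P(n \in R_n) = \tfrac{k}{n}
  \quad \text{(the die lands on one of the first $k$ of $n$ sides).} \\
\text{Older record } i < n:\quad & P(i \in R_n)
  = P(i \in R_{n-1}) \cdot \Bigl(1 - \tfrac{k}{n} \cdot \tfrac{1}{k}\Bigr)
  = \tfrac{k}{n-1} \cdot \tfrac{n-1}{n}
  = \tfrac{k}{n}.
\end{align*}
```

With a die of n-1 sides instead (the zero-based index), the first record after the initial k would be accepted with probability k/k = 1, so the result would be biased toward later records.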
To make this clearer, here's some code. Records are one per line, and the first command-line argument is the number of samples you need. I assumed throughout this node that you want samples without repetition and that the order of the samples doesn't matter. (Before you ask, yes, I do know about $. and even use it sometimes.)
perl -we 'my $k = int(shift); my @a; my $l = 0; while (<>) { if ($l++ < $k) { push @a, $_ } else { if ((my $j = rand($l)) < $k) { $a[$j] = $_; } } } print @a;' 3 filename
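For anyone who finds the one-liner dense, the same logic can be spelled out as a minimal sketch. The sub name reservoir_sample and the in-memory demo data are assumptions made for illustration, not part of the one-liner; the sketch also calls int explicitly, where the one-liner relies on Perl truncating a fractional array index.

```perl
use strict;
use warnings;

# Keep a uniform random sample of $k records from a filehandle
# whose length is not known in advance.
# (reservoir_sample is a hypothetical helper name.)
sub reservoir_sample {
    my ($k, $fh) = @_;
    my @sample;
    my $n = 0;                            # one-based index of the current record
    while (my $record = <$fh>) {
        $n++;
        if ($n <= $k) {
            push @sample, $record;        # the first k records fill the array
        }
        else {
            my $side = int rand $n;       # roll an n-sided die: 0 .. n-1
            $sample[$side] = $record      # one of the first k sides: replace
                if $side < $k;
        }
    }
    return @sample;
}

# Demo on an in-memory file (assumed data); the output varies from run to run.
my $data = join '', map { "$_\n" } qw(one two three four five six);
open my $fh, '<', \$data or die "cannot open in-memory file: $!";
print reservoir_sample(3, $fh);
```

To sample a real file, open it and pass the handle instead, for example open my $fh, '<', 'filename'.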
Update: It's easy to make an error in these kinds of things, so you have to test them. The run below shows that you get all 20 possible choices of 3 out of 6 with approximately equal frequency, so we can hope it's a truly uniform random choice.
$ cat a
one
two
three
four
five
six
$ (for x in {1..33333}; do perl -we 'my $k = int(shift); my @a; my $l = 0; while (<>) { if ($l++ < $k) { push @a, $_ } else { if ((my $j = rand($l)) < $k) { $a[$j] = $_; } } } print @a;' 3 a | sort | tr \\n \ ; echo; done) | sort | uniq -c | sort -rn
   1747 five four three
   1736 five one six
   1735 five three two
   1725 four three two
   1707 one six three
   1695 five four two
   1685 five six three
   1684 five six two
   1678 one three two
   1666 five four six
   1663 four six two
   1663 four one six
   1663 five four one
   1645 four one two
   1640 four six three
   1637 five one three
   1616 five one two
   1592 six three two
   1578 one six two
   1578 four one three
$
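As a rough yardstick for "approximately equal": with 33333 trials and 20 combinations, each count should be near 33333/20 ≈ 1667, with a binomial standard deviation of about sqrt(33333 × 0.05 × 0.95) ≈ 40. The counts above range from 1578 to 1747, so all are within roughly 2.2 standard deviations. The same check can also be run inside one Perl process instead of starting perl 33333 times; the following is a sketch under that assumption, with sample_records as a hypothetical array-based variant of the one-liner.

```perl
use strict;
use warnings;

# Same replacement rule as the one-liner, but sampling from a list
# so that many trials can run in a single process.
# (sample_records is a hypothetical helper name.)
sub sample_records {
    my ($k, @records) = @_;
    my @sample;
    my $l = 0;
    for my $r (@records) {
        if ($l++ < $k) {
            push @sample, $r;
        }
        elsif ((my $j = int rand $l) < $k) {
            $sample[$j] = $r;
        }
    }
    return @sample;
}

my @records = qw(one two three four five six);
my $trials  = 33333;
my %count;
for (1 .. $trials) {
    my @s = sample_records(3, @records);
    $count{ join ' ', sort @s }++;        # order within a sample doesn't matter
}

# Print the combinations, most frequent first, in the style of uniq -c.
printf "%7d %s\n", $count{$_}, $_
    for sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
```

If the sampler is uniform, this should print 20 lines whose counts all sit near 1667; the exact numbers differ on every run.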