use strict;
use warnings;

# One simulated run: for each value $_ from 1..n, keep $_ as the result
# with probability 1/$_ (that is, when rand($_) < 1), mimicking the
# one-pass line picker.
sub trial($)
{
    my $result;
    foreach (1..shift) {
        $result = $_ if rand($_) < 1;
    }
    return $result;
}
############
my $trials = shift || 1000;
my $size   = shift || 20;

my %results;
for (1..$trials) {
    ++$results{trial($size)};
}

# show the results: each of the $size values should come up
# roughly $trials/$size times
for (1..$size) {
    printf "%5d: %5d\n", $_, ($results{$_} || 0);
}
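For reference, here is a minimal sketch of the one-pass picker that trial() simulates when reading an actual file (the filehandle and variable names are just illustrative):

use strict;
use warnings;

# Pick one line, uniformly at random, from a file in a single pass,
# without knowing the line count ahead of time: on line number $.,
# the current line replaces the pick with probability 1/$.
my $file = shift or die "usage: $0 file\n";
open my $fh, '<', $file or die "can't open $file: $!";

my $pick;
while (<$fh>) {
    $pick = $_ if rand($.) < 1;
}
close $fh;

print $pick if defined $pick;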
Yes.
Say we go over n lines with that algorithm.
The last line has a 1 in n chance (the probability we
want) of being selected.
If it isn't selected, then we keep the line that was
selected before it. The line before had a 1 in n-1 chance,
which is correct for choosing among the n-1 remaining values.
If that (n-1)th line wasn't chosen, then the line before it
had a correct 1 in n-2 chance of being chosen among the n-2
remaining values. And so on, all the way down to line 1,
which is still chosen if none of the others were.
I hope this is understandable.
update:
To attempt to clarify/summarize after seeing sierrathedog04's
response:
At any point in the algorithm, the chance of the current line
being chosen is correct for the case where the algorithm ends there,
and the chance of the previously chosen line staying chosen
is also correct for that case (1 out of n, and
n-1 out of n, respectively). So although earlier lines are
chosen with greater likelihood at the point they are read,
they may be overridden by later choices
and everything turns out alright.
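To make that concrete, take n = 3. Line 3 is kept with probability 1/3. Line 2 ends up as the pick if it was chosen when read (1/2) and line 3 did not replace it (2/3): 1/2 * 2/3 = 1/3. Line 1 ends up as the pick if line 2 did not replace it (1/2) and line 3 did not replace it (2/3): 1 * 1/2 * 2/3 = 1/3. Every line has the same 1 in 3 chance.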
As you describe it, lines at the start of the file are more likely to be chosen than are lines at the end of the file. So the answer is no, the distribution of samples collected using the algorithm that you describe will be biased and not equal to the underlying population distribution.
Do you know the number of lines in advance?
(If so, it may be more efficient to randomly select a file weighted by the number of lines in each,
then randomly select a line from that file.)
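Something like this rough sketch, perhaps (the file names and line counts are made up for illustration, and it assumes the counts are already known):

use strict;
use warnings;

# Hypothetical example: per-file line counts are assumed to be known.
my %lines_in = ( 'a.txt' => 100, 'b.txt' => 300 );

my $total = 0;
$total += $_ for values %lines_in;

# pick a file with probability proportional to its line count
my $n = int rand $total;
my $file;
for (keys %lines_in) {
    $file = $_, last if ($n -= $lines_in{$_}) < 0;
}

# pick a line uniformly at random from the chosen file
my $target = int rand $lines_in{$file};
open my $fh, '<', $file or die "can't open $file: $!";
while (<$fh>) {
    if ($. == $target + 1) {
        print;
        last;
    }
}
close $fh;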