in reply to Re: Random entry from combined data set
in thread Random entry from combined data set

Cute. Does that really give the same distribution as knowing the number of lines in advance?

Update: The definitive answer

use strict; use warnings; sub trial($) { my $result; foreach (1..shift) { $result= $_ if rand($_) < 1; } return $result; } ############ my $trials= shift || 1000; my $size= shift || 20; my %results; for (1..$trials) { ++$results{trial($size)}; } # show the results for (1..$size) { printf "%5d: %5d\n", $_, ($results{$_} || 0); }

Replies are listed 'Best First'.
Re: Re: Re: Random entry from combined data set
by wog (Curate) on Jul 04, 2001 at 08:53 UTC
    Yes. Say we go over n lines in that algorithm. The last line has a 1 in n chance (probablity we want) of being selected. If it isn't selected, then we use the line that was previously selected, which was either the one before which had a 1 in n-1 chance, correct for selecting from the n-1 remaining values. If the n-1th line wasn't choosen then the line before that had a correct 1 in n-2 chance of being choosen for selecting from the n-2 remaining values. And so on and so on, till you get to 1, which will still be choosen if all others weren't.

    I hope this is understandable.

    update: To attempt to clarify/summerize/whatever after seeing sierrathedog04's response: At any one point in the algorithm, the chance of that line being choosen is correct for if the algorithm ended there, and the chance for the line choosen before staying choosen is correct for if the algorithm ended there (1 out of n, and n-1 out n, respectively.) So though earlier lines are choosen with more liklihood at that point, they may be overridden by the later choices and everything turns out alright.

      As you describe it, lines at the start of the file are more likely to be chosen than are lines at the end of the file. So the answer is no, the distribution of samples collected using the algorithm that you describe will be biased and not equal to the underlying population distribution.
        That's what I'm thinking. Although induction seems to say it works out, I think the selection of the nth line will not equal all the previous choices combined.

        Update: It does. See code in my original reply to this thread. I've not run any statistics on the output, but it looks casually like it is uniform, not obviously biased toward one end or the other.

        —John

Re: Re: Re: Random entry from combined data set
by I0 (Priest) on Jul 04, 2001 at 10:54 UTC
    Do you know the number of lines in advance?
    (if so, it may be more efficient to randomly select a file weighted by the number of lines in each, then randomly select a line from that file)
      Only if the lines are uniformly long, or you have an index to the beginning of each line. The "select a random starting byte" will unfairly bias the lines based on their size (or the size of the line ahead or behind them depending on the algorithm).

      -- Randal L. Schwartz, Perl hacker

        I did not suggest "select a random starting byte"
        I suggested "select a random starting file" (If the number lines in each file is known in advance)
        Then "Select a random line" from that file.