in reply to Re: Re: Random entry from combined data set
in thread Random entry from combined data set

Yes. Say we go over n lines in that algorithm. The last line has a 1 in n chance (probablity we want) of being selected. If it isn't selected, then we use the line that was previously selected, which was either the one before which had a 1 in n-1 chance, correct for selecting from the n-1 remaining values. If the n-1th line wasn't choosen then the line before that had a correct 1 in n-2 chance of being choosen for selecting from the n-2 remaining values. And so on and so on, till you get to 1, which will still be choosen if all others weren't.

I hope this is understandable.

update: To attempt to clarify/summerize/whatever after seeing sierrathedog04's response: At any one point in the algorithm, the chance of that line being choosen is correct for if the algorithm ended there, and the chance for the line choosen before staying choosen is correct for if the algorithm ended there (1 out of n, and n-1 out n, respectively.) So though earlier lines are choosen with more liklihood at that point, they may be overridden by the later choices and everything turns out alright.

  • Comment on Re: Re: Re: Random entry from combined data set

Replies are listed 'Best First'.
Re: Re: Re: Re: Random entry from combined data set
by sierrathedog04 (Hermit) on Jul 04, 2001 at 18:25 UTC
    As you describe it, lines at the start of the file are more likely to be chosen than are lines at the end of the file. So the answer is no, the distribution of samples collected using the algorithm that you describe will be biased and not equal to the underlying population distribution.
      That's what I'm thinking. Although induction seems to say it works out, I think the selection of the nth line will not equal all the previous choices combined.

      Update: It does. See code in my original reply to this thread. I've not run any statistics on the output, but it looks casually like it is uniform, not obviously biased toward one end or the other.

      —John