in reply to Random entry from combined data set

The FAQ How do I select a random line from a file? answers the largest part of your question. In fact, you can use the code they give if you just set @ARGV to the list of files to read. (Thanks to how $. works with <>.) And you don't hog memory.

You can probably make your script run faster if you create (in advance) and use in your script some data about the files that allows you to choose a line without going through every signle one and possibly seek to a specific line quicker.

Replies are listed 'Best First'.
Re: Re: Random entry from combined data set
by John M. Dlugosz (Monsignor) on Jul 04, 2001 at 08:09 UTC
    Cute. Does that really give the same distribution as knowing the number of lines in advance?

    Update: The definitive answer

    use strict; use warnings; sub trial($) { my $result; foreach (1..shift) { $result= $_ if rand($_) < 1; } return $result; } ############ my $trials= shift || 1000; my $size= shift || 20; my %results; for (1..$trials) { ++$results{trial($size)}; } # show the results for (1..$size) { printf "%5d: %5d\n", $_, ($results{$_} || 0); }
      Yes. Say we go over n lines in that algorithm. The last line has a 1 in n chance (probablity we want) of being selected. If it isn't selected, then we use the line that was previously selected, which was either the one before which had a 1 in n-1 chance, correct for selecting from the n-1 remaining values. If the n-1th line wasn't choosen then the line before that had a correct 1 in n-2 chance of being choosen for selecting from the n-2 remaining values. And so on and so on, till you get to 1, which will still be choosen if all others weren't.

      I hope this is understandable.

      update: To attempt to clarify/summerize/whatever after seeing sierrathedog04's response: At any one point in the algorithm, the chance of that line being choosen is correct for if the algorithm ended there, and the chance for the line choosen before staying choosen is correct for if the algorithm ended there (1 out of n, and n-1 out n, respectively.) So though earlier lines are choosen with more liklihood at that point, they may be overridden by the later choices and everything turns out alright.

        As you describe it, lines at the start of the file are more likely to be chosen than are lines at the end of the file. So the answer is no, the distribution of samples collected using the algorithm that you describe will be biased and not equal to the underlying population distribution.
      Do you know the number of lines in advance?
      (if so, it may be more efficient to randomly select a file weighted by the number of lines in each, then randomly select a line from that file)
        Only if the lines are uniformly long, or you have an index to the beginning of each line. The "select a random starting byte" will unfairly bias the lines based on their size (or the size of the line ahead or behind them depending on the algorithm).

        -- Randal L. Schwartz, Perl hacker