Re^2: Approach to efficiently choose a random line from a large file

...my choice would be to create an array of random numbers...

But if you don't know how many lines are in the file, in what range do you generate your random numbers?

If you do know how many records are in the file, and and you randomly pick the last line as one of your choices, you now have to read the whole file to get your choices.

You get the same problem with using Tie::File. If you randomly pick the last line, you have to read the whole 10 million lines, in order to get your chosen line.

The same problem occurs with Knuth's algorithm and the FAQ variation. On average, you are having to read 5 million lines from the file to get one random choice. If you need two random lines, you (on average) will have to read 10 millions lines.

Using the algorithm described by the OP, then he should have to, atmost, read 3 or 4 lines for each choice.

Given a suitable guestimate for the maximum length of line, using the implementation I described above, that should be reduceable to 1 or 2 lines.

The concern is that the variability in line length will unduly bias the choice of lines made, but providing the lines are roughly the same length, and the requirement is to only choose a few lines each time, the effect of this bias will be indistinguishable from natural randomness.

For sake of discussion lets assume the lines average 40-chars: 10_000_000 x (ave. 40) = 400_000_000

Lets say the file consisted of half 20-char lines and half 60-char lines for an average of 40 chars.

The odds of picking a line with 20 chars is: 20 * 100 / 400_000_000 = 0.000005%

The odds of picking a line with 60 chars is: 60 * 100 / 400_000_000 = 0.000015%

Now assume that lengths are evenly spread over a range between 30 and 50.

The odds of picking a given line range, by it's length, from:

476190 x 30 = 0.0000075  %
476191 x 31 = 0.00000775 %
476190 x 32 = 0.000008   %
476191 x 33 = 0.00000825 %
476190 x 34 = 0.0000085  %
476191 x 35 = 0.00000875 %
476190 x 36 = 0.000009   %
476191 x 37 = 0.00000925 %
476190 x 38 = 0.0000095  %
476191 x 39 = 0.00000975 %
476190 x 40 = 0.00001    %
476191 x 41 = 0.00001025 %
476190 x 42 = 0.0000105  %
476191 x 43 = 0.00001075 %
476190 x 44 = 0.000011   %
476191 x 45 = 0.00001125 %
476190 x 46 = 0.0000115  %
476191 x 47 = 0.00001175 %
476190 x 48 = 0.000012   %
476191 x 49 = 0.00001225 %
476190 x 50 = 0.0000125  %
[download]

As you can see, the odds of picking any single given line, regardless of it's length, are miniscule.

If you are only picking one, or a few of those lines for a given run, then the small variations in the odds of picking the shorter lines relative to the longer lines is so insignificant, that biases inherent in most pseudo-random number generators are more likely to influence the outcomes, than those variations.

And if you are using a truely random source of random numbers, then it's natural variations will likewise dominate the choices made until you have made several billion picks.

Examine what is said, not who speaks. The end of an era!

"But you should never overestimate the ingenuity of the sceptics to come up with a counter-argument." -Myles Allen
"Think for yourself!" - Abigail "Time is a poor substitute for thought"--theorbtwo "Efficiency is intelligent laziness." -David Dunham
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

Comment on Re^2: Approach to efficiently choose a random line from a large file Download Code