Re: Approach to efficiently choose a random line from a large file
by Zaxo (Archbishop) on Dec 12, 2004 at 05:10 UTC
Let's suppose you maintain a separate file (via make, maybe) which holds the line count for your data file. Call it data.count.
use Tie::File;
# data.count holds the line count as a bare number, so do() returns it
my $count = do '/path/to/data.count';
tie my @lines, 'Tie::File', '/path/to/data.dat';
print $lines[rand $count];
Tie::File is remarkably efficient at that sort of thing. You could avoid the auxiliary count file by using rand @lines as the array index.
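For the do '/path/to/data.count' line to work, data.count only needs to contain the number itself, since do FILE returns the value of the last expression in the file. A minimal sketch of a generator, reusing the assumed paths from above:
#!/usr/bin/perl
# count the lines in data.dat and write the count to data.count,
# so that do '/path/to/data.count' returns it
use strict;
use warnings;

open my $in, '<', '/path/to/data.dat' or die $!;
my $count = 0;
$count++ while <$in>;
close $in or warn $!;

open my $out, '>', '/path/to/data.count' or die $!;
print $out "$count\n";    # a bare number is a valid Perl expression
close $out or die $!;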
I don't see any bias for or against the fencepost lines in your proposal. There is a strong bias which favors long lines and punishes short ones.
UTF-8 is the same as ASCII in the ASCII range, so this code will work fine with either ASCII or UTF-8 data. You may want to open your data file first and tie to the open handle; that will let you set up the '<:utf8' mode in PerlIO.
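A minimal sketch of that (same assumed paths as above; Tie::File accepts an already-open handle in place of a filename):
use Tie::File;

# open with the UTF-8 layer first, then hand the open handle to Tie::File
open my $fh, '<:utf8', '/path/to/data.dat' or die $!;
tie my @lines, 'Tie::File', $fh or die "tie failed: $!";

my $count = do '/path/to/data.count';
print $lines[ rand $count ], "\n";    # Tie::File chomps records by default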
Don't pay any attention to advice in those links to call srand yourself. That has long been superseded in Perl.
If you're gonna keep an external file, why not put the (32-bit packed) starting offset of each line in it instead of the line count? Get a random 4-byte record, unpack it, seek to that location in the file with the lines, and bingo! Unbiased.
I thought of doing something similar with an array, but decided the memory cost was too high. That was a guess, but even your much more compact idea of a packed string will occupy 40MB for a ten-million-line data file. That's not necessarily prohibitive, but it might be so on a busy machine.
Here's a short bit to write the packed offsets to a file:
#!/usr/bin/perl
use strict;
use warnings;

open my $out, '>:raw', '/path/to/data.offsets'
    or die $!;
open my $in, '<', '/path/to/data.dat'
    or die $!;

local ( $_, $\ );                 # protect $_ and the output record separator
my ( $this, $last ) = ( 0, 0 );   # $last is the offset where the line just read began
while ( <$in> ) {
    ( $last, $this ) = ( $this, tell $in );
    print $out pack 'i', $last;   # one packed offset per line ('i' is a native int, 4 bytes here)
}
close $in  or warn $!;
close $out or warn $!;
That file can be read and used like this:
my $index = do {
    local $/;                     # slurp the whole offsets file
    open my $idx, '<:raw', '/path/to/data.offsets'
        or die $!;
    <$idx>;                       # $idx is closed automatically when it goes out of scope
};

open my $dat, '<', '/path/to/data.dat'
    or die $!;

# pick a random 4-byte record; int() keeps the offset on a 4-byte boundary
my ($offset) = unpack 'i',
    substr $index, 4 * int rand( length($index) / 4 ), 4;

seek $dat, $offset, 0 or warn $!;
print scalar <$dat> or warn $!;
close $dat or warn $!;
Re: Approach to efficiently choose a random line from a large file
by BrowserUk (Patriarch) on Dec 12, 2004 at 06:38 UTC
Updated: Changed 'minimum' to 'maximum' as appropriate.
By picking a random byte position and then using the line that contains that byte as your pick, the bias for or against each line is proportional to its length versus the average length of the lines in the file.
For a fixed-record-length file they all have an even chance. If your file contains all the ASCIIized numbers from 1 to 10_000_000, then high values (which take more digits) have a greater chance than the lower numbers, regardless of where they are in the file. The lines in many variable-length record datafiles are roughly the same length within some tolerance, so it's not a very bad method of picking if you don't require statistically random picks.
If your file contains variable-width characters, then lines containing larger numbers of 'big' characters get another bias over those containing only small ones. seek/tell operate at the byte level, as doing otherwise would make them intolerably slow.
If you can live with those biases, then the problem becomes how to read the line containing the randomly chosen byte efficiently. For the picking to be reasonably fair, your lines need to be roughly the same length. If you can make a reasonable guess as to the maximum length of any line, then you can minimize the amount of seeking and reading you need to do by:
- Pick a random byte.
- Subtract your maximum guess.
- seek to that position.
- Read a line from that position.
- Read more lines until you've read past the chosen position.
You now have the line containing the randomly chosen byte.
Utilising the knowledge that seek will silently do nothing if the absolute position sought is negative, you can implement that (where $GLEN is your guess at the maximum length) as:
open my $fh, '<', $file or die "$file : $!";
my $p = int( 1 + rand -s $fh );                  # random byte position in the file
seek $fh, $p - $GLEN, 0;                         # a negative position fails harmlessly, leaving us at 0
chomp( my $pick = <$fh> );                       # read the (possibly partial) line at the seek point
chomp( $pick = <$fh> ) while tell( $fh ) < $p;   # keep reading until we pass byte $p
The more accurate your guess, the less reading is required. If the guess is conservatively long, you will read a few extra lines before finding the required one. Your guess must never be shorter than the longest line, though!
The following does a crude 'count the lines picked' analysis:
#! perl -slw
use strict;

our $GLEN  ||= 10;      # guessed maximum line length (override with -GLEN=nn)
our $TRIES ||= 1000;    # number of picks to make     (override with -TRIES=nn)

sub randLine {
    open my $fh, '<', $_[ 0 ] or die "$_[ 0 ] : $!";
    my $p = int( 1 + rand -s $fh );
    seek $fh, $p - $GLEN, 0;
    chomp( my $pick = <$fh> );
    chomp( $pick = <$fh> ) while tell( $fh ) < $p;
    return $pick;
}

my %stats;
for ( 1 .. $TRIES ) {
    $stats{ randLine( $ARGV[ 0 ] ) }++;
}
print "$_ : $stats{ $_ }" for sort keys %stats;
The following sample runs show that over-estimating the maximum length makes no difference to the statistical outcome, and highlights the bias towards longer lines (though over-estimates obviously make it less efficient):
"But you should never overestimate the ingenuity of the sceptics to come up with a counter-argument." -Myles Allen
"Think for yourself!" - Abigail
"Time is a poor substitute for thought"--theorbtwo
"Efficiency is intelligent laziness." -David Dunham
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
By picking a random byte position and then using the line that contains that byte as your pick, the bias for or against each line is proportional to its length versus the average length of the lines in the file.
An easy way to visualize this is to imagine line 1 with 1 character, and line 2 with 99 characters. Ignoring line endings, you'll get the first line 1% of the time, and the second line 99% of the time.
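A quick simulation of that example (hypothetical lengths, line endings ignored as above) makes the proportion easy to check:
#!/usr/bin/perl
# one 1-char line and one 99-char line; pick random byte positions and
# tally which line each position falls in
use strict;
use warnings;

my $len1  = 1;
my $len2  = 99;
my $total = $len1 + $len2;

my %hits;
for ( 1 .. 100_000 ) {
    my $pos = int rand $total;              # random byte in the "file"
    $hits{ $pos < $len1 ? 1 : 2 }++;        # which line owns that byte?
}
printf "line %d: about %.1f%%\n", $_, 100 * $hits{$_} / 100_000
    for sort keys %hits;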
Re: Approach to efficiently choose a random line from a large file
by ysth (Canon) on Dec 12, 2004 at 05:20 UTC
I can't see any seek-based approach as not having some line-length bias (even if not of the length of the chosen line). Why not just read through the whole file once?
Update: to clarify, I mean using the rand($.) < 1 && ($it = $_) while <>; approach suggested in How do I pick a random line from a file?. On a 10-million-line file (*.c from the bleadperl top-level directory catted together and repeated 100 times), this took 70 seconds.
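Spelled out as a tiny standalone script (a sketch of the same FAQ approach, reading the file or STDIN exactly once):
#!/usr/bin/perl
# line $. replaces the current pick with probability 1/$., which gives
# every line the same overall 1/N chance
use strict;
use warnings;

my $pick;
while (<>) {
    $pick = $_ if rand($.) < 1;
}
print $pick if defined $pick;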
Re: Approach to efficiently choose a random line from a large file
by Eimi Metamorphoumai (Deacon) on Dec 12, 2004 at 21:01 UTC
This came up some time ago on the QOTW list. As others mentioned, the approach you're taking is biased based on line length (a line with 80 characters is twice as likely to be chosen as one with 40 characters). There is another approach that has no bias, but also no bound on runtime, although in practice it should be pretty fast. Basically, seek to a random byte, and if that byte is a newline character, use the line immediately following it. (If it's the newline at the end of the file, use the first line instead.) If it's not, seek again. Expected time is the average length of the lines (if each line is 20 characters, you'd expect to hit a newline one out of 20 tries), but there's no guarantee you won't be very unlucky and spend forever without hitting one (for suitably large values of "forever"). But it is an unbiased way of doing it without reading the entire file.
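A minimal sketch of that rejection approach (untested; filename taken from the command line):
#!/usr/bin/perl
# seek to a random byte; if it is a newline, return the line that follows
# it (wrapping to the first line at EOF); otherwise try another byte
use strict;
use warnings;

my $file = shift or die "usage: $0 file\n";
open my $fh, '<', $file or die "$file: $!";
my $size = -s $fh;

my $pick;
until ( defined $pick ) {
    my $pos = int rand $size;
    seek $fh, $pos, 0 or die $!;
    read $fh, my $byte, 1;
    next unless $byte eq "\n";
    if ( eof $fh ) {                 # the newline at end of file: use line 1
        seek $fh, 0, 0 or die $!;
    }
    $pick = <$fh>;
}
print $pick;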
Re: Approach to efficiently choose a random line from a large file
by TedPride (Priest) on Dec 12, 2004 at 21:15 UTC
Mmm. The best thing to do is to run through your file once, getting all the line locations and writing them to a new fixed-length file. Then you can randomize on that file and easily pop your line out of the original file whenever needed. I'm assuming that with millions of lines to randomly pick from, nobody is going to notice that your line pool is only updated every x number of days.
Or you could convert the entire original file to a fixed-length file, but this requires that the lines be more or less the same length, or you're going to end up using huge amounts of disk space.
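A minimal sketch of the fixed-length idea: if every record is padded to $RECLEN bytes, a random pick is one seek plus one read. $RECLEN and the filename here are assumptions for illustration.
#!/usr/bin/perl
use strict;
use warnings;

my $RECLEN = 64;
open my $fh, '<', '/path/to/data.fixed' or die $!;

my $nrecs = int( ( -s $fh ) / $RECLEN );
seek $fh, $RECLEN * int( rand $nrecs ), 0 or die $!;
read( $fh, my $record, $RECLEN ) == $RECLEN or die "short read";
$record =~ s/\s+$//;    # strip the padding (and any newline)
print "$record\n";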
Re: Approach to efficiently choose a random line from a large file
by periapt (Hermit) on Dec 13, 2004 at 13:47 UTC
Knuth, in The Art of Computer Programming (Vol. II, p. 144, 3rd ed.), gives a simple algorithm for selecting lines from a file when the total number of lines isn't known ahead of time. How do I pick a random line from a file? seems to be a variation of that idea. It might be worth looking at again.
If you know the number of lines in the file ahead of time, you could try something like this
my $t = 0;      # nr of lines already considered
my $N = 10;     # nr of lines in file
my $line;
do {
    $line = <>;
} until ( ( $N - $t++ ) * rand() < 1 );
print $line;
This assumes you are selecting only one line (there is a simple mod, sketched below, if you want more than one, say m, lines). You step through the file one line at a time and decide whether to select the current line. If not, you move on to the next one, and so on. On average, you will only have to read (N+1)n/(n+1) lines before selecting one. If your files contain 10 million records, you would, on average, only read about 5 million lines before selecting one. How do I pick a random line from a file? requires a complete pass through all lines each time. Depending on your setup or requirements, the savings might be worth it.
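A sketch of that "more than one" mod (Knuth's selection sampling, Algorithm S): pick $m of the $N lines in a single pass, still assuming $N is known in advance. $N and $m here are illustrative values.
#!/usr/bin/perl
use strict;
use warnings;

my $N    = 10;     # total number of lines in the file (assumed known)
my $m    = 3;      # how many lines to select
my $seen = 0;      # lines considered so far
my @picks;

while ( my $line = <> ) {
    # select this line with probability (still wanted) / (still remaining)
    if ( ( $N - $seen ) * rand() < ( $m - @picks ) ) {
        push @picks, $line;
        last if @picks == $m;
    }
    $seen++;
}
print @picks;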
Update: included the incrementor for $t
PJ
use strict; use warnings; use diagnostics;
Re: Approach to efficiently choose a random line from a large file
by johndageek (Hermit) on Dec 13, 2004 at 17:50 UTC
Perhaps I do not understand something, but if you know how many records you will want from the file, my choice would be to create an array of random numbers, then sort the array. Then read the file sequentially, selecting the records determined by the list of random numbers.
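A sketch of that idea (assuming the total line count $N is known, and allowing the small chance of duplicate picks): generate the random line numbers up front, sort them, then collect them in one sequential pass.
#!/usr/bin/perl
use strict;
use warnings;

my $N    = 10_000_000;    # total lines in the file (assumed known)
my $want = 10;            # how many lines to fetch
my @wanted = sort { $a <=> $b } map { 1 + int rand $N } 1 .. $want;

my @picked;
while (<>) {
    while ( @wanted and $wanted[0] == $. ) {
        push @picked, $_;     # duplicates in @wanted yield duplicate picks
        shift @wanted;
    }
    last unless @wanted;
}
print @picked;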
HTH
...my choice would be to create an array of random numbers...
But if you don't know how many lines are in the file, in what range do you generate your random numbers?
If you do know how many records are in the file, and you randomly pick the last line as one of your choices, you now have to read the whole file to get your choices.
You get the same problem with using Tie::File. If you randomly pick the last line, you have to read the whole 10 million lines, in order to get your chosen line.
The same problem occurs with Knuth's algorithm and the FAQ variation. On average, you have to read 5 million lines from the file to get one random choice. If you need two random lines, you will (on average) have to read 10 million lines.
Using the algorithm described by the OP, he should have to read at most 3 or 4 lines for each choice.
Given a suitable guesstimate for the maximum length of a line, using the implementation I described above, that should be reducible to 1 or 2 lines.
The concern is that the variability in line length will unduly bias the choice of lines made, but providing the lines are roughly the same length, and the requirement is to only choose a few lines each time, the effect of this bias will be indistinguishable from natural randomness.
For the sake of discussion, let's assume the lines average 40 chars: 10_000_000 x 40 (avg.) = 400_000_000 bytes.
Let's say the file consisted of half 20-char lines and half 60-char lines, for an average of 40 chars.
The odds of picking a line with 20 chars is: 20 * 100 / 400_000_000 = 0.000005%
The odds of picking a line with 60 chars is: 60 * 100 / 400_000_000 = 0.000015%
Now assume that lengths are evenly spread over a range between 30 and 50.
The odds of picking a given line range, by its length (number of lines x line length = per-line odds), from:
476190 x 30 = 0.0000075 %
476191 x 31 = 0.00000775 %
476190 x 32 = 0.000008 %
476191 x 33 = 0.00000825 %
476190 x 34 = 0.0000085 %
476191 x 35 = 0.00000875 %
476190 x 36 = 0.000009 %
476191 x 37 = 0.00000925 %
476190 x 38 = 0.0000095 %
476191 x 39 = 0.00000975 %
476190 x 40 = 0.00001 %
476191 x 41 = 0.00001025 %
476190 x 42 = 0.0000105 %
476191 x 43 = 0.00001075 %
476190 x 44 = 0.000011 %
476191 x 45 = 0.00001125 %
476190 x 46 = 0.0000115 %
476191 x 47 = 0.00001175 %
476190 x 48 = 0.000012 %
476191 x 49 = 0.00001225 %
476190 x 50 = 0.0000125 %
As you can see, the odds of picking any single given line, regardless of its length, are minuscule.
If you are only picking one, or a few, of those lines in a given run, then the small variations in the odds of picking the shorter lines relative to the longer lines are so insignificant that biases inherent in most pseudo-random number generators are more likely to influence the outcomes than those variations are.
And if you are using a truly random source of random numbers, then its natural variations will likewise dominate the choices made until you have made several billion picks.
"But you should never overestimate the ingenuity of the sceptics to come up with a counter-argument." -Myles Allen
"Think for yourself!" - Abigail
"Time is a poor substitute for thought"--theorbtwo
"Efficiency is intelligent laziness." -David Dunham
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon