grabbing random n rows from a file

by punkish (Priest)
on Jul 18, 2006 at 21:26 UTC

punkish has asked for the wisdom of the Perl Monks concerning the following question:

I am faced with a problem that is a variation of recipe 8.6, "Picking a Random Line from a File", from the venerable Perl Cookbook (the clever use of rand($.)).
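For reference, a minimal sketch of that recipe's trick (the filename is illustrative): each line read overwrites the current pick with probability 1/$., which leaves every line equally likely after a single pass.

    use strict;
    use warnings;

    # Perl Cookbook recipe 8.6: pick one random line in a single pass.
    # Line number $. replaces the current pick with probability 1/$.
    open my $fh, '<', 'file.txt' or die "file.txt: $!";
    my $pick;
    while (<$fh>) {
        $pick = $_ if rand($.) < 1;
    }
    print $pick;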

I have a file with "n" sets of "m" rows (let's assume they are sorted by the token that makes them into a set... so, if the rows hold attributes about people, the first token in each row is the person's name, and there are "m" rows for, say, 'punkish', another "m" rows for 'paco', and so on). I want to grab "j" random rows from each set and write "n" sets of "j" rows out to another file.

I apologize that I am unable even to offer pseudocode for this. It would be trivial to do in a db, but I wouldn't mind knowing how to do this with just a file and the magic of Perl.

Oh! Did I mention that (n * m) is a very large number? We are talking about a file with around 8 million rows.

Update: on second glance, this post should really be titled grabbing random "j" rows from a file... oh well.

Update 2: (after bonking himself on the head for not providing a "compleat" problem the first time)

  • Type of file: It is a delimited (say, CSV) file
  • Are the lines fixed length?: No, but each row has the same number of fields, just like a CSV file
  • Are there a fixed number of lones per "record"? Dunno what a "lone" is.
  • Is any of this stuff indexed? It is a text file. How could it be indexed?
  • Is this something that you need to do one off (or occasionally)? Occasionally... hence the need for a program. But I would prefer not to use a database such as SQLite.
  • Does the "data base" change over time? Yes, periodically. But, for every run, it is one, immutable file.
  • If it changes can "records" be inserted? Can't change the input file.
  • Why isn't this in a real database? Well, too long to answer here. Eventually it ends up in a database, so this is just the preprocessing part, but it is preferred to not do the preprocessing in a database.
  • Does j change for each set, or do you want to print the same number of lines for each set? "j" doesn't change; however, the number of rows in a set may change. Incoming rows are supposed to be, say, 100 per set, and "j" is fixed at, say, 90, but it is possible that a set might have only 80 rows, in which case all 80 will be chosen. In other words, choose "j" random rows out of "m" if (j < m), else choose all "m" - see the sketch below.
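A minimal sketch of that last rule, with assumed sample data (here j = 90 but the set has only m = 80 rows, so all 80 are chosen):

    use strict;
    use warnings;
    use List::Util qw( shuffle );

    my $j    = 90;
    my @rows = map { "punkish,row$_\n" } 1 .. 80;   # one set's rows (assumed)

    my $take   = $j < @rows ? $j : @rows;            # min(j, m)
    my @chosen = ( shuffle @rows )[ 0 .. $take - 1 ];
    print scalar @chosen, " rows chosen\n";          # prints "80 rows chosen"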
--

when small people start casting long shadows, it is time to go to bed

Re: grabbing random n rows from a file
by japhy (Canon) on Jul 18, 2006 at 21:57 UTC
    I think this method is accurate and fair:
    open my $some_filehandle, "<", "quotefile.txt" or die $!;
    my $set_size = 3;
    my $set = random_set_of_n($some_filehandle, $set_size);

    sub random_set_of_n {
        my ($fh, $size) = @_;
        my @set;
        local ($., $_);
        seek $fh, 0, 0;

        # fill the reservoir with the first $size lines
        while (<$fh>) {
            chomp;
            push @set, $_;
            last if @set == $size;
        }

        # XXX: @set *should* be shuffled now if you care about ordering

        # each later line $. replaces a random reservoir slot
        # with probability $size/$.
        while (<$fh>) {
            chomp;
            $set[rand @set] = $_ if $size/$. > rand;
        }

        return \@set;
    }
    I think it's a fair distribution. My tests imply it is. Update: the set should be shuffled where I've indicated. It's not necessary if you're going to be plucking elements from it at random later on, though, only if you want a randomly ordered list returned.
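    One quick way to sanity-check fairness (an illustrative test, not japhy's actual one): draw a 3-element set from 10 lines 100,000 times and tally how often each line appears. A fair sampler gives every line a count near 3/10 * 100,000 = 30,000.

    use strict;
    use warnings;

    my @lines = ('a' .. 'j');
    my $size  = 3;
    my %seen;

    for (1 .. 100_000) {
        my @set;
        my $n = 0;
        for my $line (@lines) {
            $n++;
            if (@set < $size) {
                push @set, $line;                 # fill the reservoir
            }
            elsif ($size / $n > rand) {
                $set[rand @set] = $line;          # replace a random slot
            }
        }
        $seen{$_}++ for @set;
    }

    printf "%s: %d\n", $_, $seen{$_} for sort keys %seen;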

    Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
    How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart
Re: grabbing random n rows from a file
by ikegami (Patriarch) on Jul 18, 2006 at 22:48 UTC

    The fact that your sets are grouped is a great benefit. We can work with a set at a time.

    The fact that your sets are of varying length is a great hindrance. Work needs to be done to locate the end of each set.

    I made the following assumptions:

    • m (and thus j) is rather small. Specifically, keeping m lines in memory is not a problem. ( Confirmed in "Update 2". )
    • m is not the same for every set. ( Confirmed in "Update 2". )
    • You don't want the same random j lines from every set. It's a minor change if you do.
    • You don't care if the random j lines are in their original order. It's a minor change if you do.

    My solution:

    use strict;
    use warnings;
    use List::Util qw( shuffle );

    my $j = 90;

    sub extract_id {
        my ($line) = @_;
        ...              # extract the set key (e.g. the person) from $line
        return ...;
    }

    my @m;
    my $id;
    my $last_id;

    for (;;) {
        my $line = <DATA>;
        $id = extract_id($line) if defined($line);

        # flush the buffered set when the id changes or the file ends
        if (@m) {
            if (!defined($line) || $id ne $last_id) {
                my $j = $j < @m ? $j : @m;    # take all rows if the set is short
                print $m[$_] foreach (shuffle(0..$#m))[0..$j-1];
                @m = ();
            }
        }

        last if !defined($line);

        push(@m, $line);
        $last_id = $id;
    }

    Untested. (Update: Tested. Fixed.)

    Memory can be saved by storing file positions in @m instead of the actual lines, but that's not needed based on your "Update 2".
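    A rough sketch of that variant (not ikegami's code; assumes a seekable plain file): record the tell() position before each read, buffer positions instead of lines, and when a set ends, seek back to print the winners before resuming the scan.

    use strict;
    use warnings;
    use List::Util qw( shuffle );

    # While scanning, instead of push @m, $line:
    #     push @m, tell $fh;
    #     my $line = <$fh>;
    # When a set ends, print j of the buffered positions:
    sub print_random_j {
        my ($fh, $positions, $j) = @_;
        $j = @$positions if $j > @$positions;
        my $resume = tell $fh;                         # where the scan left off
        for my $pos ( ( shuffle @$positions )[ 0 .. $j - 1 ] ) {
            seek $fh, $pos, 0;
            print scalar <$fh>;
        }
        seek $fh, $resume, 0;                          # carry on with the scan
    }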

    Alternative to the shuffle/print line (note that it consumes both @m and the inner copy of $j):
    print splice(@m, rand(@m), 1) while $j--;

Re: grabbing random n rows from a file
by GrandFather (Saint) on Jul 18, 2006 at 21:46 UTC

    Are the lines fixed length? Are there a fixed number of lines per "record"? Is any of this stuff indexed? Is this something that you need to do one off (or occasionally)? Does the "data base" change over time? If it changes can "records" be inserted? Why isn't this in a real database?

    Without knowing the answers to any of the above here are a couple of approaches:

    • do an indexing pass through the file to build a hash keyed by "person" containing the file position of the start of the record and positions of lines in the record. Seek through the file pulling out randomly selected j lines from each record as required.
    • Scan through the file identifying record starts and create an array of lines within the current record. At the end of each record spit out j random lines as required from the array.

    The first variant (sketched below) is more appropriate if you want to do this multiple times without the "data base" changing between runs - keep the index. The second variant is more appropriate for one-off use, or when the contents of the "data base" are changing and maintaining an index is not feasible.
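    A minimal sketch of the first variant (the field layout is assumed: the person is the first comma-separated field, and the filename is illustrative). One pass records each line's tell() position keyed by person; then j positions per person are picked at random and the lines fetched with seek.

    use strict;
    use warnings;
    use List::Util qw( shuffle );

    my $j = 90;
    open my $fh, '<', 'input.csv' or die "input.csv: $!";

    my %index;                            # person => [ file positions ]
    my $pos = tell $fh;
    while ( my $line = <$fh> ) {
        my ($person) = split /,/, $line, 2;
        push @{ $index{$person} }, $pos;
        $pos = tell $fh;
    }

    for my $person ( sort keys %index ) {
        my @where = @{ $index{$person} };
        my $take  = $j < @where ? $j : @where;    # choose all m if m < j
        for ( ( shuffle @where )[ 0 .. $take - 1 ] ) {
            seek $fh, $_, 0;
            print scalar <$fh>;
        }
    }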

    Update: s/lones/lines/


    DWIM is Perl's answer to Gödel
Re: grabbing random n rows from a file
by Hue-Bond (Priest) on Jul 18, 2006 at 21:58 UTC

    This reads $j lines, then skips until $m is reached, then starts reading again. Prints to STDOUT, not to a new file. But that isn't your main concern, is it? ;^).

    Update: Oops, missed the "random" bit. Let's try this:

    #!/usr/bin/perl
    use warnings;
    use strict;
    use List::Util qw/shuffle/;

    my $m = 3;
    my $j = 2;

    MAIN: while (1) {
        my @set;
        for (my $i = 0; $i < $m; $i++) {
            last MAIN unless defined ($_ = <DATA>);
            push @set, $_;
        }
        for ((shuffle @set)[0 .. $j - 1]) {
            print;
        }
    }

    __DATA__
    foo 1 2 3
    foo 2 3 4
    foo 4 5 6
    bar a b c
    bar b c d
    bar c d e
    baz q w e
    baz w e r
    baz e r t

    --
    David Serrano

      According to the OP's "Update 2", m is neither known nor constant. japhy made the same incorrect assumption above.
Re: grabbing random n rows from a file
by swkronenfeld (Hermit) on Jul 18, 2006 at 21:53 UTC
    Does j change for each set, or do you want to print the same number of lines for each set? It sounds like the number of lines in each set is the same. Assuming that you want to print the same number of lines for each set, then do the following:

    • read an entire set; let $m be the number of lines read
    • let $j = int(rand($m)) + 1
    • print out $j lines of the first set that you've already read
    • now read the file line by line, buffering the input; after you've read $m lines, print the first $j lines of the buffer and throw away the remaining ($m - $j) lines: j random lines from each set. Then you can throw away your buffer and start over.

    As for the size of the file: just don't slurp the entire file into memory; read it line by line.

    Have I answered your question? I have a feeling that I'm oversimplifying it. If so, please give a few more details.

    Updated
Re: grabbing random n rows from a file
by Anonymous Monk on Jul 19, 2006 at 18:41 UTC
    #!/usr/bin/perl
    use strict;
    use warnings;

    my ($fh, $href, $aref, $group, $array);
    open($fh, "<", "/tmp/kinput") or die "/tmp/kinput: $!";
    $href = {};
    while (<$fh>) {
        $aref = [ split(/,/, $_) ];
        # group rows by their first field
        push(@{ $$href{ $$aref[0] } }, $aref);
    }
    while ( ($group, $array) = each(%$href) ) {
        # rand(@$array), not rand($#$array): the latter could
        # never pick the last row of a group
        $aref = $$array[ rand(@$array) ];
        print join(",", @$aref);
    }
    You can reduce memory consumption by working on one "set" at a time instead of building a massive hash, but you'll need to define your sets beforehand and use grep or something else to pull each set's lines into the hash. This will, of course, drive CPU usage up.
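    A sketch of that lower-memory variant (the group names are assumed known in advance; filename as in the code above): one pass per set, keeping only that set's rows in memory, trading CPU/IO for memory as noted.

    use strict;
    use warnings;

    my @groups = qw( punkish paco );    # assumed, defined beforehand
    for my $group (@groups) {
        open my $fh, '<', '/tmp/kinput' or die "/tmp/kinput: $!";
        my @rows;
        while (<$fh>) {
            push @rows, $_ if (split /,/)[0] eq $group;
        }
        close $fh;
        print $rows[ rand @rows ] if @rows;    # one random row, as above
    }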
