Dear Monks,
I am trying to randomly select 20 million lines from a file of 40 million lines.
I have two questions.
First, how do I select the 20 million lines from the file without bias? I tried to do it in Perl, but Perl did not seem well suited to this, so I used R to generate 20 million line indices, wrote them to an index file (rand_sorted.txt), and use that file to print the selected lines.
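For what it's worth, one way to do the selection in Perl itself is selection sampling: walk the line numbers once and keep each one with probability (still needed) / (still remaining). The sketch below is only an illustration of that idea, not something I have run on the real 40-million-line file; the name sample_line_numbers is made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Selection sampling: pick exactly $want of the numbers 1..$total,
# with every subset of that size equally likely.
# Returns the picked numbers in ascending order.
# (sample_line_numbers is a hypothetical name used only in this sketch.)
sub sample_line_numbers {
    my ($want, $total) = @_;
    my @picked;
    for my $n (1 .. $total) {
        last if $want == 0;
        my $remaining = $total - $n + 1;    # numbers not yet decided, including $n
        if (rand($remaining) < $want) {     # true with probability $want / $remaining
            push @picked, $n;
            $want--;
        }
    }
    return @picked;
}

# Small demonstration: choose 5 of 10 line numbers.
my @demo = sample_line_numbers(5, 10);
print "@demo\n";
```

The same test, rand($total - $. + 1) < $want, could be applied directly inside a while (<>) loop so that the 20 million indices never have to be stored, provided the total line count is known in advance.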
Second, my code is as below:
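#!/usr/bin/perl -w
use strict;

# rand_sorted.txt holds the selected line numbers, one per line, in ascending order.
open RND, '<', "rand_sorted.txt" or die "no file found";
my @rnd = <RND>;
chomp @rnd;
close RND;

WILLOP: while (<>) {
    chomp;
    FORLOP: foreach my $i (@rnd) {
        if ($i != $rnd[$#rnd]) {
            # More indices remain: print on a match, then go to the next input line.
            if ($. == $i) {
                print "$_\n";
                shift @rnd;
            }
            last FORLOP;
        }
        else {
            # $i is the last index: print its line and stop reading.
            if ($. == $i) {
                print "$_\n";
                last WILLOP;
            }
            last FORLOP;
        }
    }
}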
This works fine and fast for me, but I hate having these two loops nested inside each other; it looks stupid, and maybe it is. Can anyone give me an idea of how to do this better? Thanks.
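The kind of thing I have in mind is a single loop that keeps a position into the index list instead of looping over it. A rough sketch, assuming the indices are in ascending order with no duplicates; print_selected is a made-up name, and the in-memory filehandles are there only to keep the example self-contained:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Copy from $in to $out only the lines whose numbers are listed in @$wanted.
# Assumes @$wanted is ascending and free of duplicates.
# (print_selected is a hypothetical helper for this sketch.)
sub print_selected {
    my ($in, $out, $wanted) = @_;
    my $next = 0;                          # position of the next wanted line number
    while (my $line = <$in>) {
        last if $next > $#$wanted;         # every wanted line has been printed
        if ($. == $wanted->[$next]) {
            print {$out} $line;
            $next++;
        }
    }
    return $next;                          # number of lines printed
}

# Demonstration on a four-line in-memory "file".
my $text = "a\nb\nc\nd\n";
open my $in, '<', \$text or die "cannot open in-memory input: $!";
my $buf = '';
open my $out, '>', \$buf or die "cannot open in-memory output: $!";
print_selected($in, $out, [2, 4]);
close $out;
print $buf;    # the lines "b" and "d"
```

With the real data, $in would be the 40-million-line file and $wanted the contents of rand_sorted.txt.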
In reply to A series of random number and others by lightoverhead