Re: A series of random number and others
by GrandFather (Saint) on Oct 09, 2008 at 01:25 UTC
Yep, that's pretty stupid code all right. Actually, we seem to get one or two questions a week that look like that, so don't feel too bad. The answer is generally the same: use a hash. However, in this case there is a twist: "20 million lines from 40 million" makes for a fairly big hash. There may be a smarter solution though: if you don't need exactly 20 million lines, then you can:
rand () < 0.5 and print while <>;
If you need the hash then your current code changes to something like:
open my $rndLines, '<', "rand_sorted.txt" or die "Can't open rand_sorted.txt: $!";
my %lines = map {$_ => undef} grep {chomp; length} <$rndLines>;
close $rndLines;
exists $lines{$.} and print while <>;
I find it interesting, however, that you couldn't generate the random number list using Perl. What was the issue you struck? Are you aware of Math::Random::MT, btw?
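For what it's worth, the module's functional interface is pretty much a drop-in replacement for the built-ins. A minimal sketch (the seed is arbitrary, purely for illustration):
use strict;
use warnings;
use Math::Random::MT qw(srand rand);   ## replaces the built-in rand/srand with the Mersenne Twister
srand( 42 );                           ## arbitrary seed so the example is repeatable
print int( rand 40_000_000 ), "\n" for 1 .. 5;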
Perl reduces RSI - it saves typing
Re: A series of random number and others
by blokhead (Monsignor) on Oct 09, 2008 at 01:02 UTC
I just posted a simple method to Randomly select N lines from a file, on the fly, as a snippet since I wasn't able to find it anywhere else here on the site. Perhaps I just missed it -- hope I haven't duplicated any effort. It is a generalization of the "usual" trick for choosing 1 random line from a file, on the fly.
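For context, a sketch of that kind of on-the-fly selection (reservoir sampling; not necessarily identical to the posted snippet): keep the first N lines, then let line $. overwrite a random slot with probability N/$.:
use strict;
use warnings;
my $N = shift || 10;                 ## how many lines to keep
my @picks;
while (<>) {
    if (@picks < $N) {
        push @picks, $_;             ## fill the reservoir with the first N lines
    }
    elsif (rand($.) < $N) {
        $picks[int rand $N] = $_;    ## overwrite a random slot with probability N/$.
    }
}
print @picks;
Every line ends up in the final set with equal probability, and only N lines are ever held in memory.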
Re: A series of random number and others
by BrowserUk (Patriarch) on Oct 09, 2008 at 07:20 UTC
The problem with the Knuth method is that it requires you to store 20 million lines in memory, which will take pretty much the full 2GB maximum generally available if your lines average a modest 50 chars in length. If they average longer, or your file is bigger, you've blown it.
This code:
- Generates a packed list of 40e6 integers (315MB; 40e6 * 4 * 2 bytes, because of the way Perl initialises scalars).
- Fisher-Yates (F-Y) shuffles that list in place (+0MB); about 2 minutes.
- "Sorts" the first 20e6 of the shuffled integers by building a bit vector (+5MB); about 30 seconds.
- Reads the file and prints a line if its bit is set in the bit vector.
In total, <320MB and ~2.5 minutes on my machine.
#! perl -slw
use strict;
use Math::Random::MT qw[ rand ];
our $N ||= 40e6;
our $K ||= $N / 2;
## In-place FY shuffle of a packed ULONG array
sub shuffle {
my $ref = shift;
my $n = length( $$ref ) >> 2;
for( 0 .. $n - 1 ) {
my $p = $_ + int( rand $n - $_ );
my $tmp = substr $$ref, $_*4, 4;
substr $$ref, $_*4, 4, substr $$ref, $p*4, 4;
substr $$ref, $p*4, 4, $tmp;
}
return;
}
## Allocate a packed array of $N Ulongs
my $index = \ (chr(0) x ( 4 * $N ));
## Populate it
substr $$index, $_*4, 4, pack 'V', $_ for 0 .. $N - 1;
## Shuffle in place
shuffle $index;
## Allocate a bit-vector for linear sort
my $ordered = chr(0) x int( ( $N +7 ) / 8 );
## Linear sort
vec( $ordered, unpack( 'V', substr $$index, $_*4, 4 ), 1 ) = 1
for 0 .. $K - 1;
## Read the file line by line
while( <> ) {
## print it if it was picked ($. is 1-based, the packed indexes are 0-based;
## chomp because the -l switch adds the newline back on output)
chomp, print if vec( $ordered, $. - 1, 1 );
}
The nice thing is that it will comfortably extend linearly to handle selecting any number of lines from files of up to 250 million lines within a 2GB process.
With a little hackery to avoid Perl's memory-doubling scalar initialisation, it could be extended to handle any number of lines from a 500 million line file.
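One way to do that hackery (a sketch, not necessarily the approach intended above): grow the buffer in place with vec rather than building it with the x operator, so no full-size temporary ever gets copied. At 4 bytes per line that keeps 500e6 lines at roughly 2GB for the buffer itself, where the x-then-reference version needs about twice that.
use strict;
use warnings;
my $N = 500e6;                         ## line count, purely for illustration

## Writing the last byte via vec extends the scalar in place to 4 * $N
## NUL-filled bytes, without first creating a full-size temporary string.
my $buf = '';
vec( $buf, 4 * $N - 1, 8 ) = 0;
my $index = \$buf;

printf "Allocated %d bytes\n", length $$index;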
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: A series of random number and others (10MB / 8 seconds)
by BrowserUk (Patriarch) on Oct 09, 2008 at 10:24 UTC
This version fairly picks (exactly) 50% from 40e6 in 8 seconds* (10MB); 50% from 100e6 in 22 seconds* (25MB); 50% from 1e9 in under 4 minutes* (1/2GB). It can handle all the way up to 25e9 using <3GB in under 2 hours*:
(*)plus the time taken to read the file.
#! perl -slw
use strict;
use Math::Random::MT qw[ rand ];
use constant FOUR_GB => 2**32;
$|++;
our $N ||= 40e6;
our $K ||= $N / 2;
## A bit vector big enough to hold any number in the range
my $vector = chr(0) x (( $N + 7 ) / 8);
## Populate it with random bit values
## As we are after 50% this will get us very close very quickly
substr $vector, $_* 4, 4, pack( 'V', rand( FOUR_GB ) )
for 0 .. ( length( $vector ) / 4 ) -1;
## How many did we actually pick?
my $nBits = unpack '%32b*', $vector;
## Now (fairly) adjust up or down accordingly
while( $nBits++ < $K ) {
## pick bit to set
my $r = int( rand $N );
## If it is already set discount it
$nBits -= vec( $vector, $r, 1 );
## and set it anyway
vec( $vector, $r, 1 )=1;
}
while( $nBits-1 > $K ) {
## pick bit to clear
my $r = int( rand $N );
## If it is set discount it
$nBits -= vec( $vector, $r, 1 );
## and clear it anyway
vec( $vector, $r, 1 ) = 0;
}
## Read the file line by line
$\ = undef;   ## -l set $\ to "\n"; clear it, as the lines read below keep their own newlines
while( <> ) {
## print it if it was picked ($. is 1-based, the bits are 0-based)
print if vec( $vector, $. - 1, 1 );
}
It can also handle ratios other than 50%, but will start getting far less efficient as you move further away from 50% (up or down).
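For example (assuming the script above is saved as pick.pl; because of the -s switch on the shebang line, the $N and $K defaults can be overridden from the command line):
## Pick exactly half of a 40e6-line file (the defaults):
perl pick.pl bigfile.txt > picked.txt

## Pick 10e6 of 100e6 lines (10%):
perl pick.pl -N=100000000 -K=10000000 bigfile.txt > picked.txt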
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: A series of random number and others
by salva (Canon) on Oct 09, 2008 at 08:46 UTC
The rand built-in is not the only random number generator available from Perl. There are several CPAN modules implementing different algorithms, for instance: Math::GSL::RNG or Math::RngStream.
Re: A series of random number and others
by cdarke (Prior) on Oct 09, 2008 at 08:30 UTC
I admit my solution is not mathematically pure, and it is similar to GrandFather's first suggestion. Rather than storing all those numbers, my approach is to note that we have to choose half of the lines. So, do we use a line or not? 50/50 chance. That could easily give a shortfall, but I have a hack for that:
use warnings;
use strict;
my $limit = 40_000_000;
my $how_many = $limit/2;
my $hits = 0;
open (my $handle, 'rand_sorted.txt') or die "Unable to open rand_sorted.txt: $!";
# Using a C style loop to avoid a large temp list (Grandfather)
for (my $i = 0; $i < $limit && $hits < $how_many; $i++)
{
my $record = <$handle>;
# Force selection once the remaining lines are all needed to reach the quota
if (rand(2) > 1 || ($limit - $i) <= ($how_many - $hits)) {
$hits++;
# Do what you must to the record
print "$i: $record\n";
}
}
close ($handle);
Re: A series of random number and others
by drblove27 (Sexton) on Oct 09, 2008 at 01:49 UTC
Well, for a straightforward way (at least to my non-expert Perl skills), you can do this:
my @loc;    # Say this is your 40,000,000 lines of code
my $i;
my @rlc;    ### Random lines of code
for ($i = 0; $i < 20_000_000; $i++){
    $rlc[$i] = $loc[int rand 40_000_000];
}
That should do the trick, may not be the fastest thing in the world...
Re^2: A series of random number and others
by GrandFather (Saint)
use strict;
use warnings;
my %randLines;
++$randLines{1 + int rand 40_000_000} for 1 .. 20_000_000;
my @hits = map {[$_, $randLines{$_}]}
sort {$randLines{$b} <=> $randLines{$a}} keys %randLines;
print scalar @hits, " different lines selected\n";
print "Line $_->[0] hit $_->[1] times\n" for @hits[0 .. 9];
Prints:
32768 different lines selected
Line 3532715 hit 724 times
Line 24512940 hit 718 times
Line 20959473 hit 712 times
Line 28502198 hit 705 times
Line 4688721 hit 704 times
Line 37175293 hit 700 times
Line 26921387 hit 699 times
Line 28406983 hit 699 times
Line 3172608 hit 696 times
Line 31093751 hit 695 times
Note in particular that only 32768 (2**15, btw) different lines were selected out of the 20,000,000 the OP was after!
The actual results will vary with the specific build of Perl; the result shown is about the worst you are likely to encounter, because the rand in the build used to run the sample returns only a 15-bit value. Much better results are obtained if you add the line use Math::Random::MT qw(rand srand); to the sample, but be prepared to wait a while, and make sure your system has lots of memory available.
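If you want to check how coarse your own build's rand is, a quick diagnostic sketch (a million draws is more than enough to saturate a 15-bit generator):
use strict;
use warnings;

## Count the distinct values rand() returns on this build. A 15-bit
## rand can never produce more than 32768 of them, however many draws you make.
my %seen;
$seen{ rand() }++ for 1 .. 1_000_000;
print scalar( keys %seen ), " distinct rand() values in 1e6 draws\n";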
Perl reduces RSI - it saves typing
Thanks GrandFather. Your understanding is right: these lines must not be duplicated. I have only just begun to get used to Perl and am not familiar with these modules. It seems I have to study them thoroughly.
As for your suggestions of using a hash or an array, I have tried them, but I ran into memory problems. My machine cannot even slurp 40 million lines into an array. That's why I created this index file first.
So, given that memory limitation (3GB), what would be the fastest solution for such a case? Thank you.