Re: A series of random number and others
by GrandFather (Saint) on Oct 09, 2008 at 01:25 UTC
Yep, that's pretty stupid code all right. Actually, we seem to get one or two questions a week that look like that, so don't feel too bad. The answer is generally the same: use a hash. However, in this case there is a twist: "20 million lines from 40 million" makes for a fairly big hash. There may be a smarter solution though: if you don't need exactly 20 million lines, then you can:
rand () < 0.5 and print while <>;
If you need the hash then your current code changes to something like:
open my $rndLines, '<', "rand_sorted.txt" or die "Can't open rand_sorted.txt: $!";
my %lines = map {$_ => undef} grep {chomp; length} <$rndLines>;
close $rndLines;
exists $lines{$.} and print while <>;
I find it interesting, however, that you couldn't generate the random number list using Perl. What was the issue you struck? Are you aware of Math::Random::MT, btw?
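For what it's worth, the module's functional interface is pretty much a drop-in replacement for the built-ins. A minimal sketch (the seed is arbitrary, purely for illustration):
use strict;
use warnings;
use Math::Random::MT qw(srand rand);   ## replaces the built-in rand/srand with the Mersenne Twister
srand( 42 );                           ## arbitrary seed so the example is repeatable
print int( rand 40_000_000 ), "\n" for 1 .. 5;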
Perl reduces RSI - it saves typing
Re: A series of random number and others
by blokhead (Monsignor) on Oct 09, 2008 at 01:02 UTC
I just posted a simple method to Randomly select N lines from a file, on the fly, as a snippet since I wasn't able to find it anywhere else here on the site. Perhaps I just missed it -- hope I haven't duplicated any effort. It is a generalization of the "usual" trick for choosing 1 random line from a file, on the fly.
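For context, a sketch of that kind of on-the-fly selection (reservoir sampling; not necessarily identical to the posted snippet): keep the first N lines, then let line $. overwrite a random slot with probability N/$.:
use strict;
use warnings;
my $N = shift || 10;                 ## how many lines to keep
my @picks;
while (<>) {
    if (@picks < $N) {
        push @picks, $_;             ## fill the reservoir with the first N lines
    }
    elsif (rand($.) < $N) {
        $picks[int rand $N] = $_;    ## overwrite a random slot with probability N/$.
    }
}
print @picks;
Every line ends up in the final set with equal probability, and only N lines are ever held in memory.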
Re: A series of random number and others
by BrowserUk (Patriarch) on Oct 09, 2008 at 07:20 UTC
The problem with the Knuth method is that it requires you to store 20 million lines in memory, which will take pretty much the full 2GB maximum generally available if your lines average a modest 50 chars in length. If they average longer, or your file is bigger, you've blown it.
This code:
- Generates a packed list of 40e6 integers (315MB; 40e6 * 4 * 2 bytes, because of the way Perl initialises scalars).
- Fisher-Yates (F-Y) shuffles that list in place (+0MB); about 2 minutes.
- "Sorts" the first 20e6 of the shuffled integers by building a bit vector (+5MB); about 30 seconds.
- Reads the file and prints a line if its bit is set in the bit vector.
In total, <320MB and ~2.5 minutes on my machine.
#! perl -slw
use strict;
use Math::Random::MT qw[ rand ];
our $N ||= 40e6;
our $K ||= $N / 2;
## In-place FY shuffle of a packed ULONG array
sub shuffle {
my $ref = shift;
my $n = length( $$ref ) >> 2;
for( 0 .. $n - 1 ) {
my $p = $_ + int( rand $n - $_ );
my $tmp = substr $$ref, $_*4, 4;
substr $$ref, $_*4, 4, substr $$ref, $p*4, 4;
substr $$ref, $p*4, 4, $tmp;
}
return;
}
## Allocate a packed array of $N Ulongs
my $index = \ (chr(0) x ( 4 * $N ));
## Populate it
substr $$index, $_*4, 4, pack 'V', $_ for 0 .. $N - 1;
## Shuffle in place
shuffle $index;
## Allocate a bit-vector for linear sort
my $ordered = chr(0) x int( ( $N +7 ) / 8 );
## Linear sort
vec( $ordered, unpack( 'V', substr $$index, $_*4, 4 ), 1 ) = 1
for 0 .. $K - 1;
## Read the file line by line
while( <> ) {
## print it if it was picked ($. is 1-based, the packed indexes are 0-based;
## chomp because the -l switch adds the newline back on output)
chomp, print if vec( $ordered, $. - 1, 1 );
}
The nice thing is that it will comfortably extend linearly to handle selecting any number of lines from files of up to 250 million lines within a 2GB process.
With a little hackery to avoid Perl's memory-doubling scalar initialisation, it could be extended to handle any number of lines from a 500 million line file.
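One way to do that hackery (a sketch, not necessarily the approach intended above): grow the buffer in place with vec rather than building it with the x operator, so no full-size temporary ever gets copied. At 4 bytes per line that keeps 500e6 lines at roughly 2GB for the buffer itself, where the x-then-reference version needs about twice that.
use strict;
use warnings;
my $N = 500e6;                         ## line count, purely for illustration

## Writing the last byte via vec extends the scalar in place to 4 * $N
## NUL-filled bytes, without first creating a full-size temporary string.
my $buf = '';
vec( $buf, 4 * $N - 1, 8 ) = 0;
my $index = \$buf;

printf "Allocated %d bytes\n", length $$index;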
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: A series of random number and others (10MB / 8 seconds)
by BrowserUk (Patriarch) on Oct 09, 2008 at 10:24 UTC
This version fairly picks (exactly) 50% from 40e6 in 8 seconds* (10MB); 50% from 100e6 in 22 seconds* (25MB); 50% from 1e9 in under 4 minutes* (1/2GB). It can handle all the way up to 25e9 using <3GB in under 2 hours*:
(*)plus the time taken to read the file.
#! perl -slw
use strict;
use Math::Random::MT qw[ rand ];
use constant FOUR_GB => 2**32;
$|++;
our $N ||= 40e6;
our $K ||= $N / 2;
## A bit vector big enough to hold any number in the range
my $vector = chr(0) x (( $N + 7 ) / 8);
## Populate it with random bit values
## As we are after 50% this will get us very close very quickly
substr $vector, $_* 4, 4, pack( 'V', rand( FOUR_GB ) )
for 0 .. ( length( $vector ) / 4 ) -1;
## How many did we actually pick?
my $nBits = unpack '%32b*', $vector;
## Now (fairly) adjust up or down accordingly
while( $nBits++ < $K ) {
## pick bit to set
my $r = int( rand $N );
## If it is already set discount it
$nBits -= vec( $vector, $r, 1 );
## and set it anyway
vec( $vector, $r, 1 )=1;
}
while( $nBits-1 > $K ) {
## pick bit to clear
my $r = int( rand $N );
## If it is set discount it
$nBits -= vec( $vector, $r, 1 );
## and clear it anyway
vec( $vector, $r, 1 ) = 0;
}
## Read the file line by line
$\ = undef;   ## -l set $\ to "\n"; clear it, as the lines read below keep their own newlines
while( <> ) {
## print it if it was picked ($. is 1-based, the bits are 0-based)
print if vec( $vector, $. - 1, 1 );
}
It can also handle ratios other than 50%, but will start getting far less efficient as you move further away from 50% (up or down).
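For example (assuming the script above is saved as pick.pl; because of the -s switch on the shebang line, the $N and $K defaults can be overridden from the command line):
## Pick exactly half of a 40e6-line file (the defaults):
perl pick.pl bigfile.txt > picked.txt

## Pick 10e6 of 100e6 lines (10%):
perl pick.pl -N=100000000 -K=10000000 bigfile.txt > picked.txt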
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: A series of random number and others
by salva (Canon) on Oct 09, 2008 at 08:46 UTC
The rand built-in is not the only random number generator available from Perl. There are several CPAN modules implementing different algorithms, for instance: Math::GSL::RNG or Math::RngStream.
Re: A series of random number and others
by cdarke (Prior) on Oct 09, 2008 at 08:30 UTC
I admit my solution is not mathematically pure, and it is similar to GrandFather's first suggestion. Rather than storing all those numbers, my approach is to note that we have to choose half of the lines. So, do we use a line or not? 50/50 chance. That could easily give a shortfall, but I have a hack for that:
use warnings;
use strict;
my $limit = 40_000_000;
my $how_many = $limit/2;
my $hits = 0;
open (my $handle, 'rand_sorted.txt') or die "Unable to open rand_sorted.txt: $!";
# Using a C style loop to avoid a large temp list (Grandfather)
for (my $i = 0; $i < $limit && $hits < $how_many; $i++)
{
my $record = <$handle>;
# Force selection once the remaining lines are all needed to reach the quota
if (rand(2) > 1 || ($limit - $i) <= ($how_many - $hits)) {
$hits++;
# Do what you must to the record
print "$i: $record\n";
}
}
close ($handle);
Re: A series of random number and others
by drblove27 (Sexton) on Oct 09, 2008 at 01:49 UTC
Well, for a straightforward way (at least to my non-expert Perl skills), you can do this:
my @loc;    # Say this is your 40,000,000 lines of code
my $i;
my @rlc;    ### Random lines of code
for ($i = 0; $i < 20_000_000; $i++){
    $rlc[$i] = $loc[int rand 40_000_000];
}
That should do the trick, may not be the fastest thing in the world...
Re^2: A series of random number and others
by GrandFather (Saint)
use strict;
use warnings;
my %randLines;
++$randLines{1 + int rand 40_000_000} for 1 .. 20_000_000;
my @hits = map {[$_, $randLines{$_}]}
sort {$randLines{$b} <=> $randLines{$a}} keys %randLines;
print scalar @hits, " different lines selected\n";
print "Line $_->[0] hit $_->[1] times\n" for @hits[0 .. 9];
Prints:
32768 different lines selected
Line 3532715 hit 724 times
Line 24512940 hit 718 times
Line 20959473 hit 712 times
Line 28502198 hit 705 times
Line 4688721 hit 704 times
Line 37175293 hit 700 times
Line 26921387 hit 699 times
Line 28406983 hit 699 times
Line 3172608 hit 696 times
Line 31093751 hit 695 times
Note in particular that only 32768 (2**15, btw) different lines were selected out of the 20,000,000 the OP was after!
The actual results will vary with the specific build of Perl; the result shown is about the worst you are likely to encounter, because the rand in the build used to run the sample returns only a 15-bit value. Much better results are obtained if you add the line use Math::Random::MT qw(rand srand); to the sample, but be prepared to wait a while, and make sure your system has lots of memory available.
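If you want to check how coarse your own build's rand is, a quick diagnostic sketch (a million draws is more than enough to saturate a 15-bit generator):
use strict;
use warnings;

## Count the distinct values rand() returns on this build. A 15-bit
## rand can never produce more than 32768 of them, however many draws you make.
my %seen;
$seen{ rand() }++ for 1 .. 1_000_000;
print scalar( keys %seen ), " distinct rand() values in 1e6 draws\n";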
Perl reduces RSI - it saves typing
Thanks GrandFather. Your understanding is right: these lines must not be duplicated. I have only just begun to get used to Perl and am not familiar with these modules. It seems I have to study them thoroughly.
As for your suggestions of using a hash or an array, I have tried them, but I ran into memory problems. My machine cannot even slurp 40 million lines into an array. That's why I created this index file first.
So, given that memory limitation (3GB), what would be the fastest solution for such a case? Thank you.