in reply to A series of random number and others

Well, for a straightforward way (at least to my non-expert Perl skills), you can do this:
my @loc;    # Say this is your 40,000,000 lines of code
my @rlc;    # Random lines of code
for (my $i = 0; $i < 20_000_000; $i++) {
    $rlc[$i] = $loc[int rand 40_000_000];
}
That should do the trick, though it may not be the fastest thing in the world...

Replies are listed 'Best First'.
Re^2: A series of random number and others
by GrandFather (Saint) on Oct 09, 2008 at 02:21 UTC

    Apart from not being what the OP wants, and being not so much Perl as transliterated C, that code actually demonstrates the problem the OP most likely had with Perl's default rand. Consider:

    use strict;
    use warnings;

    my %randLines;
    ++$randLines{1 + int rand 40_000_000} for 1 .. 20_000_000;

    my @hits = map  {[$_, $randLines{$_}]}
               sort {$randLines{$b} <=> $randLines{$a}}
               keys %randLines;

    print scalar @hits, " different lines selected\n";
    print "Line $_->[0] hit $_->[1] times\n" for @hits[0 .. 9];

    Prints:

    32768 different lines selected
    Line 3532715 hit 724 times
    Line 24512940 hit 718 times
    Line 20959473 hit 712 times
    Line 28502198 hit 705 times
    Line 4688721 hit 704 times
    Line 37175293 hit 700 times
    Line 26921387 hit 699 times
    Line 28406983 hit 699 times
    Line 3172608 hit 696 times
    Line 31093751 hit 695 times

    Note in particular that only 32768 (very close to 2**15, btw) different lines were selected out of the 20,000,000 the OP was after!

    The actual results will vary with the specific build of Perl; the result shown is about the worst you are likely to encounter, and occurs because the rand in the build of Perl used to run the sample uses only a 15-bit value. Much better results are obtained if you add the line use Math::Random::MT qw(rand srand); to the sample, but be prepared to wait a while and make sure your system has lots of memory available.
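(A sketch, not from the thread: the granularity of a given build's rand can be inspected directly via the standard Config module, which reports the build-time randbits setting, and confirmed empirically with a scaled-down version of the counting trick above.)

```perl
use strict;
use warnings;
use Config;

# randbits is the build-time number of random bits rand() delivers;
# 15 on the build used above, 48 on many modern builds.
print "randbits: $Config{randbits}\n";

# Empirical check, scaled down from the thread's numbers: count the
# distinct values "int rand N" actually produces. With a 15-bit rand,
# at most 2**15 = 32768 distinct values are possible no matter how
# large N is.
my %seen;
$seen{ int rand 40_000_000 } = 1 for 1 .. 100_000;
print scalar keys %seen, " distinct values from 100_000 draws\n";
```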


    Perl reduces RSI - it saves typing

      Thanks GrandFather. Your understanding is right: these lines must not be duplicates. I have only just begun to get used to Perl and am not familiar with these modules. It seems I have to study them thoroughly.

      As for your suggestion of using a hash or array, I have tried, but I ran into memory problems. My machine cannot even slurp 40 million lines into an array. That's why I created this index file first.

      So, given such a memory limitation (3 GB), what could be the fastest solution for such a case? Thank you.

        That depends on your actual application. If you need exactly some number of lines and the distribution must be uniform, then your current approach of generating an index file in some fashion seems appropriate. If that file is sorted, then you can open both the index file and the data file at the same time, read the 'next' index from the index file, and read lines from the data file until you reach that index; repeat until you reach the end of the index file. Consider:

        use warnings;
        use strict;

        open my $rndLines, '<', 'rand_sorted.txt'
            or die "Can't open rand_sorted.txt: $!";

        while (defined (my $nextLine = <$rndLines>)) {
            chomp $nextLine;
            next unless $nextLine =~ /^\d+/;

            my $line;
            while (defined ($line = <>)) {
                last if $. >= $nextLine;
            }
            print $line if defined $line;
        }
        close $rndLines;
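(For completeness, a sketch of how the sorted index file consumed above might be generated. The file name rand_sorted.txt matches the reader, but the counts here are deliberately tiny: at the thread's real scale a 20,000,000-key hash of picks needs both a wide-enough rand, e.g. Math::Random::MT, and a lot of memory.)

```perl
use strict;
use warnings;

my $total = 1_000;   # lines in the data file (scaled down for illustration)
my $want  =   100;   # unique lines to select

# Collect unique random line numbers as hash keys; duplicates are
# absorbed automatically, and the loop runs until enough are found.
my %pick;
$pick{ 1 + int rand $total } = 1 while keys %pick < $want;

# Write them out numerically sorted so the merge-scan reader can make
# a single pass over the data file.
open my $out, '>', 'rand_sorted.txt' or die "Can't write rand_sorted.txt: $!";
print {$out} "$_\n" for sort { $a <=> $b } keys %pick;
close $out;
```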

        I presume the approximate-number-of-lines solution I gave earlier, rand() < 0.5 and print while <>;, doesn't do what you need?


        Perl reduces RSI - it saves typing
      32768 (very close to 2**15 btw)

      How close do you need to be?

        Well, that depends on the application. For astronomy and politics within a couple of orders of magnitude is often enough. Engineering and statistics (except where quoted by politicians) like to be within an order of magnitude on the "safe" side. In maths you can often get a little closer than that.


        Perl reduces RSI - it saves typing