Desade has asked for the wisdom of the Perl Monks concerning the following question:
I'm a Perl novice using a bit of academic Perl that does some statistical stuff. I have a very large array (43,945,178 items) of strings in the format "x,y", where x and y are both floating-point numbers in scientific notation.
For example: 4.90032E-8,1.25327E-7
The code implements a Fisher-Yates shuffle thus:
    sub fisher_yates_shuffle {
        my $array = shift;
        my $i;
        my $mmCount = @$array;
        datePrint "Executing Fisher-Yates shuffle up to $mmCount times...\n";
        my $mmStatus = 10000;
        for ($i = @$array; --$i; ) {
            if (--$mmStatus == 0) {
                datePrint "$i more to go...\n";
                $mmStatus = 1000000;
            }
            my $j = int rand ($i+1);
            next if $i == $j;
            @$array[$i,$j] = @$array[$j,$i];
        }
    }
I added the status reporting, and yes I realize that it starts at 10,000 but then goes every 1,000,000. That was on purpose because I was trying to figure out what I'm posting about.
The behavior that I'm seeing is that it takes like an HOUR to do the first million of these. As $i decreases, so does the time for each million iterations. Normally I would just put this down to a higher frequency of $i==$j hits, skipping the (I assume) expensive string swapping. But the rate of drop-off is crazy-fast:
    1st mil:  50m 11s
    2nd mil:  2m 13s
    3rd mil:  7s
    ...
    40th mil: 2s
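As a rough check on the "$i == $j" explanation, the expected number of skipped swaps can be worked out directly. In the loop above, $i runs from n-1 down to 1 and $j is uniform on 0..$i, so an iteration skips with probability 1/($i+1). The sketch below sums those probabilities; the function names are made up for illustration and are not part of the original code.

```perl
use strict;
use warnings;

# Expected number of iterations in which $j == $i, i.e. the swap is skipped.
# Exact sum of 1/($i+1) for $i = 1 .. n-1.
sub expected_skips {
    my ($n) = @_;
    my $sum = 0;
    $sum += 1 / ($_ + 1) for 1 .. $n - 1;
    return $sum;
}

# Approximation of the same sum: H_n - 1, with the harmonic number H_n
# approximated as ln(n) plus the Euler-Mascheroni constant.
sub expected_skips_approx {
    my ($n) = @_;
    return log($n) + 0.5772156649 - 1;
}

# The array size quoted in the question.
printf "expected skipped swaps: %.2f\n", expected_skips_approx(43_945_178);
```

For 43,945,178 items this comes out at roughly 17 skipped swaps over the entire shuffle, so skipped swaps cannot account for a drop from 50 minutes to a few seconds per million iterations.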
This is running on a 64-bit Linux machine with 6GB of physical memory and 18GB of swap, and since the code isn't too thrifty with memory earlier on, we're definitely in swap-land at this point. Is the effect I'm seeing just swapped-out memory being randomly accessed and paged back into physical memory, until it's all there after a million or so cache-fails? Or is there some sinister perl-like beast lurking in this seemingly innocuous shuffle function?
Is there a better way to do this, other than the iterative slicing the original author is doing? I'm staring down the barrel of HUNDREDS of these runs for my wife, and this shuffle accounts for about 75% of the time per run. If I could improve this one function, I would get my machine back much sooner.
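One possible direction, sketched here under assumptions the post does not confirm: if the values can be held as native doubles rather than as "x,y" strings, the whole data set can live in a single packed string of fixed-width records, and the Fisher-Yates swap becomes a swap of two 16-byte slices. At 16 bytes per record, 43,945,178 records occupy about 703 MB, whereas an array of Perl scalars typically costs considerably more per element than the text it holds. The names pack_pairs, shuffle_packed and pair_at are hypothetical helpers, not part of the original code.

```perl
use strict;
use warnings;

# Bytes per packed (x, y) record: two native doubles, 16 on typical builds.
my $REC = length(pack('d2', 0, 0));

# Pack "x,y" strings into one flat byte string of native doubles.
# Assumption: losing the original text formatting is acceptable.
sub pack_pairs {
    my ($array) = @_;
    my $buf = '';
    $buf .= pack('d2', split(/,/, $_)) for @$array;
    return \$buf;
}

# In-place Fisher-Yates shuffle over the fixed-width records.
sub shuffle_packed {
    my ($buf) = @_;
    my $n = length($$buf) / $REC;
    for (my $i = $n - 1; $i > 0; --$i) {
        my $j = int rand($i + 1);
        next if $i == $j;
        # Copy both records out, then write each into the other's slot.
        my $rec_i = substr($$buf, $i * $REC, $REC);
        my $rec_j = substr($$buf, $j * $REC, $REC);
        substr($$buf, $i * $REC, $REC, $rec_j);
        substr($$buf, $j * $REC, $REC, $rec_i);
    }
    return;
}

# Read record $k back as a list of two numbers (x, y).
sub pair_at {
    my ($buf, $k) = @_;
    return unpack('d2', substr($$buf, $k * $REC, $REC));
}
```

A caller would build the buffer once with pack_pairs(\@data), shuffle it with shuffle_packed($buf), and read pairs back with pair_at($buf, $k). Note that pack_pairs as written still needs the original array in memory while it runs; packing each record as the input is read would avoid that, but depends on how the surrounding program loads its data.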
Replies are listed 'Best First'.

Re: Very Large Arrays
by BrowserUk (Patriarch) on Feb 14, 2012 at 05:10 UTC
by dave_the_m (Monsignor) on Feb 14, 2012 at 09:59 UTC
by BrowserUk (Patriarch) on Feb 14, 2012 at 13:32 UTC
by dave_the_m (Monsignor) on Feb 14, 2012 at 20:13 UTC
by Desade (Initiate) on Feb 16, 2012 at 00:36 UTC
by BrowserUk (Patriarch) on Feb 16, 2012 at 09:48 UTC
by Desade (Initiate) on Mar 16, 2012 at 16:12 UTC

Re: Very Large Arrays
by educated_foo (Vicar) on Feb 14, 2012 at 04:48 UTC

Re: Very Large Arrays
by lidden (Curate) on Feb 14, 2012 at 09:49 UTC

Re: Very Large Arrays
by salva (Canon) on Feb 17, 2012 at 16:41 UTC
by Anonymous Monk on Sep 30, 2024 at 06:19 UTC
by Anonymous Monk on Feb 17, 2012 at 17:02 UTC
by salva (Canon) on Feb 17, 2012 at 17:27 UTC