
I'm a Perl novice using a bit of academic Perl code that does some statistical stuff. I have a very large array (43,945,178 items) of strings in the format "x,y", where x and y are both floating-point numbers in scientific notation.

For example: 4.90032E-8,1.25327E-7

The code implements a Fisher-Yates shuffle thus:

sub fisher_yates_shuffle {
    my $array = shift;
    my $i;
    my $mmCount = @$array;
    datePrint "Executing Fisher-Yates shuffle up to $mmCount times...\n";
    my $mmStatus = 10000;
    for ($i = @$array; --$i; ) {
        if (--$mmStatus == 0) {
           datePrint "$i more to go...\n";
           $mmStatus = 1000000;
        }
        my $j = int rand ($i+1);
        next if $i == $j;
        @$array[$i,$j] = @$array[$j,$i];
    }
}

I added the status reporting, and yes I realize that it starts at 10,000 but then goes every 1,000,000. That was on purpose because I was trying to figure out what I'm posting about.

The behavior that I'm seeing is that it takes like an HOUR to do the first million of these. As $i decreases, so does the time for each million iterations. Normally I would just put this down to a higher frequency of $i==$j hits, skipping the (I assume) expensive string swapping. But the rate of drop-off is crazy-fast:

1st mil: 50m 11s
2nd mil:  2m 13s
3rd mil:      7s
...
40th mil:     2s

This is running on a 64-bit Linux machine with 6GB of physical memory and 18GB of swap, and since the code isn't too thrifty with memory earlier on, we're definitely in swap-land at this point. Is that the effect I'm seeing: pages being pulled from swap into physical memory at random until, after a million or so page faults, it's all there? Or is there some sinister Perl-like beast lurking in this seemingly innocuous shuffle function?

Is there a better way to do this than the iterative slice-swapping the original author uses? I'm staring down the barrel of HUNDREDS of these runs for my wife, and this shuffle accounts for about 75% of the time per run. If I could improve this one function, I would get my machine back much sooner.
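
One direction that might be worth testing, if the slowdown really is swap thrashing, is to shrink the data so the shuffle touches far less memory. The sketch below is an assumption, not something from the original code: it packs each "x,y" string into a fixed 16-byte record (two native doubles via pack 'd2') inside a single string buffer, then runs the same Fisher-Yates swap over those records. The names pack_pairs, shuffle_packed and unpack_pair are hypothetical helpers made up for illustration. At 16 bytes per record, 43,945,178 items come to about 700 MB of payload, where an array of individual string scalars adds per-element overhead on top of the text itself. Note that this keeps the numeric values but not the exact original text of each string.

```perl
use strict;
use warnings;

# Hypothetical helper: convert "x,y" strings into one buffer of
# fixed-width records, each holding two native doubles (16 bytes).
sub pack_pairs {
    my ($pairs) = @_;
    my $buf = '';
    $buf .= pack 'd2', split /,/, $_ for @$pairs;
    return $buf;
}

# Hypothetical helper: in-place Fisher-Yates shuffle over fixed-width
# records stored in the string referenced by $buf_ref.
sub shuffle_packed {
    my ($buf_ref, $width) = @_;
    my $n = length($$buf_ref) / $width;
    for (my $i = $n - 1; $i > 0; $i--) {
        my $j = int rand($i + 1);
        next if $i == $j;
        # Copy both records out, then write each into the other's slot
        # using 4-argument substr (replacement in place).
        my $rec_i = substr $$buf_ref, $i * $width, $width;
        my $rec_j = substr $$buf_ref, $j * $width, $width;
        substr $$buf_ref, $i * $width, $width, $rec_j;
        substr $$buf_ref, $j * $width, $width, $rec_i;
    }
    return;
}

# Hypothetical helper: read record number $k back as an (x, y) list.
sub unpack_pair {
    my ($buf_ref, $k) = @_;
    return unpack 'd2', substr $$buf_ref, $k * 16, 16;
}
```

Usage would look something like: my $buf = pack_pairs(\@array); shuffle_packed(\$buf, 16); my ($x, $y) = unpack_pair(\$buf, 0);. Whether this actually helps depends on whether the rest of the program can work from the packed buffer instead of the original array of strings, so treat it as an experiment rather than a fix.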

In reply to Very Large Arrays by Desade
