Desade has asked for the wisdom of the Perl Monks concerning the following question:

I'm a perl novice using a bit of academic perl that does some statistical stuff. I have a very large array (43,945,178 items) of strings in the format "x,y" where x and y are both floating point numbers in scientific notation.

For example: 4.90032E-8,1.25327E-7

The code implements a Fisher-Yates shuffle thus:

sub fisher_yates_shuffle {
    my $array   = shift;
    my $i;
    my $mmCount = @$array;
    datePrint "Executing Fisher-Yates shuffle up to $mmCount times...\n";
    my $mmStatus = 10000;
    for ($i = @$array; --$i; ) {
        if (--$mmStatus == 0) {
            datePrint "$i more to go...\n";
            $mmStatus = 1000000;
        }
        my $j = int rand ($i+1);
        next if $i == $j;
        @$array[$i,$j] = @$array[$j,$i];
    }
}

I added the status reporting, and yes I realize that it starts at 10,000 but then goes every 1,000,000. That was on purpose because I was trying to figure out what I'm posting about.

The behavior that I'm seeing is that it takes like an HOUR to do the first million of these. As $i decreases, so does the time for each million iterations. Normally I would just put this down to a higher frequency of $i==$j hits, skipping the (I assume) expensive string swapping. But the rate of drop-off is crazy-fast:

1st mil: 50m 11s
2nd mil: 2m 13s
3rd mil: 7s
...
40th mil: 2s

This is running on a 64-bit Linux machine with 6GB of physical memory and 18GB of swap, and since the code isn't too thrifty with memory earlier on, we're definitely in swap-land at this point. Is what I'm seeing just the random accesses pulling pages back from swap into physical memory until the whole array is resident, after a million or so misses? Or is there some sinister perl-like beast lurking in this seemingly innocuous shuffle function?

Is there a better way to do this, other than the iterative slicing he's doing? I'm staring down the barrel of HUNDREDS of these runs for my wife, of which this shuffle is about 75% of the time per run. If I could improve this one function, I would get my machine back much sooner.

Re: Very Large Arrays
by BrowserUk (Patriarch) on Feb 14, 2012 at 05:10 UTC

    There appears to be something going on here that is not obvious from the code snippet you posted.

    A 43-million-element array of 20-ish character strings should come in well below 3GB. And your Fisher-Yates implementation is in-place, so it should cause no memory growth. I.e. you should be well within your machine's capacity without swapping, and your shuffle should be completing within seconds.

    Is there a better way to do this, other than the iterative slicing he's doing?

    There are a couple of small improvements you could make to your shuffle:

    1. Shuffling through a temporary rather than a slice assignment is slightly more efficient:
      my $temp = $array->[ $i ];
      $array->[ $i ] = $array->[ $j ];
      $array->[ $j ] = $temp;
    2. There is no reason to test for $i == $j.

      Swapping a value with itself does not invalidate the shuffle, and it is cheaper to do a single redundant swap than to test 43 million times to avoid it. (A sketch combining both changes follows below.)
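    Putting the two together, a minimal version of your routine might look something like this (just a sketch; your datePrint progress reporting is left out):

      sub fisher_yates_shuffle {
          my $array = shift;
          for ( my $i = @$array; --$i; ) {
              my $j = int rand( $i + 1 );        # may equal $i; a self-swap is harmless
              my $temp       = $array->[ $i ];   # swap through a temporary
              $array->[ $i ] = $array->[ $j ];
              $array->[ $j ] = $temp;
          }
      }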

    That said, the savings from the above will make little dent in the current processing time, and they are completely negated by your tracing code.

    You need to look beyond the F-Y function for the cause of its slowness. E.g. are you consuming large amounts of data outside of that subroutine that could be pushing you into swapping?


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

      A 43-million-element array of 20-ish character strings should come in well below 3GB.
      On a 64-bit system with a modern perl, it uses about 3.5GB, assuming that the strings and the array were created in the most efficient way possible. Based on the OP's comments, I would guess they aren't being generated in the most efficient way possible; and together with the overhead of whatever the rest of the code does, swap really is an issue.

      In that case, it may be worth the OP's while to try an initial sequential read of the array to get all the elements swapped back in, before hitting random elements. Whether this will increase performance is pure speculation at this point.
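      For what it's worth, that pre-touch pass could be as simple as this sketch (assuming the array to be shuffled is called @pairs; that name is mine, not from the script):

        # Walk the array once, sequentially, so every page is faulted back
        # into physical memory before the random-access shuffle starts.
        my $bytes = 0;
        $bytes += length $_ for @pairs;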

      Another possibility is to use a single large string, divided into blocks of 20 chars, to avoid the overhead of the array and SVs; then access individual elements using substr().
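      A rough sketch of that idea, using a fixed width of 24 characters (the OP's example pair is 21 characters long, so 20 would be a little tight; $big, @pairs and $k are assumed names):

        # Pad every "x,y" string to 24 characters and concatenate them all.
        my $big = join '', map { sprintf '%-24s', $_ } @pairs;

        # Element $k is then the 24 characters starting at byte 24 * $k.
        my $k    = 12_345;
        my $pair = substr $big, 24 * $k, 24;
        $pair =~ s/\s+\z//;    # strip the padding before splitting on the comma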

      Dave.

        On a 64-bit system with a modern perl, it uses about 3.5GB, assuming that the strings and the array were created in the most efficient way possible.

        As much as that? I based my estimate upon this (running 5.10.1 64-bit):

        c:\test>p1
        [0] Perl> sub shuffle {
            my $r = shift;
            for ( 0 .. $#$r ) {
                my $a = $_ + rand( @$r - $_ );
                my $t = $r->[ $a ];
                $r->[ $a ] = $r->[ $_ ];
                $r->[ $_ ] = $t;
            };
        };;
        [0] Perl> @a = 1 .. 1e6;;
        [0] Perl> push @a, @a;;
        [0] Perl> push @a, @a;;
        [0] Perl> push @a, @a;;
        [0] Perl> push @a, @a;;
        [0] Perl> $t = time; shuffle \@a; print time - $t;;
        20.9429998397827
        [0] Perl> print total_size \@a;;
        518218064

        That's 16 million integers occupying just under 500MB being shuffled in 20 seconds.

        I reasoned that 3 times that gives 48M in 1.5GB. Then replace the ints with pointers and add 48M x 20 bytes (rounded up to 24) = 1.01GB for the strings themselves, which gives 2.5GB total, and around 60 seconds to shuffle. Did I miss something?

        Another possibility is to use a single large string, divided into blocks of 20 chars, to avoid the overhead of the array and SVs; then access individual elements using substr().

        I had similar thoughts. On the basis of the single pair of numbers the OP gave as an example, it looks like floats would be sufficient precision for his purposes. In which case, a packed string shuffled with substr seemed possible:

        c:\test>p1
        [0] Perl> sub shuffle {
            my $r = shift;
            my $n = length( $$r ) / 8;
            for ( 0 .. $n ) {
                my $a = $_ + rand( $n - $_ );
                my $t = substr $$r, $a, 8;
                substr $$r, $a, 8, substr $$r, $_, 8;
                substr $$r, $_, 8, $t;
            };
        };;
        [0] Perl> $a = pack 'f*', map -1e5 + rand(1e9), 1 .. 1e6;;
        [0] Perl> $a x= 96;;
        [0] Perl> print length $a;;
        384000000
        [0] Perl> $t = time; shuffle \$a; print time - $t;;
        59.6950001716614

        It really rather surprised me that the shuffle time was so close to my estimate for the array of strings, despite needing four substrs per swap.



      Yes, there is a great deal going on prior to this shuffle routine. The 43-mil array of number pairs being shuffled is generated from some calculations done earlier in the script, and those calcs take as input two files, each about 2.6GB in size. Since this is not my code, it causes me no shame to say that those files are read in thus:

      open(TEMP1, $overlap_files[$i]) || die "\n\nCan't open $overlap_files[$i]!!\n\n";
      open(TEMP2, $overlap_files[$j]) || die "\n\nCan't open $overlap_files[$j]!!\n\n";
      my @file1 = <TEMP1>;
      my @file2 = <TEMP2>;
      close TEMP1;
      close TEMP2;

      While it's doing that, you can watch the memory usage grow like an escalator to nowhere. Those arrays of strings then get iterated, split, parsed, calculated on, and eventually pairs of "interesting" values from each of them get pushed, one by one, onto the 43-mil array that is the subject of the shuffle. So by the time the shuffle gets called, we're quite a ways into swap.

      Thank you all for all the analysis and ideas. You pretty much confirmed what I thought I was seeing, and despite my personal desire to re-write this thing in C, I think the biggest boon for my buck is going to simply be doubling the memory on this machine before starting on the analysis runs she has planned for me. It sounds like having this whole array in physical memory prior to calling the shuffle is likely to drastically reduce the run-time... more than anything else I can squeeze out with software optimization alone.

        I think the biggest boon for my buck is going to simply be doubling the memory on this machine

        If the machine has the hardware capacity, that's definitely the easiest option.

        That said, you could probably make a few small changes to the script that would substantially reduce the memory requirement without having to rewrite it in C.

        For example, changing the small snippet you showed to the following will substantially reduce the memory requirement (at that point):

        open(TEMP1, $overlap_files[$i]) || die "\n\nCan't open $overlap_files[$i]!!\n\n";
        my $n = 0;
        my @file1;
        $file1[ $n++ ] = $_ while <TEMP1>;
        close TEMP1;

        open(TEMP2, $overlap_files[$j]) || die "\n\nCan't open $overlap_files[$j]!!\n\n";
        $n = 0;
        my @file2;
        $file2[ $n++ ] = $_ while <TEMP2>;
        close TEMP2;

        And building a big string instead of an array for the shuffle would reduce it further.

        Often, just a little inspection can show where large amounts of data can be incrementally thrown away as you finish with it; that can add up to huge savings.

        For example, by the time you are ready to do the shuffle, have you finished with the input arrays?
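        If so, explicitly releasing them before the shuffle hands several GB straight back, e.g. (assuming the slurped arrays are still called @file1 and @file2 at that point):

          # Free the slurped input files once the pair list has been built.
          undef @file1;
          undef @file2;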

        And does the algorithm for building the list of interesting pairs require that both input arrays be loaded into memory in their entirety, or could you process them one line at a time, perhaps reading them in lockstep?
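        A hypothetical lockstep version might look like the sketch below (the file names, handles and the processing step are my assumptions about the surrounding code, not taken from it):

          open my $fh1, '<', $overlap_files[$i] or die "Can't open $overlap_files[$i]: $!";
          open my $fh2, '<', $overlap_files[$j] or die "Can't open $overlap_files[$j]: $!";
          while ( defined( my $line1 = <$fh1> ) and defined( my $line2 = <$fh2> ) ) {
              chomp( $line1, $line2 );
              # ... split, parse and calculate on the two lines here, and push
              # only the "interesting" "x,y" pairs onto the array to be shuffled ...
          }
          close $fh1;
          close $fh2;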

        Anyway, good luck.



Re: Very Large Arrays
by educated_foo (Vicar) on Feb 14, 2012 at 04:48 UTC
    I would guess that's a swapping problem (you might be able to verify this with /usr/bin/time -v on Linux). I can think of several ways to cut down on memory: (1) shuffle an array of indices; (2) store the indices in a string, and access them using vec; (3) use PDL, which has compact numerical arrays.
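    For the curious, option (2) might look something like the following sketch (a plain Fisher-Yates over 32-bit indices held in one string; $count is just the number of pairs from the OP's post):

      # Build a string holding the indices 0 .. $count-1 as 32-bit integers
      # (4 bytes per index, so roughly 170MB for 43 million pairs).
      my $count   = 43_945_178;
      my $indices = '';
      vec( $indices, $_, 32 ) = $_ for 0 .. $count - 1;

      # Shuffle the indices instead of the big strings themselves.
      for ( my $i = $count; --$i; ) {
          my $j = int rand( $i + 1 );
          my $t = vec( $indices, $i, 32 );
          vec( $indices, $i, 32 ) = vec( $indices, $j, 32 );
          vec( $indices, $j, 32 ) = $t;
      }

      # The k-th element of the shuffled order is then
      # $array[ vec( $indices, $k, 32 ) ].
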
Re: Very Large Arrays
by lidden (Curate) on Feb 14, 2012 at 09:49 UTC
    Not directly related to your problem, but Perl comes with a built-in shuffle.
    use List::Util 'shuffle';
    use feature 'say';

    my @A = qw- 1 2 3 4 -;
    @A = shuffle @A;
    say "@A";
Re: Very Large Arrays
by salva (Canon) on Feb 17, 2012 at 16:41 UTC
    Shuffling an array is one of those things that can be done much more efficiently in C/XS than in Perl.

    I have just uploaded Array::Shuffle to CPAN; it is one to two orders of magnitude faster than List::Util::shuffle or your hand-crafted shuffle implementation in Perl.

    Its memory overhead is O(1).

    (List::Util::shuffle is also implemented in C, but it has the wrong interface from a performance point of view: it returns a shuffled copy of the list rather than shuffling the array in place.)
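    Going by the module's synopsis, usage for the OP's case would be along these lines (untested here; @pairs stands for the OP's array, and shuffle_array works on it in place):

      use Array::Shuffle qw(shuffle_array);

      # Shuffles @pairs in place; no copy of the 43-million-element list is made.
      shuffle_array(@pairs);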

      Just tried the included benchmark script with perl 5.40 on macOS, and shuffle_array fails every time with a bus error or segfault, while shuffle_huge_array fails every time claiming the array has attached magic.
      As List::Util is available separately on CPAN, why didn't you submit a patch to that, so everybody benefits?
        Because Array::Shuffle's functions do the operations in place for better performance, and so they do not really fit inside List::Util; and also because List::Util is famous for refusing to include additional functions (that's why we also have List::MoreUtils).

        Besides that, I also wanted to play with different shuffle algorithms, so creating a new specialized module made sense to me.