Desade has asked for the wisdom of the Perl Monks concerning the following question:

I'm a perl novice using a bit of academic perl that does some statistical stuff. I have a very large array (43,945,178 items) of strings in the format "x,y" where x and y are both floating point numbers in scientific notation.

For example: 4.90032E-8,1.25327E-7

The code implements a Fisher-Yates shuffle thus:

sub fisher_yates_shuffle {
    my $array   = shift;
    my $i;
    my $mmCount = @$array;
    datePrint "Executing Fisher-Yates shuffle up to $mmCount times...\n";
    my $mmStatus = 10000;
    for ($i = @$array; --$i; ) {
        if (--$mmStatus == 0) {
            datePrint "$i more to go...\n";
            $mmStatus = 1000000;
        }
        my $j = int rand ($i+1);
        next if $i == $j;
        @$array[$i,$j] = @$array[$j,$i];
    }
}

I added the status reporting, and yes I realize that it starts at 10,000 but then goes every 1,000,000. That was on purpose because I was trying to figure out what I'm posting about.

The behavior that I'm seeing is that it takes like an HOUR to do the first million of these. As $i decreases, so does the time for each million iterations. Normally I would just put this down to a higher frequency of $i==$j hits, skipping the (I assume) expensive string swapping. But the rate of drop-off is crazy-fast:

1st mil: 50m 11s
2nd mil: 2m 13s
3rd mil: 7s
...
40th mil: 2s

This is running on a 64-bit Linux machine with 6GB of physical memory and 18GB of swap, and since the code isn't too thrifty with memory earlier on, we're definitely in swap-land at this point. Is what I'm seeing just the random accesses pulling pages back from swap into physical memory until the whole array is resident, after a million or so misses? Or is there some sinister perl-like beast lurking in this seemingly innocuous shuffle function?

Is there a better way to do this, other than the iterative slicing he's doing? I'm staring down the barrel of HUNDREDS of these runs for my wife, of which this shuffle is about 75% of the time per run. If I could improve this one function, I would get my machine back much sooner.

Re: Very Large Arrays
by BrowserUk (Patriarch) on Feb 14, 2012 at 05:10 UTC

    There appears to be something going on here that is not obvious from the code snippet you posted.

    A 43-million-element array of 20-ish character strings should come in well below 3GB. And your Fisher-Yates implementation is in-place, so it should cause no memory growth. I.e. you should be well within your machine's capacity without swapping, and your shuffle should be completing within seconds.

    Is there a better way to do this, other than the iterative slicing he's doing?

    There are a couple of small improvements you could make to your shuffle:

    1. Shuffling through a temporary rather than a slice assignment is slightly more efficient:
      my $temp = $array->[ $i ];
      $array->[ $i ] = $array->[ $j ];
      $array->[ $j ] = $temp;
    2. There is no reason to test for $i == $j.

      Swapping a value with itself does not invalidate the shuffle, and it is cheaper to do a single redundant swap than to test 43 million times to avoid it. (A sketch combining both changes follows below.)
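    Putting the two together, a minimal version of your routine might look something like this (just a sketch; your datePrint progress reporting is left out):

      sub fisher_yates_shuffle {
          my $array = shift;
          for ( my $i = @$array; --$i; ) {
              my $j = int rand( $i + 1 );        # may equal $i; a self-swap is harmless
              my $temp       = $array->[ $i ];   # swap through a temporary
              $array->[ $i ] = $array->[ $j ];
              $array->[ $j ] = $temp;
          }
      }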

    That said, the savings from the above will make little dent in the current processing time, and they are completely negated by your tracing code.

    You need to look beyond the F-Y function for the cause of its slowness. E.g. are you consuming large amounts of data outside of that subroutine that could be pushing you into swapping?


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

      A 43-million-element array of 20-ish character strings should come in well below 3GB.
      On a 64-bit system with a modern perl, it uses about 3.5GB, assuming that the strings and the array were created in the most efficient way possible. Based on the OP's comments, I would guess they aren't being generated in the most efficient way possible; and together with the overhead of whatever the rest of the code does, swap really is an issue.

      In that case, it may be worth the OP's while to try an initial sequential read of the array to get all the elements swapped back in, before hitting random elements. Whether this will increase performance is pure speculation at this point.
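      For what it's worth, that pre-touch pass could be as simple as this sketch (assuming the array to be shuffled is called @pairs; that name is mine, not from the script):

        # Walk the array once, sequentially, so every page is faulted back
        # into physical memory before the random-access shuffle starts.
        my $bytes = 0;
        $bytes += length $_ for @pairs;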

      Another possibility is to use a single large string, divided into blocks of 20 chars, to avoid the overhead of the array and SVs; then access individual elements using substr().
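      A rough sketch of that idea, using a fixed width of 24 characters (the OP's example pair is 21 characters long, so 20 would be a little tight; $big, @pairs and $k are assumed names):

        # Pad every "x,y" string to 24 characters and concatenate them all.
        my $big = join '', map { sprintf '%-24s', $_ } @pairs;

        # Element $k is then the 24 characters starting at byte 24 * $k.
        my $k    = 12_345;
        my $pair = substr $big, 24 * $k, 24;
        $pair =~ s/\s+\z//;    # strip the padding before splitting on the comma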

      Dave.

        On a 64-bit system with a modern perl, it uses about 3.5GB, assuming that the strings and the array were created in the most efficient way possible.

        As much as that? I based my estimate upon this (running 5.10.1 64-bit):

        c:\test>p1
        [0] Perl> sub shuffle {
            my $r = shift;
            for ( 0 .. $#$r ) {
                my $a = $_ + rand( @$r - $_ );
                my $t = $r->[ $a ];
                $r->[ $a ] = $r->[ $_ ];
                $r->[ $_ ] = $t;
            };
        };;
        [0] Perl> @a = 1 .. 1e6;;
        [0] Perl> push @a, @a;;
        [0] Perl> push @a, @a;;
        [0] Perl> push @a, @a;;
        [0] Perl> push @a, @a;;
        [0] Perl> $t = time; shuffle \@a; print time - $t;;
        20.9429998397827
        [0] Perl> print total_size \@a;;
        518218064

        That's 16 million integers occupying just under 500MB being shuffled in 20 seconds.

        I reasoned that 3 times that gives 48M in 1.5GB. Then replace the ints with pointers and add 48M x 20 bytes (rounded up to 24) = 1.01GB for the strings themselves, which gives 2.5GB total, and around 60 seconds to shuffle. Did I miss something?

        Another possibility is to use a single large string, divided into blocks of 20 chars, to avoid the overhead of the array and SVs; then access individual elements using substr().

        I had similar thoughts. On the basis of the single pair of numbers the OP gave as an example, it looks like floats would be sufficient precision for his purposes. In which case, a packed string shuffled with substr seemed possible:

        c:\test>p1
        [0] Perl> sub shuffle {
            my $r = shift;
            my $n = length( $$r ) / 8;
            for ( 0 .. $n ) {
                my $a = $_ + rand( $n - $_ );
                my $t = substr $$r, $a, 8;
                substr $$r, $a, 8, substr $$r, $_, 8;
                substr $$r, $_, 8, $t;
            };
        };;
        [0] Perl> $a = pack 'f*', map -1e5 + rand(1e9), 1 .. 1e6;;
        [0] Perl> $a x= 96;;
        [0] Perl> print length $a;;
        384000000
        [0] Perl> $t = time; shuffle \$a; print time - $t;;
        59.6950001716614

        It really rather surprised me that the shuffle time was so close to my estimate for the array of strings, despite needing four substrs per swap.



      Yes, there is a great deal going on prior to this shuffle routine. The 43-mil array of number pairs being shuffled is generated from some calculations done earlier in the script, and those calcs take as input two files, each about 2.6GB in size. Since this is not my code, it causes me no shame to say that those files are read in thus:

      open(TEMP1, $overlap_files[$i]) || die "\n\nCan't open $overlap_files[$i]!!\n\n";
      open(TEMP2, $overlap_files[$j]) || die "\n\nCan't open $overlap_files[$j]!!\n\n";
      my @file1 = <TEMP1>;
      my @file2 = <TEMP2>;
      close TEMP1;
      close TEMP2;

      While it's doing that, you can watch the memory usage grow like an escalator to nowhere. Those arrays of strings then get iterated, split, parsed, calculated on, and eventually pairs of "interesting" values from each of them get pushed, one by one, onto the 43-mil array that is the subject of the shuffle. So by the time the shuffle gets called, we're quite a ways into swap.

      Thank you all for all the analysis and ideas. You pretty much confirmed what I thought I was seeing, and despite my personal desire to re-write this thing in C, I think the biggest boon for my buck is going to simply be doubling the memory on this machine before starting on the analysis runs she has planned for me. It sounds like having this whole array in physical memory prior to calling the shuffle is likely to drastically reduce the run-time... more than anything else I can squeeze out with software optimization alone.

        I think the biggest boon for my buck is going to simply be doubling the memory on this machine

        If the machine has the hardware capacity, that's definitely the easiest option.

        That said, you could probably make a few small changes to the script that would substantially reduce the memory requirement without having to rewrite it in C.

        For example, changing the small snippet you showed to the following will substantially reduce the memory requirement (at that point):

        open(TEMP1, $overlap_files[$i]) || die "\n\nCan't open $overlap_files[$i]!!\n\n";
        my $n = 0;
        my @file1;
        $file1[ $n++ ] = $_ while <TEMP1>;
        close TEMP1;

        open(TEMP2, $overlap_files[$j]) || die "\n\nCan't open $overlap_files[$j]!!\n\n";
        $n = 0;
        my @file2;
        $file2[ $n++ ] = $_ while <TEMP2>;
        close TEMP2;

        And building a big string instead of an array for the shuffle would reduce it further.

        Often, just a little inspection can show where large amounts of data can be incrementally thrown away as you finish with it; that can add up to huge savings.

        For example, by the time you are ready to do the shuffle, have you finished with the input arrays?
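        If so, explicitly releasing them before the shuffle hands several GB straight back, e.g. (assuming the slurped arrays are still called @file1 and @file2 at that point):

          # Free the slurped input files once the pair list has been built.
          undef @file1;
          undef @file2;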

        And does the algorithm for building the list of interesting pairs require that both input arrays be loaded into memory in their entirety, or could you process them one line at a time, perhaps reading them in lockstep?
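        A hypothetical lockstep version might look like the sketch below (the file names, handles and the processing step are my assumptions about the surrounding code, not taken from it):

          open my $fh1, '<', $overlap_files[$i] or die "Can't open $overlap_files[$i]: $!";
          open my $fh2, '<', $overlap_files[$j] or die "Can't open $overlap_files[$j]: $!";
          while ( defined( my $line1 = <$fh1> ) and defined( my $line2 = <$fh2> ) ) {
              chomp( $line1, $line2 );
              # ... split, parse and calculate on the two lines here, and push
              # only the "interesting" "x,y" pairs onto the array to be shuffled ...
          }
          close $fh1;
          close $fh2;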

        Anyway, good luck.



Re: Very Large Arrays
by educated_foo (Vicar) on Feb 14, 2012 at 04:48 UTC
    I would guess that's a swapping problem (you might be able to verify this with /usr/bin/time -v on Linux). I can think of several ways to cut down on memory: (1) shuffle an array of indices; (2) store the indices in a string, and access them using vec; (3) use PDL, which has compact numerical arrays.
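    For the curious, option (2) might look something like the following sketch (a plain Fisher-Yates over 32-bit indices held in one string; $count is just the number of pairs from the OP's post):

      # Build a string holding the indices 0 .. $count-1 as 32-bit integers
      # (4 bytes per index, so roughly 170MB for 43 million pairs).
      my $count   = 43_945_178;
      my $indices = '';
      vec( $indices, $_, 32 ) = $_ for 0 .. $count - 1;

      # Shuffle the indices instead of the big strings themselves.
      for ( my $i = $count; --$i; ) {
          my $j = int rand( $i + 1 );
          my $t = vec( $indices, $i, 32 );
          vec( $indices, $i, 32 ) = vec( $indices, $j, 32 );
          vec( $indices, $j, 32 ) = $t;
      }

      # The k-th element of the shuffled order is then
      # $array[ vec( $indices, $k, 32 ) ].
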
Re: Very Large Arrays
by lidden (Curate) on Feb 14, 2012 at 09:49 UTC
    Not directly related to your problem, but Perl comes with a built-in shuffle.
    use List::Util 'shuffle';
    use feature 'say';

    my @A = qw- 1 2 3 4 -;
    @A = shuffle @A;
    say "@A";
Re: Very Large Arrays
by salva (Canon) on Feb 17, 2012 at 16:41 UTC
    Shuffling an array is one of those things that can be done much more efficiently in C/XS than in Perl.

    I have just uploaded Array::Shuffle to CPAN; it is one to two orders of magnitude faster than List::Util::shuffle or your hand-crafted shuffle implementation in Perl.

    Its memory overhead is O(1).

    (List::Util::shuffle is also implemented in C, but it has the wrong interface from a performance point of view: it returns a shuffled copy of the list rather than shuffling the array in place.)
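    Going by the module's synopsis, usage for the OP's case would be along these lines (untested here; @pairs stands for the OP's array, and shuffle_array works on it in place):

      use Array::Shuffle qw(shuffle_array);

      # Shuffles @pairs in place; no copy of the 43-million-element list is made.
      shuffle_array(@pairs);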

      Just tried the included benchmark script with perl 5.40 on macOS, and shuffle_array fails every time with a bus error or segfault, while shuffle_huge_array fails every time claiming the array has attached magic.
      As List::Util is available separately on CPAN, why didn't you submit a patch to that, so everybody benefits?
        Because Array::Shuffle's functions do the operations in place for better performance, and so they do not really fit inside List::Util; and also because List::Util is famous for refusing to include additional functions (that's why we also have List::MoreUtils).

        Besides that, I also wanted to play with different shuffle algorithms, so creating a new specialized module made sense to me.