in reply to memory use array vs ref to array

G'day dkhosla1,

Welcome to the Monastery.

Firstly, I was unable to repeat your exact tests because of issues with Memory::Usage: the module will not install on many systems (including mine: Mac OS X), something slightly more stringent testing by the author would have caught. See "Bug #83323 for Memory-Usage: Mark certain OS as unsupported" (raised three and a half years ago) for more on this.

However, I was interested in what you reported, and so ran different tests using Devel::Size. I tested the array much like you:

my @data = <$fh>;

I tested the arrayref in two different ways:

my $data_ref; @$data_ref = <$fh>;

and

my $data_ref = [ <$fh> ];

In ~/local/dev/test_data, I have a series of files I use for volume testing. Each consists of records of exactly 100 bytes (99 'X' characters plus a newline). They range in size from 1,000 to 10,000,000,000 bytes. I used the following for testing (a thousand, a million and a billion bytes):

$ ls -lSr text_?_1
-rw-r--r--  1 ken  staff        1000  8 Feb  2013 text_K_1
-rw-r--r--  1 ken  staff     1000000  8 Feb  2013 text_M_1
-rw-r--r--  1 ken  staff  1000000000  8 Feb  2013 text_G_1
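
If you want to generate equivalent test files yourself, something like the following will do it (a quick sketch, not the script I actually used):

#!/usr/bin/env perl

use strict;
use warnings;
use autodie qw{:all};

# One record: 99 'X' characters plus a newline = exactly 100 bytes.
my $record = ( 'X' x 99 ) . "\n";

# 10 / 10_000 / 10_000_000 records => 1kB / 1MB / 1GB files.
my %files = ( text_K_1 => 10, text_M_1 => 10_000, text_G_1 => 10_000_000 );

for my $name ( sort keys %files ) {
    open my $fh, '>', $name;
    print {$fh} $record for 1 .. $files{$name};
    close $fh;
}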

Here's the test code:

#!/usr/bin/env perl -l

use strict;
use warnings;
use autodie qw{:all};
use Devel::Size qw{size total_size};

{
    open my $fh, '<', $ARGV[0];
    my @data = <$fh>;
    print 'size(\@data): ', size(\@data);
    print 'total_size(\@data): ', total_size(\@data);
}
{
    open my $fh, '<', $ARGV[0];
    my $data_ref;
    @$data_ref = <$fh>;
    print 'size($data_ref): ', size($data_ref);
    print 'total_size($data_ref): ', total_size($data_ref);
}
{
    open my $fh, '<', $ARGV[0];
    my $data_ref = [ <$fh> ];
    print 'size($data_ref): ', size($data_ref);
    print 'total_size($data_ref): ', total_size($data_ref);
}

Here's the test results:

$ pm_1171361_mem_use_array.pl ~/local/dev/test_data/text_K_1
size(\@data): 144
total_size(\@data): 1494
size($data_ref): 144
total_size($data_ref): 1494
size($data_ref): 144
total_size($data_ref): 1494

$ pm_1171361_mem_use_array.pl ~/local/dev/test_data/text_M_1
size(\@data): 80064
total_size(\@data): 1420366
size($data_ref): 80064
total_size($data_ref): 1420366
size($data_ref): 80064
total_size($data_ref): 1420366

$ pm_1171361_mem_use_array.pl ~/local/dev/test_data/text_G_1
size(\@data): 80000064
total_size(\@data): 1420322314
size($data_ref): 80000064
total_size($data_ref): 1420322314
size($data_ref): 80000064
total_size($data_ref): 1420322314

As you can see, the sizes of the variables are identical regardless of whether arrays or arrayrefs were used.

While the variables are only a little over 40% greater than the raw data size, this doesn't take into account the memory used by the entire process (which is what you were measuring). The 1kB and 1MB tests finished almost instantaneously; the 1GB tests took about 8 seconds each (measured very roughly by counting in my head) and total available system memory (determined very roughly by inspection) dropped from ~3.5GB to ~0.5GB for each run. Although a little smaller, this does appear to be at least of the same order of magnitude as what you report.
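
If Memory::Usage won't install for you either, a rough substitute is to sample the process's resident set size around the slurp. Here's a minimal sketch, assuming a Unix-like ps that accepts -o rss= (Mac OS X and Linux both do):

#!/usr/bin/env perl

use strict;
use warnings;

# Sample this process's resident set size (in kB) via ps.
sub rss_kb {
    my ($kb) = qx{ps -o rss= -p $$} =~ /(\d+)/;
    return $kb;
}

my $before = rss_kb();
open my $fh, '<', $ARGV[0] or die $!;
my @data = <$fh>;
my $after = rss_kb();
print 'RSS grew by ', $after - $before, " kB\n";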

I suggest you take my test code, run it with your "bigfile", and see what results you get. I recommend that you run it at least a few times to check that you're getting consistent results.

— Ken

Re^2: memory use array vs ref to array
by dkhosla1 (Sexton) on Sep 17, 2016 at 13:41 UTC
    Thanks Ken (and others who responded). I will try your code and get back with the results. For others, to address the two questions asked:
    - I am looking at overall process usage because, at the end of the day, that is what matters (the OS imposes limits before it kills the process). However, in this simple test, I was not doing any processing.
    - I have to slurp the whole file because in the real code the processing time is high and we don't want to leave the network file handle open for minutes and hours if possible (sketched below).
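
    In outline, the real code looks like this (a sketch; the path and process() are placeholder names):

    use strict;
    use warnings;

    my $network_path = '/mnt/remote/bigfile';   # placeholder path
    open my $fh, '<', $network_path or die $!;
    my @data = <$fh>;        # slurp fast ...
    close $fh;               # ... so the network file handle is held only briefly
    process($_) for @data;   # process() = the slow per-record work (placeholder)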

      If you need to slurp a big file and use minimum memory, slurp it as a string:

      open FILE, '<', ....;
      my $bigstring;
      do{ local $/; $bigstring = <FILE>; };

      ## NOTE: not my $bigstring = do{ local $/; <FILE> };
      ## This consumes double the memory of the above.

      You can then easily process the file line-by-line, by opening the big string as a memory file:

      open RAMFILE, '<', \$bigstring;
      while( <RAMFILE> ) {
          ### process in the normal way.
      }

      However, if your long running processing needs access to the lines as an array, then that could be a very inconvenient form for your task.

      You might be tempted to build an index into the bigstring something like this:

      my( $p, @index ) = 0;
      $index[ ++$p ] = tell( RAMFILE ) while <RAMFILE>;

      ## Then to randomly access line $n of the file
      my $nthLine = substr( $bigstring, $index[ $n ], $index[ $n+1 ] - $index[ $n ] );

      The problem is that the index will occupy almost as much memory as your original array of lines, and you're not just back to square one, but worse off.

      However, you can build an index that occupies far less space:

      my( $p, $index ) = ( 0, "\0" );
      vec( $index, ++$p, 32 ) = tell( RAMFILE ) while <RAMFILE>;

      ## then to randomly access line $n of the file
      my $nthLine = substr( $bigstring,
          vec( $index, $n,   32 ),
          vec( $index, $n+1, 32 ) - vec( $index, $n, 32 ) );
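
      To put rough numbers on that, here's a quick sketch (illustrative offsets; Devel::Size as used in the tests above) comparing the two index representations for a million lines:

      use strict;
      use warnings;
      use Devel::Size qw{total_size};

      my $lines = 1_000_000;

      ## One full Perl scalar per offset ...
      my @index;
      $index[ $_ ] = $_ * 100 for 0 .. $lines;

      ## ... versus 4 packed bytes per offset
      my $vecindex = '';
      vec( $vecindex, $_, 32 ) = $_ * 100 for 0 .. $lines;

      print 'array index: ', total_size( \@index ),    "\n";  ## tens of MB
      print 'vec index:   ', total_size( \$vecindex ), "\n";  ## a little over 4MB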

      Putting it all together:

      #! perl -slw
      use strict;

      my $bigstring; do{ local $/; $bigstring = <> }; close ARGV;
      print length $bigstring;

      open RAMFILE, '<', \$bigstring or die $!;

      my( $p, $index ) = ( 0, chr(0) x ( 4 * 10e7 ) );
      vec( $index, ++$p, 32 ) = tell( RAMFILE ) while <RAMFILE>;

      ## print the 500,000th line of the file
      my $n = 500_000;
      print substr( $bigstring, vec( $index, $n, 32 ),
          vec( $index, $n+1, 32 ) - vec( $index, $n, 32 ) );

      <STDIN>; ## pause to check memory size

      __END__
      [16:02:50.49] C:\test>dir test.dat
       Volume in drive C is Local Disk
       Volume Serial Number is 8C78-4B42

       Directory of C:\test

      15/09/2016  23:47     1,020,000,000 test.dat
                     1 File(s)  1,020,000,000 bytes
                     0 Dir(s)  379,695,529,984 bytes free

      [16:02:54.35] C:\test>wc -l test.dat
      10000000 test.dat

      [16:03:00.02] C:\test>head test.dat
      !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
      """"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
      ####################################################################################################
      $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
      %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
      &&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&
      ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
      ((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((
      ))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))
      ****************************************************************************************************

      [16:03:03.22] C:\test>1171361 test.dat
      1010000000
      5555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555
      # 1,774MB
      [16:03:22.93] C:\test>

      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
      In the absence of evidence, opinion is indistinguishable from prejudice.
        You don't need the do if you aren't using the value of its last expression:

        my $bigstring;
        { local $/; $bigstring = <FILE>; }

        Update: Moreover, I tried both ways on a 2GB file and didn't notice any difference in memory consumption:

          PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
         7164 choroba   20   0 1970364 1.858g   3508 R 48.00 24.04   0:00.48 perl
         7164 choroba   20   0 1970364 1.866g   3508 S 0.990 24.15   0:00.49 perl
         7164 choroba   20   0 1970364 1.866g   3508 S 0.000 24.15   0:00.49 perl
         7164 choroba   20   0 1970364 1.866g   3508 S 0.000 24.15   0:00.49 perl
         7164 choroba   20   0 1970364 1.866g   3508 S 0.000 24.15   0:00.49 perl
         7166 choroba   20   0 1970364 1.866g   3564 S 48.51 24.15   0:00.49 perl
         7166 choroba   20   0 1970364 1.866g   3564 S 0.000 24.15   0:00.49 perl
         7166 choroba   20   0 1970364 1.866g   3564 S 0.000 24.15   0:00.49 perl
         7166 choroba   20   0 1970364 1.866g   3564 S 0.000 24.15   0:00.49 perl
         7166 choroba   20   0 1970364 1.866g   3564 S 0.000 24.15   0:00.49 perl

        Code run:
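
        A minimal sketch of the comparison (assumed details: file name from @ARGV, a second argument to select the variant, and a pause so top can sample the memory):

        #!/usr/bin/perl
        use strict;
        use warnings;

        open my $fh, '<', $ARGV[0] or die $!;

        my $bigstring;
        if ( $ARGV[1] ) {
            # value-of-do variant
            $bigstring = do { local $/; <$fh> };
        }
        else {
            # plain-block variant: no do needed
            { local $/; $bigstring = <$fh>; }
        }
        print length( $bigstring ), "\n";
        sleep 10;    # keep the process alive so top can sample RES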

        ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
        Will try the slurp bigstringy thing next! Thx for the hint.
      Hi Ken, Follow-up on the test results for the 3 scenarios. I added the OS memory usage in each case (which is the real issue). The last option ( [ <$fh> ] ) seems to be a little faster, but memory usage is similar to what Memory::Usage shows. I am going to try the 'slurp string' approach next, as suggested by BrowserUk. Also interesting that using the reference (T2) uses 15% more RAM.
      T1: size(\@data): 177490392
          total_size(\@data): 1837604848
            PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+ COMMAND
           6569 xxxxxxx   25   0 6355m 6.0g 1592 R 100.0 51.5  0:46.67 perl

      T2: size($data_ref): 177490392
          total_size($data_ref): 1837604848
            PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+ COMMAND
           6603 xxxxxxx   25   0 7208m 6.9g 1592 R 100.0 58.6  0:51.38 perl

      T3: size($data_ref): 177490392
          total_size($data_ref): 1755984649
            PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+ COMMAND
           6625 xxxxxxx   25   0 6315m 6.0g 1592 R 100.0 51.2  0:47.73 perl