in reply to Re^2: memory use array vs ref to array
in thread memory use array vs ref to array

If you need to slurp a big file and use minimum memory, slurp it as a string:

    open FILE, '<', ....;
    my $bigstring;
    do{ local $/; $bigstring = <FILE>; };

    ## NOTE: not
    ##     my $bigstring = do{ local $/; <FILE> };
    ## This consumes double the memory of the above.
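For anyone who wants to try this without a big file to hand, here is a minimal runnable sketch of the same idiom. The temporary file, the lexical filehandle `$fh` and the error message are my additions for illustration only; the assignment inside the block is the part that matters.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample file, created only so the sketch can be run as-is.
my ( $tmp, $path ) = tempfile( UNLINK => 1 );
print {$tmp} "line $_\n" for 1 .. 5;
close $tmp;

open my $fh, '<', $path or die "Cannot open '$path': $!";

my $bigstring;
{
    local $/;              # undefined record separator: read to end of file
    $bigstring = <$fh>;    # assign inside the block, as in the snippet above
}
close $fh;

print length( $bigstring ), "\n";    # 5 lines of 7 bytes each: prints 35
```

The sample only shows the shape of the code; the memory difference between the two forms will not be visible on a file this small.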

You can then easily process the file line-by-line, by opening the big string as a memory file:

    open RAMFILE, '<', \$bigstring;
    while( <RAMFILE> ) {
        ### process in the normal way.
    }
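A small self-contained sketch of reading a string through an in-memory filehandle follows. The sub name `count_lines` and the sample data are hypothetical, chosen just for the demonstration.

```perl
use strict;
use warnings;

# Count the lines held in a string by reading it via an in-memory filehandle.
sub count_lines {
    my ($text_ref) = @_;
    open my $ram, '<', $text_ref or die "Cannot open in-memory file: $!";
    my $count = 0;
    while ( my $line = <$ram> ) {
        $count++;          # a real program would process $line here
    }
    close $ram;
    return $count;
}

my $bigstring = "alpha\nbeta\ngamma\n";
print count_lines( \$bigstring ), "\n";    # prints 3
```

Passing a reference means the big string itself is never copied into the sub.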

However, if your long-running processing needs access to the lines as an array, then that could be a very inconvenient form for your task.

You might be tempted to build an index into the bigstring something like this:

    my( $p, @index ) = 0;
    $index[ ++$p ] = tell( RAMFILE ) while <RAMFILE>;

    ## Then to randomly access line $n of the file
    my $nthLine = substr( $bigstring, $index[ $n ], $index[ $n+1 ] - $index[ $n ] );

The problem is that the index will occupy almost as much memory as your original array of lines, and you're not just back to square one, but worse off.

However, you can build an index that occupies far less space:

    my( $p, $index ) = ( 0, "\0" );
    vec( $index, ++$p, 32 ) = tell( RAMFILE ) while <RAMFILE>;

    ## then to randomly access line $n of the file
    my $nthLine = substr( $bigstring, vec( $index, $n, 32 ), vec( $index, $n+1, 32 ) - vec( $index, $n, 32 ) );
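To make the packed index concrete, here is a minimal sketch on a three-line string. The sub names `build_index` and `nth_line` are hypothetical helpers of mine, not part of any module, and lines are numbered from 0 as in the snippet above.

```perl
use strict;
use warnings;

# Build a packed index of 32-bit offsets: entry $i is the byte offset at which
# line $i (0-based) starts; one extra trailing entry marks the end of the data.
sub build_index {
    my ($text_ref) = @_;
    open my $ram, '<', $text_ref or die "Cannot open in-memory file: $!";
    my $index = '';
    my $i     = 0;
    vec( $index, $i, 32 ) = 0;                   # line 0 starts at offset 0
    while ( defined( my $line = <$ram> ) ) {
        vec( $index, ++$i, 32 ) = tell( $ram );  # start of the next line
    }
    close $ram;
    return $index;
}

# Fetch line $n (0-based) using two adjacent offsets from the packed index.
sub nth_line {
    my ( $text_ref, $index_ref, $n ) = @_;
    my $start = vec( $$index_ref, $n,     32 );
    my $end   = vec( $$index_ref, $n + 1, 32 );
    return substr( $$text_ref, $start, $end - $start );
}

my $bigstring = "alpha\nbeta\ngamma\n";
my $index     = build_index( \$bigstring );

print nth_line( \$bigstring, \$index, 1 );    # prints "beta" and its newline
print length( $index ), "\n";                 # 4 entries * 4 bytes: prints 16
```

Each entry costs exactly 4 bytes, so 10 million lines need about 40MB of index. The price of 32-bit entries is that offsets must stay below 2**32, so this form only suits strings under 4GB.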

Putting it all together:

    #! perl -slw
    use strict;

    my $bigstring;
    do{ local $/; $bigstring = <> };
    close ARGV;

    print length $bigstring;

    open RAMFILE, '<', \$bigstring or die $!;

    ## preallocate the index string: 4 * 10e7 bytes = 400MB, room for 100 million 32-bit offsets
    my( $p, $index ) = ( 0, chr(0) x ( 4 * 10e7 ) );
    vec( $index, ++$p, 32 ) = tell( RAMFILE ) while <RAMFILE>;

    ## print the 500,000th line of the file
    my $n = 500_000;
    print substr( $bigstring, vec( $index, $n, 32 ), vec( $index, $n+1, 32 ) - vec( $index, $n, 32 ) );

    <STDIN>; ## pause to check memory size

    __END__
    [16:02:50.49] C:\test>dir test.dat
     Volume in drive C is Local Disk
     Volume Serial Number is 8C78-4B42

     Directory of C:\test

    15/09/2016  23:47     1,020,000,000 test.dat
                   1 File(s)  1,020,000,000 bytes
                   0 Dir(s)  379,695,529,984 bytes free

    [16:02:54.35] C:\test>wc -l test.dat
    10000000 test.dat

    [16:03:00.02] C:\test>head test.dat
    !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    """"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
    ####################################################################################################
    $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    &&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&
    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    ((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((
    ))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))
    ****************************************************************************************************

    [16:03:03.22] C:\test>1171361 test.dat
    1010000000
    5555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555

    # 1,774MB

    [16:03:22.93] C:\test>

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^4: memory use array vs ref to array
by choroba (Cardinal) on Sep 17, 2016 at 20:36 UTC
    You don't need the do if you don't use the value of the last expression.
    my $bigstring; { local $/; $bigstring = <FILE>; }

    Update: Moreover, I tried both ways on a 2GB file and didn't notice any difference in memory consumption:

     PID USER     PR NI    VIRT    RES  SHR S  %CPU  %MEM   TIME+ COMMAND
    7164 choroba  20  0 1970364 1.858g 3508 R 48.00 24.04 0:00.48 perl
    7164 choroba  20  0 1970364 1.866g 3508 S 0.990 24.15 0:00.49 perl
    7164 choroba  20  0 1970364 1.866g 3508 S 0.000 24.15 0:00.49 perl
    7164 choroba  20  0 1970364 1.866g 3508 S 0.000 24.15 0:00.49 perl
    7164 choroba  20  0 1970364 1.866g 3508 S 0.000 24.15 0:00.49 perl
    7166 choroba  20  0 1970364 1.866g 3564 S 48.51 24.15 0:00.49 perl
    7166 choroba  20  0 1970364 1.866g 3564 S 0.000 24.15 0:00.49 perl
    7166 choroba  20  0 1970364 1.866g 3564 S 0.000 24.15 0:00.49 perl
    7166 choroba  20  0 1970364 1.866g 3564 S 0.000 24.15 0:00.49 perl
    7166 choroba  20  0 1970364 1.866g 3564 S 0.000 24.15 0:00.49 perl

    Code run:

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
      > You don't need the do if you don't use the value of the last expression.

      True. But it's habitual.

      > Moreover, I tried both ways on a 2GB file and didn't notice any difference in memory consumption:

      I'm not familiar with the ins and outs of memory measurement on *nix; but I can demonstrate the difference on Windows.

      C:\test>p1
      [0]{} Perl> print mem;;
      9,432 K

      []{} Perl> open I, '<', 'test.dat'; my $s; do{ local $/; $s = <I> }; print mem;;
      997,784 K

      []{} Perl> Terminating on signal SIGINT(2)

      C:\test>p1
      [0]{} Perl> print mem;;
      9,440 K

      []{} Perl> open I, '<', 'test.dat'; my $s = do{ local $/; <I> }; print mem;;
      1,986,060 K

      []{} Perl> Terminating on signal SIGINT(2)

      As you can see, in the latter case the memory assigned to the process is double: ( 997784 - 9432 ) * 2 + 9440 = 1986144; almost exactly the 1,986,060 K measured.

      The reason is that in the latter case, the data is read into an internal mortal temporary scalar, and then copied from there to the named lexical before the memory attached to the temp is freed. As the allocation is greater than (from memory) 1MB, (on Windows at least) such huge allocations are made directly from the OS's virtual memory rather than from the process's heap; and then get released directly back to the OS.

      Which is usually a good thing, but it still means you need to have double the memory available for a short while, and if you are close to the limits, that can blow the process.

      Is it possible that large allocations are also freed back to the OS on *nix, and you are measuring after it has been freed?


        > Is it possible that large allocations are also freed back to the OS on *nix, and you are measuring after it has been freed?

        I'm not sure where this mem comes from. I tried with Memory::Usage, measuring the consumption with top running with 0.001s delay (was probably more in practice):

        #!/usr/bin/perl
        use warnings;
        use strict;
        use feature qw{ say };

        use Memory::Usage;

        my $file = shift;
        open my $FH, '<', $file or die $!;

        my $string;
        my $mu = 'Memory::Usage'->new;
        $mu->record('start');

        { local $/; $string = <$FH> };
        $mu->record('after do');

        undef $string;
        seek $FH, 0, 0;

        $string = do { local $/; <$FH> };
        $mu->record('after no do');

        $mu->dump;

        Output:

        time     vsz (    diff)     rss (    diff)  shared (    diff)    code (    diff)    data (    diff)
           0   17504 (   17504)    3916 (    3916)    3328 (    3328)    1644 (    1644)    1060 (    1060) start
           0 1970632 ( 1953128) 1957112 ( 1953196)    3520 (     192)    1644 (       0) 1954188 ( 1953128) after do
           1 1970632 (       0) 1957248 (     136)    3520 (       0)    1644 (       0) 1954188 (       0) after no do

        Top output:

Re^4: memory use array vs ref to array
by dkhosla1 (Sexton) on Sep 21, 2016 at 04:02 UTC
    Will try the slurp bigstringy thing next! Thx for the hint.