in reply to Re^2: Tokenising a 10MB file trashes a 2GB machine
in thread Tokenising a 10MB file trashes a 2GB machine

On a 32-bit system, there is an overhead of approximately 32 bytes per string (not including the string itself). Also, if you create a list (e.g. with split) and then assign it to an array, perl may temporarily need two copies of each string (plus extra space for the large temporary stack). After the assignment the temporary copy is freed for perl to reuse, but it is not returned to the OS (so VM usage won't shrink). Given that Devel::Size itself has a large overhead, what you are seeing looks reasonable. Consider the following code:
    use Encode qw(decode);

    my $content = decode('UTF-8', 'tralala ' x 1E6);
    my @a;
    $#a = 10_000_000;    # presize array
    for (1..5) {
        print "ITER $_\n";
        push @a, split m{(\p{Z}|\p{IsSpace}|\p{P})}ms, $content;
        procinfo();      # helper (not shown) that prints the process's Vsize and RSS
    }
which on my system gives the following output:
    ITER 1
    Vsize: 248.18 MiB ( 260235264)
    RSS : 62362 pages
    ITER 2
    Vsize: 317.14 MiB ( 332550144)
    RSS : 80000 pages
    ITER 3
    Vsize: 393.71 MiB ( 412839936)
    RSS : 99598 pages
    ITER 4
    Vsize: 579.46 MiB ( 607612928)
    RSS : 147156 pages
    ITER 5
    Vsize: 625.23 MiB ( 655597568)
    RSS : 158895 pages
which averages about 94MB growth per iteration, or about 47 bytes per string pushed onto @a (each iteration pushes 2 million strings: one word plus one captured separator for each of the 1E6 repetitions). Allowing 32 bytes of overhead per string (the SV and PV structures) leaves about 15 bytes per string, which, allowing for the trailing \0, rounding up to a multiple of 4, malloc overhead and so on, looks reasonable.
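For anyone who wants to check the arithmetic, here is a minimal sketch that redoes it from the Vsize byte counts in the output above. The variable names are only illustrative, and the count of 2 million strings per iteration is my reading of what the capturing split returns for this input. The exact per-string figure shifts by a couple of bytes depending on whether "MB" is taken as 10^6 or 2^20 bytes, which doesn't change the conclusion.

```perl
use strict;
use warnings;

# Vsize in bytes, copied from the ITER 1 and ITER 5 lines of the output.
my ($first, $last) = (260_235_264, 655_597_568);
my $steps = 4;    # growth is measured across ITER 1 -> ITER 5

# 'tralala ' x 1E6 split with a capturing pattern yields one word and one
# separator per repetition, i.e. 2 million strings per push (assumption
# derived from the input, not measured here).
my $strings_per_iter = 2 * 1_000_000;

# Approximate per-string overhead on a 32-bit perl (SV + PV structures).
my $overhead = 32;

my $growth_per_iter  = ($last - $first) / $steps;
my $bytes_per_string = $growth_per_iter / $strings_per_iter;

printf "growth per iteration: %.1f MiB\n", $growth_per_iter / 2**20;
printf "bytes per string:     %.1f\n",     $bytes_per_string;
printf "left for string data: %.1f\n",     $bytes_per_string - $overhead;
```

Whichever way the megabytes are counted, the remainder after the 32-byte overhead is in the range of 15 to 18 bytes, which is plausible for a 7-character string plus its trailing \0, alignment padding and malloc bookkeeping.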

Dave.