Re^3: Tokenising a 10MB file trashes a 2GB machine

On a 32-bit system, there is roughly 32 bytes of overhead per string (not including the string itself). Also, if you create a list (e.g. with split) and then assign it to an array, perl may temporarily need two copies of each string (plus extra space for the large temporary stack). After the assignment, the temporary copy is freed for perl to reuse, but it is not returned to the OS (so VM usage won't shrink). Given that Devel::Size itself has a large overhead, what you are seeing looks reasonable. Consider the following code:

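use strict;
use warnings;
use Encode qw(decode);

# 1E6 words, each followed by a space (about 8 million characters)
my $content = decode('UTF-8', 'tralala ' x 1E6);

my @a;
$#a = 10_000_000; # presize array
for (1..5)
{
    print "ITER $_\n";
    # the capturing parens make split return the separators as well
    push @a, split m{(\p{Z}|\p{IsSpace}|\p{P})}ms, $content;
    procinfo(); # helper (not shown) that prints the process's Vsize and RSS
}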

which on my system gives the following output:

ITER 1
Vsize: 248.18 MiB ( 260235264)
RSS  : 62362 pages
ITER 2
Vsize: 317.14 MiB ( 332550144)
RSS  : 80000 pages
ITER 3
Vsize: 393.71 MiB ( 412839936)
RSS  : 99598 pages
ITER 4
Vsize: 579.46 MiB ( 607612928)
RSS  : 147156 pages
ITER 5
Vsize: 625.23 MiB ( 655597568)
RSS  : 158895 pages

which averages about 94 MiB growth per iteration, or roughly 49 bytes per string pushed onto @a (each iteration pushes 2E6 strings: 1E6 words plus 1E6 captured separators). Allowing 32 bytes of overhead per string (SV and PV structures) leaves about 17 bytes per string, which, allowing for the trailing \0, rounding up to a multiple of 4, malloc overhead and so on, looks reasonable.
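For anyone who wants to check the arithmetic, here is a small sketch that recomputes the per-string figure from the exact Vsize byte counts in the output above. It assumes each iteration pushes 2E6 strings (1E6 words plus 1E6 captured separators) and takes the 32 bytes of SV/PV overhead as given; the variable names are just for illustration. Working from exact bytes rather than rounded megabytes, the result lands in the high 40s of bytes per string, which doesn't change the conclusion.

```perl
use strict;
use warnings;

# Vsize in bytes after each iteration, copied from the output above.
my @vsize = (260235264, 332550144, 412839936, 607612928, 655597568);

# Assumption: 'tralala ' x 1E6 splits into 1E6 words plus 1E6 separators.
my $strings_per_iter = 2 * 1_000_000;

# Assumption: fixed per-string overhead on a 32-bit perl (SV + PV structures).
my $overhead = 32;

# Mean growth between consecutive iterations, in bytes.
my $growth = ($vsize[-1] - $vsize[0]) / (@vsize - 1);

# Bytes attributable to each string, and what is left after the fixed overhead
# (string body, trailing \0, rounding, malloc bookkeeping).
my $per_string = $growth / $strings_per_iter;
my $remainder  = $per_string - $overhead;

printf "growth per iteration: %.2f MiB\n", $growth / 2**20;
printf "bytes per string:     %.1f\n", $per_string;
printf "left after overhead:  %.1f\n", $remainder;
```

The growth is far from even across iterations (iteration 4 jumps much more than the others), so only the average is meaningful here.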

Dave.
