in reply to Re: Tokenising a 10MB file trashes a 2GB machine
in thread Tokenising a 10MB file trashes a 2GB machine
Assuming you are running some flavour of Linux: could you please try the following script on your machine and report its output?
#!/usr/bin/perl
use warnings;
use strict;
use Devel::Size qw(size total_size);
use Encode;

my $content = decode('UTF-8', 'tralala ' x 1E6);
print size($content), "\n";
print total_size([split m{(\p{Z}|\p{IsSpace}|\p{P})}ms, $content]), "\n";
procinfo();

sub procinfo {
    my @stat;
    my $MiB = 1024 * 1024;
    if (open(STAT, '<:utf8', "/proc/$$/stat")) {
        @stat = split /\s+/, <STAT>;
        close STAT;
    }
    else {
        die "procinfo: Unable to open stat file.\n";
    }
    # field 22 (0-based) of /proc/$$/stat is vsize in bytes, field 23 is RSS in pages
    printf "Vsize: %3.2f MiB (%10d)\n", $stat[22] / $MiB, $stat[22];
    print "RSS : $stat[23] pages\n";
}
The only difference I see is that the 32-bit architecture takes half the space the 64-bit one does. But still, there is a factor of 5 between the virtual memory used and the total size of the split list.
# ./tokenizer.pl
8000028
68000056
Vsize: 322.56 MiB ( 338231296)
RSS : 79087 pages

# ./tokenizer.pl
8000048
112000096
Vsize: 537.61 MiB ( 563724288)
RSS : 130586 pages

$ tokenizer.pl
8000048
112000096
Vsize: 539.42 MiB ( 565620736)
RSS : 130571 pages
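Just to spell the factor out (my own arithmetic on the 64-bit figures above, not a new measurement):

# Vsize 565620736 bytes, total_size of the list 112000096 bytes,
# assuming roughly 30 MiB of baseline perl overhead
perl -e 'printf "%.1f\n", (565620736 - 30*1024*1024) / 112000096'   # prints ~4.8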
So no matter what, memory usage is always about five times higher than it should be (roughly 30 MB of that is perl overhead, present even when the data is just a few bytes). Which makes me very unhappy...
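For comparison, here is a minimal sketch (my addition; the exact numbers vary by perl version and architecture) that looks at how much memory a single short token costs once it lives in its own scalar inside a list:

#!/usr/bin/perl
use strict;
use warnings;
use Devel::Size qw(size total_size);

# A single token, like those produced by the split above.
my $token = 'tralala';

printf "characters in token   : %d\n", length $token;
printf "size of the scalar    : %d bytes\n", size($token);        # SV head + string buffer
printf "one-element array ref : %d bytes\n", total_size([$token]); # plus AV and pointer

Multiplied by the millions of tokens the split produces, that per-scalar overhead alone could account for a large part of the gap, though I have not verified the exact breakdown.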
Bye
PetaMem All Perl: MT, NLP, NLU
Replies are listed 'Best First'.
Re^3: Tokenising a 10MB file trashes a 2GB machine
by dave_the_m (Monsignor) on Jul 16, 2008 at 13:19 UTC
Re^3: Tokenising a 10MB file trashes a 2GB machine
by moritz (Cardinal) on Jul 16, 2008 at 12:22 UTC