in reply to Tokenising a 10MB file trashes a 2GB machine

I can't reproduce the problem here. I took my 20MB junk mail file and split it; virtual memory usage was about 210MB, both for perl 5.8.8 and perl 5.10.0 (on Linux).

Do you do anything else in your script? What OS and perl version are you using?
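For reference, my test was roughly the following (a sketch reconstructed from memory; the filename and the exact split pattern are assumptions):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # slurp the whole mail file, then split it into tokens; watch
    # virtual memory in top/ps while it runs
    open my $fh, '<:utf8', 'junkmail.txt' or die "open: $!";
    my $content = do { local $/; <$fh> };
    my @tokens = split m{(\p{Z}|\p{IsSpace}|\p{P})}ms, $content;
    print scalar(@tokens), " tokens\n";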

Re^2: Tokenising a 10MB file trashes a 2GB machine
by PetaMem (Priest) on Jul 16, 2008 at 12:11 UTC

    Assuming you have some Linux flavour as OS: could you please try the following script on your machine and post its output?

    #!/usr/bin/perl
    use warnings;
    use strict;
    use Devel::Size qw(size total_size);
    use Encode;

    my $content = decode('UTF-8', 'tralala ' x 1E6);
    print size($content), "\n";
    print total_size([split m{(\p{Z}|\p{IsSpace}|\p{P})}ms, $content]), "\n";
    procinfo();

    sub procinfo {
        my @stat;
        my $MiB = 1024 * 1024;
        if (open(STAT, '<:utf8', "/proc/$$/stat")) {
            @stat = split /\s+/, <STAT>;
            close STAT;
        }
        else {
            die "procinfo: Unable to open stat file.\n";
        }
        printf "Vsize: %3.2f MiB (%10d)\n", $stat[22] / $MiB, $stat[22];
        print "RSS  : $stat[23] pages\n";
    }

    The only difference I see is that the 32-bit build takes half the space the 64-bit build takes. But either way, there is a factor of 5 between the virtual memory used and the total size of the split list.

    • Perl 5.8.8 on i686 (gcc 4.3.1 compiled, -march=686 -O2)

      # ./tokenizer.pl
      8000028
      68000056
      Vsize: 322.56 MiB ( 338231296)
      RSS : 79087 pages

    • Perl 5.8.8 on x86_64 (gcc 4.3.1 compiled, -march=core2 -O2)

      # ./tokenizer.pl
      8000048
      112000096
      Vsize: 537.61 MiB ( 563724288)
      RSS : 130586 pages

    • Perl 5.8.8 on x86_64 (gcc 4.1.2 compiled, -O2)

      $ tokenizer.pl
      8000048
      112000096
      Vsize: 539.42 MiB ( 565620736)
      RSS : 130571 pages

    So no matter what, memory usage is always about five times higher than it should be (roughly 30MB of that is perl's baseline overhead, present even when the data is only a few bytes). Which makes me very unhappy...
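    A quick way to see the per-string cost in isolation (a rough sketch; the token count is arbitrary):

    use Devel::Size qw(total_size);

    # one million identical short tokens: total_size divided by the
    # element count approximates the per-string footprint
    my @tokens = ('tralala') x 1_000_000;
    printf "per-string cost: %.1f bytes\n", total_size(\@tokens) / @tokens;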

    Bye
     PetaMem
        All Perl:   MT, NLP, NLU

      On a 32-bit system, there is an overhead of approximately 32 bytes per string (not including the string itself). Also, if you create a list (e.g. with split) and then, say, assign it to an array, perl may temporarily need two copies of each string (plus extra space for the large temporary stack). After the assignment the temporary copy is freed for perl to reuse, but not returned to the OS (so VM usage won't shrink). Given that Devel::Size itself has a large overhead, what you are seeing looks reasonable. Consider the following code:
      my $content = decode('UTF-8', 'tralala ' x 1E6);
      my @a;
      $#a = 10_000_000;    # presize array
      for (1..5) {
          print "ITER $_\n";
          push @a, split m{(\p{Z}|\p{IsSpace}|\p{P})}ms, $content;
          procinfo();
      }
      which on my system gives the following output:
      ITER 1
      Vsize: 248.18 MiB ( 260235264)
      RSS : 62362 pages
      ITER 2
      Vsize: 317.14 MiB ( 332550144)
      RSS : 80000 pages
      ITER 3
      Vsize: 393.71 MiB ( 412839936)
      RSS : 99598 pages
      ITER 4
      Vsize: 579.46 MiB ( 607612928)
      RSS : 147156 pages
      ITER 5
      Vsize: 625.23 MiB ( 655597568)
      RSS : 158895 pages
      which averages about 94MB growth per iteration. Each split produces roughly two million strings (one million words plus one million captured separators), so that's about 47 bytes per string pushed onto @a; allowing 32 bytes of string overhead per string (SV and PV structures) leaves 15 bytes per string, which, allowing for the trailing \0, rounding up to a multiple of 4, malloc overhead etc., looks reasonable.
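      If the temporary copies are the concern, one way to keep the peak lower (a sketch of the general idea, untested against this exact workload) is to tokenise incrementally with a /\G.../g loop instead of split, so the whole token list never sits on perl's stack at once:

      my @a;
      # match one separator or one run of non-separator characters per
      # iteration, and push tokens one at a time instead of all at once
      while ($content =~ /\G(\p{Z}|\p{IsSpace}|\p{P}|[^\p{Z}\p{IsSpace}\p{P}]+)/gc) {
          push @a, $1;
      }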

      Dave.

      I have Debian GNU/Linux on a boring 32 bit i386 machine.
      perl 5.8.8:

        8000028
        68000056
        Vsize: 322.68 MiB ( 338354176)
        RSS : 79112 pages

      perl 5.10.0:

        8000036
        84000100
        Vsize: 270.80 MiB ( 283951104)
        RSS : 68365 pages
Re^2: Tokenising a 10MB file trashes a 2GB machine
by PetaMem (Priest) on Jul 16, 2008 at 09:31 UTC

    The OS is 64-bit Gentoo Linux and the Perl is 5.8.8. Nevertheless, I initially had my own perl interpreter under suspicion, because it's compiled with GCC 4.3.1 -march=core2.

    Unfortunately, it behaved the same on all machines, so there is no "conservative" perl left as a reference. My bad. And since all of Perl's own test cases ran OK, this could be a candidate for a new test case, or even a new class of tests (expected memory consumption); see the sketch below. Maybe this could be carried to perl-porters.
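    A rough sketch of what such a test might look like (Linux-only via /proc; the 600 MiB threshold is made up and would need careful, platform-specific tuning):

    use strict;
    use warnings;
    use Test::More tests => 1;
    use Encode;

    my $content = decode('UTF-8', 'tralala ' x 1E6);
    my @tokens  = split m{(\p{Z}|\p{IsSpace}|\p{P})}ms, $content;

    # field 22 (0-indexed) of /proc/$$/stat is the virtual memory size
    open my $stat, '<', "/proc/$$/stat" or die "no /proc/$$/stat: $!";
    my @stat = split /\s+/, <$stat>;
    my $vsize_mib = $stat[22] / (1024 * 1024);

    cmp_ok($vsize_mib, '<', 600, 'tokenising 8MB of text stays under 600 MiB');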

    Bye
     PetaMem
        All Perl:   MT, NLP, NLU

Re^2: Tokenising a 10MB file trashes a 2GB machine
by Anonymous Monk on Jul 16, 2008 at 16:19 UTC

    Here's one more output:

    $ perl ./tokenizer.pl
    8000028
    68000056
    Vsize: 322.05 MiB ( 337694720)
    RSS : 79143 pages

    OS

    $ uname -a
    Linux ubuntu 2.6.24-19-generic #1 SMP Fri Jul 11 23:41:49 UTC 2008 i686 GNU/Linux

    GCC Version

    $ gcc -v
    gcc version 4.2.3 (Ubuntu 4.2.3-2ubuntu7)