in reply to Tokenising a 10MB file trashes a 2GB machine

I can't reproduce the problem here. I took my 20MB junk mail file and split it; virtual memory usage was about 210MB, both for perl 5.8.8 and perl 5.10.0 (on Linux).

Do you do anything else in your script? What OS and perl version are you using?
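For reference, my test was roughly the following (a sketch reconstructed from memory; the filename and the exact split pattern are assumptions):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # slurp the whole mail file, then split it into tokens; watch
    # virtual memory in top/ps while it runs
    open my $fh, '<:utf8', 'junkmail.txt' or die "open: $!";
    my $content = do { local $/; <$fh> };
    my @tokens = split m{(\p{Z}|\p{IsSpace}|\p{P})}ms, $content;
    print scalar(@tokens), " tokens\n";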

Re^2: Tokenising a 10MB file trashes a 2GB machine
by PetaMem (Priest) on Jul 16, 2008 at 12:11 UTC

    Assuming you have some Linux flavour as OS: could you please try the following script on your machine and post its output?

    #!/usr/bin/perl
    use warnings;
    use strict;
    use Devel::Size qw(size total_size);
    use Encode;

    my $content = decode('UTF-8', 'tralala ' x 1E6);
    print size($content), "\n";
    print total_size([split m{(\p{Z}|\p{IsSpace}|\p{P})}ms, $content]), "\n";
    procinfo();

    sub procinfo {
        my @stat;
        my $MiB = 1024 * 1024;
        if (open(STAT, '<:utf8', "/proc/$$/stat")) {
            @stat = split /\s+/, <STAT>;
            close STAT;
        }
        else {
            die "procinfo: Unable to open stat file.\n";
        }
        printf "Vsize: %3.2f MiB (%10d)\n", $stat[22] / $MiB, $stat[22];
        print "RSS  : $stat[23] pages\n";
    }

    The only difference I see is that the 32-bit build takes half the space the 64-bit build takes. But either way, there is a factor of 5 between the virtual memory used and the total size of the split list.

    • Perl 5.8.8 on i686 (gcc 4.3.1 compiled, -march=686 -O2)

      # ./tokenizer.pl
      8000028
      68000056
      Vsize: 322.56 MiB ( 338231296)
      RSS : 79087 pages

    • Perl 5.8.8 on x86_64 (gcc 4.3.1 compiled, -march=core2 -O2)

      # ./tokenizer.pl
      8000048
      112000096
      Vsize: 537.61 MiB ( 563724288)
      RSS : 130586 pages

    • Perl 5.8.8 on x86_64 (gcc 4.1.2 compiled, -O2)

      $ tokenizer.pl
      8000048
      112000096
      Vsize: 539.42 MiB ( 565620736)
      RSS : 130571 pages

    So no matter what, memory usage is always about five times higher than it should be (roughly 30MB of that is perl's baseline overhead, present even when the data is only a few bytes). Which makes me very unhappy...
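    A quick way to see the per-string cost in isolation (a rough sketch; the token count is arbitrary):

    use Devel::Size qw(total_size);

    # one million identical short tokens: total_size divided by the
    # element count approximates the per-string footprint
    my @tokens = ('tralala') x 1_000_000;
    printf "per-string cost: %.1f bytes\n", total_size(\@tokens) / @tokens;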

    Bye
     PetaMem
        All Perl:   MT, NLP, NLU

      On a 32-bit system, there is an overhead of approximately 32 bytes per string (not including the string itself). Also, if you create a list (e.g. with split) and then, say, assign it to an array, perl may temporarily need two copies of each string (plus extra space for the large temporary stack). After the assignment the temporary copy is freed for perl to reuse, but not returned to the OS (so VM usage won't shrink). Given that Devel::Size itself has a large overhead, what you are seeing looks reasonable. Consider the following code:
      my $content = decode('UTF-8', 'tralala ' x 1E6);
      my @a;
      $#a = 10_000_000;    # presize array
      for (1..5) {
          print "ITER $_\n";
          push @a, split m{(\p{Z}|\p{IsSpace}|\p{P})}ms, $content;
          procinfo();
      }
      which on my system gives the following output:
      ITER 1
      Vsize: 248.18 MiB ( 260235264)
      RSS : 62362 pages
      ITER 2
      Vsize: 317.14 MiB ( 332550144)
      RSS : 80000 pages
      ITER 3
      Vsize: 393.71 MiB ( 412839936)
      RSS : 99598 pages
      ITER 4
      Vsize: 579.46 MiB ( 607612928)
      RSS : 147156 pages
      ITER 5
      Vsize: 625.23 MiB ( 655597568)
      RSS : 158895 pages
      which averages about 94MB growth per iteration. Each split produces roughly two million strings (one million words plus one million captured separators), so that's about 47 bytes per string pushed onto @a; allowing 32 bytes of string overhead per string (SV and PV structures) leaves 15 bytes per string, which, allowing for the trailing \0, rounding up to a multiple of 4, malloc overhead etc., looks reasonable.
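      If the temporary copies are the concern, one way to keep the peak lower (a sketch of the general idea, untested against this exact workload) is to tokenise incrementally with a /\G.../g loop instead of split, so the whole token list never sits on perl's stack at once:

      my @a;
      # match one separator or one run of non-separator characters per
      # iteration, and push tokens one at a time instead of all at once
      while ($content =~ /\G(\p{Z}|\p{IsSpace}|\p{P}|[^\p{Z}\p{IsSpace}\p{P}]+)/gc) {
          push @a, $1;
      }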

      Dave.

      I have Debian GNU/Linux on a boring 32 bit i386 machine.
      perl 5.8.8:

        8000028
        68000056
        Vsize: 322.68 MiB ( 338354176)
        RSS : 79112 pages

      perl 5.10.0:

        8000036
        84000100
        Vsize: 270.80 MiB ( 283951104)
        RSS : 68365 pages
Re^2: Tokenising a 10MB file trashes a 2GB machine
by PetaMem (Priest) on Jul 16, 2008 at 09:31 UTC

    The OS is 64-bit Gentoo Linux and the Perl is 5.8.8. Nevertheless, I initially had my own perl interpreter under suspicion, because it's compiled with GCC 4.3.1 -march=core2.

    Unfortunately, it behaved the same on all machines, so there is no "conservative" perl left as a reference. My bad. And since all of Perl's own test cases ran OK, this could be a candidate for a new test case, or even a new class of tests (expected memory consumption); see the sketch below. Maybe this could be carried to perl-porters.
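    A rough sketch of what such a test might look like (Linux-only via /proc; the 600 MiB threshold is made up and would need careful, platform-specific tuning):

    use strict;
    use warnings;
    use Test::More tests => 1;
    use Encode;

    my $content = decode('UTF-8', 'tralala ' x 1E6);
    my @tokens  = split m{(\p{Z}|\p{IsSpace}|\p{P})}ms, $content;

    # field 22 (0-indexed) of /proc/$$/stat is the virtual memory size
    open my $stat, '<', "/proc/$$/stat" or die "no /proc/$$/stat: $!";
    my @stat = split /\s+/, <$stat>;
    my $vsize_mib = $stat[22] / (1024 * 1024);

    cmp_ok($vsize_mib, '<', 600, 'tokenising 8MB of text stays under 600 MiB');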

    Bye
     PetaMem
        All Perl:   MT, NLP, NLU

Re^2: Tokenising a 10MB file trashes a 2GB machine
by Anonymous Monk on Jul 16, 2008 at 16:19 UTC

    Here's one more output:

    $ perl ./tokenizer.pl
    8000028
    68000056
    Vsize: 322.05 MiB ( 337694720)
    RSS : 79143 pages

    OS

    $ uname -a
    Linux ubuntu 2.6.24-19-generic #1 SMP Fri Jul 11 23:41:49 UTC 2008 i686 GNU/Linux

    GCC Version

    $ gcc -v
    gcc version 4.2.3 (Ubuntu 4.2.3-2ubuntu7)