in reply to Re: Tokenising a 10MB file trashes a 2GB machine
in thread Tokenising a 10MB file trashes a 2GB machine
Assuming you are running some flavour of Linux: could you please try the following script on your machine and report its output?
#!/usr/bin/perl
use warnings;
use strict;
use Devel::Size qw(size total_size);
use Encode;

my $content = decode('UTF-8', 'tralala ' x 1E6);
print size($content), "\n";
print total_size([split m{(\p{Z}|\p{IsSpace}|\p{P})}ms, $content]), "\n";
procinfo();

sub procinfo {
    my @stat;
    my $MiB = 1024 * 1024;
    if (open(STAT, '<:utf8', "/proc/$$/stat")) {
        @stat = split /\s+/, <STAT>;
        close STAT;
    }
    else {
        die "procinfo: Unable to open stat file.\n";
    }
    # field 22 (0-based) of /proc/$$/stat is vsize in bytes, field 23 is RSS in pages
    printf "Vsize: %3.2f MiB (%10d)\n", $stat[22] / $MiB, $stat[22];
    print "RSS : $stat[23] pages\n";
}
The only difference I see is that the 32-bit architecture takes half the space the 64-bit one does. But still, there is a factor of 5 between the virtual memory used and the total size of the split list.
# ./tokenizer.pl
8000028
68000056
Vsize: 322.56 MiB ( 338231296)
RSS : 79087 pages

# ./tokenizer.pl
8000048
112000096
Vsize: 537.61 MiB ( 563724288)
RSS : 130586 pages

$ tokenizer.pl
8000048
112000096
Vsize: 539.42 MiB ( 565620736)
RSS : 130571 pages
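Just to spell the factor out (my own arithmetic on the 64-bit figures above, not a new measurement):

# Vsize 565620736 bytes, total_size of the list 112000096 bytes,
# assuming roughly 30 MiB of baseline perl overhead
perl -e 'printf "%.1f\n", (565620736 - 30*1024*1024) / 112000096'   # prints ~4.8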
So no matter what, memory usage is always about five times higher than it should be (roughly 30 MB of that is perl overhead, present even when the data is just a few bytes). Which makes me very unhappy...
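For comparison, here is a minimal sketch (my addition; the exact numbers vary by perl version and architecture) that looks at how much memory a single short token costs once it lives in its own scalar inside a list:

#!/usr/bin/perl
use strict;
use warnings;
use Devel::Size qw(size total_size);

# A single token, like those produced by the split above.
my $token = 'tralala';

printf "characters in token   : %d\n", length $token;
printf "size of the scalar    : %d bytes\n", size($token);        # SV head + string buffer
printf "one-element array ref : %d bytes\n", total_size([$token]); # plus AV and pointer

Multiplied by the millions of tokens the split produces, that per-scalar overhead alone could account for a large part of the gap, though I have not verified the exact breakdown.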
Bye
PetaMem All Perl: MT, NLP, NLU
Replies are listed 'Best First'.
Re^3: Tokenising a 10MB file trashes a 2GB machine
by dave_the_m (Monsignor) on Jul 16, 2008 at 13:19 UTC
Re^3: Tokenising a 10MB file trashes a 2GB machine
by moritz (Cardinal) on Jul 16, 2008 at 12:22 UTC