Assuming you have some Linux flavour as your OS: could you please try the following script on your machine and post its output?
#!/usr/bin/perl
use warnings;
use strict;
use Devel::Size qw(size total_size);
use Encode;

# Build an ~8 MB UTF-8 string, print its size, then the total size of
# the token list produced by split, then the process memory usage.
my $content = decode('UTF-8', 'tralala ' x 1E6);
print size($content), "\n";
print total_size([split m{(\p{Z}|\p{IsSpace}|\p{P})}ms, $content]), "\n";
procinfo();

# Report Vsize and RSS for this process from /proc/$$/stat (Linux only).
sub procinfo {
    my @stat;
    my $MiB = 1024 * 1024;
    if (open(STAT, '<:utf8', "/proc/$$/stat")) {
        @stat = split /\s+/, <STAT>;
        close STAT;
    }
    else {
        die "procinfo: Unable to open stat file.\n";
    }
    printf "Vsize: %3.2f MiB (%10d)\n", $stat[22] / $MiB, $stat[22];
    print "RSS  : $stat[23] pages\n";
}
The only difference I see is that the 32-bit architecture takes half the space that the 64-bit one does. But still, there is a factor of 5 between the virtual memory taken and the total size of the split list.
# ./tokenizer.pl
8000028
68000056
Vsize: 322.56 MiB ( 338231296)
RSS : 79087 pages

# ./tokenizer.pl
8000048
112000096
Vsize: 537.61 MiB ( 563724288)
RSS : 130586 pages

$ tokenizer.pl
8000048
112000096
Vsize: 539.42 MiB ( 565620736)
RSS : 130571 pages
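As a rough cross-check of where the large total_size figures come from (this is my own illustration, not part of the script above): every token that split returns becomes a separate scalar, with its own SV header, string buffer and array slot, so the per-token overhead dwarfs the handful of payload bytes. Exact numbers vary with the perl version and a 32- vs 64-bit build.

#!/usr/bin/perl
# Illustration only: cost of one short token as a Perl scalar, and of a
# list of many such tokens.  Numbers depend on perl version and word size.
use strict;
use warnings;
use Devel::Size qw(size total_size);

my $token = 'tralala';                  # 7 bytes of payload
print 'one token scalar: ', size($token), " bytes\n";

my @tokens = ('tralala') x 100_000;     # 100k copies, one SV each
print '100_000 tokens  : ', total_size(\@tokens), " bytes\n";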
So no matter what, memory usage is always about five times higher than it should be (the roughly 30 MB of Perl overhead is present even when the data is only a few bytes), which makes me very unhappy...
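For comparison, this is roughly how I would pin down the fixed ~30 MB baseline on a given box (my own sketch, Linux only, loading the same modules as the script above but essentially no data):

#!/usr/bin/perl
use strict;
use warnings;
use Devel::Size qw(size total_size);   # same modules as the tokenizer
use Encode;

# Read the same /proc/$$/stat fields as procinfo(), with only a few
# bytes of data decoded; whatever this prints is the overhead to
# subtract before judging the blow-up factor.
my $content = decode('UTF-8', 'tralala ');
open my $fh, '<', "/proc/$$/stat" or die "Cannot read /proc/$$/stat: $!\n";
my @stat = split /\s+/, scalar <$fh>;
close $fh;
printf "Vsize: %.2f MiB\n", $stat[22] / (1024 * 1024);
printf "RSS  : %d pages\n", $stat[23];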
Bye
PetaMem All Perl: MT, NLP, NLU