Assuming you have some Linux flavour as your OS: could you please try the following script on your machine and post its output?
#!/usr/bin/perl
use warnings;
use strict;
use Devel::Size qw(size total_size);
use Encode;

# Build an ~8 MB UTF-8 string, print its size, then the total size of
# the token list produced by split, then the process memory usage.
my $content = decode('UTF-8', 'tralala ' x 1E6);
print size($content), "\n";
print total_size([split m{(\p{Z}|\p{IsSpace}|\p{P})}ms, $content]), "\n";
procinfo();

# Report Vsize and RSS for this process from /proc/$$/stat (Linux only).
sub procinfo {
    my @stat;
    my $MiB = 1024 * 1024;
    if (open(STAT, '<:utf8', "/proc/$$/stat")) {
        @stat = split /\s+/, <STAT>;
        close STAT;
    }
    else {
        die "procinfo: Unable to open stat file.\n";
    }
    printf "Vsize: %3.2f MiB (%10d)\n", $stat[22] / $MiB, $stat[22];
    print "RSS  : $stat[23] pages\n";
}
The only difference I see is that the 32-bit architecture takes half the space that the 64-bit one does. But still, there is a factor of 5 between the virtual memory taken and the total size of the split list.
# ./tokenizer.pl
8000028
68000056
Vsize: 322.56 MiB ( 338231296)
RSS : 79087 pages

# ./tokenizer.pl
8000048
112000096
Vsize: 537.61 MiB ( 563724288)
RSS : 130586 pages

$ tokenizer.pl
8000048
112000096
Vsize: 539.42 MiB ( 565620736)
RSS : 130571 pages
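As a rough cross-check of where the large total_size figures come from (this is my own illustration, not part of the script above): every token that split returns becomes a separate scalar, with its own SV header, string buffer and array slot, so the per-token overhead dwarfs the handful of payload bytes. Exact numbers vary with the perl version and a 32- vs 64-bit build.

#!/usr/bin/perl
# Illustration only: cost of one short token as a Perl scalar, and of a
# list of many such tokens.  Numbers depend on perl version and word size.
use strict;
use warnings;
use Devel::Size qw(size total_size);

my $token = 'tralala';                  # 7 bytes of payload
print 'one token scalar: ', size($token), " bytes\n";

my @tokens = ('tralala') x 100_000;     # 100k copies, one SV each
print '100_000 tokens  : ', total_size(\@tokens), " bytes\n";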
So no matter what, memory usage is always about five times higher than it should be (the roughly 30 MB of Perl overhead is present even when the data is only a few bytes), which makes me very unhappy...
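For comparison, this is roughly how I would pin down the fixed ~30 MB baseline on a given box (my own sketch, Linux only, loading the same modules as the script above but essentially no data):

#!/usr/bin/perl
use strict;
use warnings;
use Devel::Size qw(size total_size);   # same modules as the tokenizer
use Encode;

# Read the same /proc/$$/stat fields as procinfo(), with only a few
# bytes of data decoded; whatever this prints is the overhead to
# subtract before judging the blow-up factor.
my $content = decode('UTF-8', 'tralala ');
open my $fh, '<', "/proc/$$/stat" or die "Cannot read /proc/$$/stat: $!\n";
my @stat = split /\s+/, scalar <$fh>;
close $fh;
printf "Vsize: %.2f MiB\n", $stat[22] / (1024 * 1024);
printf "RSS  : %d pages\n", $stat[23];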
Bye
PetaMem All Perl: MT, NLP, NLU