Dear monks,

It seems I have, once again, stumbled across an example of Perl's "obscene memory consumption habits". Basically, I tried to tokenize a 10MB file in memory, and when that crashed my computer I took a closer look:

Take emails (plain text, no HTML, no attachments), concatenate them into a 10MB file, then do something like

use Devel::Size qw(size total_size);

my $content = slurp 'file';
print size($content), "\n";
print total_size([split m{\p{IsSpace}}ms, $content]), "\n";

Using Devel::Size to find the culprit gives 10485544 (file size) and 370379304 (result of split). While these two numbers are within expectation, the script takes more than 1.8GB of RAM before it can print the second number, which I think is somewhat insane. This is a 64-bit Perl 5.8.8 on x86_64.

Of course I am aware of String::Tokenizer and other iterative approaches to tokenizing tasks. I would just like to know from someone more knowledgeable about Perl's internals why there is a *hidden* memory consumption, by a factor of 5, that I cannot explain. Is it something special about split? Is some wild copying happening?
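
For comparison, this is roughly what I mean by an iterative approach: a minimal sketch that walks the string with a global match in scalar context, so the full token list is never built. The sample string and the variable names are made up for illustration, and it matches runs of non-whitespace rather than splitting on every single whitespace character, so it yields no empty tokens, unlike the split above.

```perl
use strict;
use warnings;

# Stand-in for the slurped file content (assumption: any string works here)
my $content = "one two\tthree\n\nfour";

my ($tokens, $longest) = (0, '');

# In scalar context, //g returns one match per iteration and remembers
# its position, so only the current token is held in $1.
while ($content =~ /(\S+)/g) {
    $tokens++;
    $longest = $1 if length($1) > length($longest);
}

print "$tokens tokens, longest is '$longest'\n";
```

Whether this avoids the hidden overhead on a given Perl build is something I would still measure rather than assume.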

edit:
I've learned from this: don't use split on large strings. That is, if you have a whole file, process it line by line or in similar chunks. In other words: make sure the string you feed to split has a guaranteed maximum length, or your machine will choke someday.
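
A minimal sketch of what that can look like, assuming whitespace-separated tokens are all that is needed. The helper name for_each_token and the sample text are hypothetical, and the in-memory filehandle only stands in for the real 10MB file. Note that split ' ' collapses runs of whitespace, so it will not produce the empty tokens that splitting on a single \p{IsSpace} does.

```perl
use strict;
use warnings;

# Hypothetical helper: call $callback once per whitespace-separated token,
# reading one line at a time so split never sees more than a single line.
sub for_each_token {
    my ($fh, $callback) = @_;
    my $count = 0;
    while (defined(my $line = <$fh>)) {
        # split ' ' skips leading whitespace and splits on whitespace runs
        for my $token (split ' ', $line) {
            $callback->($token);
            $count++;
        }
    }
    return $count;
}

# Usage: count token frequencies in a small in-memory "file"
my $text = "From: a\@example.com\nSubject: hello world\n\nBody text here\n";
open my $fh, '<', \$text or die "cannot open in-memory file: $!";

my %freq;
my $n = for_each_token($fh, sub { $freq{ $_[0] }++ });
close $fh;

print "$n tokens, ", scalar(keys %freq), " distinct\n";
```

The memory needed per split call is then bounded by the longest line, not by the file size. If the input can contain very long lines, reading fixed-size chunks would be the safer variant.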

Bye
 PetaMem
    All Perl:   MT, NLP, NLU


In reply to Tokenising a 10MB file trashes a 2GB machine by PetaMem
