Dear monks,
It seems I have - again - stumbled across an example of Perl's "obscene memory consumption habits". Basically, I tried to tokenize a 10MB file in memory, and when the attempt crashed my computer I gave it a closer look:
Take some emails (plain text, no HTML, no attachments), concatenate them into a 10MB file, and then do something like
  use Devel::Size qw(size total_size);
  use File::Slurp qw(slurp);   # assuming slurp() comes from File::Slurp or a similar helper

  my $content = slurp('file');
  print size($content), "\n";
  print total_size([ split m{\p{IsSpace}}ms, $content ]), "\n";
Using Devel::Size to determine who the culprit is gives 10485544 bytes (the file size) for the first number and 370379304 bytes (the result of split) for the second. While both numbers are within expectation, the script takes more than 1.8GB of RAM before it is able to print the second one, which I think is somewhat insane. This is 64-bit Perl 5.8.8 on x86_64.
Of course I am aware of String::Tokenizer and other iterative approaches to tokenizing tasks. I would just like to know from someone more knowledgeable about Perl's internals why there is *hidden* memory consumption of roughly a factor of 5 (1.8GB of RAM versus the 370MB that Devel::Size reports) that I cannot explain. Is it something special about split? Some wild copying happening?
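For reference, here is a small back-of-envelope sketch of how to see what a single short token costs once it becomes a Perl scalar; the ~5-byte average token length and the resulting token count are illustrative assumptions, not measurements from my corpus:

  use Devel::Size qw(size total_size);

  # Cost of one short token held as a Perl scalar (SV head + body +
  # string buffer), as reported by Devel::Size.
  my $token = 'foo';
  print 'one short token as an SV: ', size($token), " bytes\n";

  # Rough extrapolation: a 10MB file of ~5-byte whitespace-separated
  # tokens holds about 2 million tokens, so the token scalars alone
  # need roughly 2_000_000 * size($token) bytes, before counting the
  # array (or the temporary list on the stack) that holds them.
  my $tokens_estimate = 10 * 1024 * 1024 / 5;
  printf "estimated cost of the token scalars alone: %.0f MB\n",
      $tokens_estimate * size($token) / (1024 * 1024);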
edit:
I've learned from this: don't use split on large strings. That is, instead of splitting a whole file at once, process it line by line or in similarly sized chunks (see the sketch below). In other words: make sure the string you feed to split has a guaranteed maximum length, or your machine will choke someday.
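A minimal sketch of that chunked approach, assuming the corpus sits in a file called 'file' and that counting tokens is all we need here:

  use strict;
  use warnings;

  open my $fh, '<', 'file' or die "cannot open file: $!";

  my $count = 0;
  while (my $line = <$fh>) {
      # split only ever sees one line, so its temporary list stays small
      my @tokens = split m{\p{IsSpace}}ms, $line;
      $count += @tokens;    # or process @tokens right here
  }
  close $fh;

  print "$count tokens\n";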
Bye
PetaMem All Perl: MT, NLP, NLU