perl -Mre=debug -e"$f = q~a b c~ x 1E4;$g = [split m{\p{IsSpace}}ms, $f ];" 2>2
The sheer size of the debugger output makes it impossible to run this with the 1E7 multiplier. I still do not know how to interpret the output, but maybe someone here does.
Multiplier   Size of debugger output
1            14KiB
10           21KiB
100          137KiB
1E3          5.7MiB
1E4          507MiB
1E5          50GiB

Therefore I predict that the debugger output would be (at least) about 5TiB for 1E6. The size comes from the fact that every time the regexp matches, the complete remaining dataset is printed out again, shortened by one token. The numbers above therefore halve if we use e.g. q{1234 } instead of q{a b c}.
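The 5TiB figure can be checked with a small sketch. It assumes the output grows with the square of the multiplier, which is only what the last three rows of the table suggest (roughly a factor of 100 per factor of 10), not something I have verified in the regex engine. The helper names extrapolate_quadratic and count_matches are made up for this illustration.

```perl
use strict;
use warnings;

# Assumption: output size scales with the square of the multiplier,
# as the 1E3 -> 1E4 -> 1E5 rows of the table suggest.
sub extrapolate_quadratic {
    my ($known_mult, $known_size, $target_mult) = @_;
    return $known_size * ($target_mult / $known_mult) ** 2;
}

# Number of times \p{IsSpace} matches, i.e. how often the
# remaining string gets printed by the debugger.
sub count_matches {
    my ($text) = @_;
    my $count = () = $text =~ /\p{IsSpace}/g;
    return $count;
}

# Start from the measured 50GiB at multiplier 1E5.
my $gib = extrapolate_quadratic(1e5, 50, 1e6);
printf "predicted for 1E6: %.0f GiB (about %.1f TiB)\n", $gib, $gib / 1024;

# Both strings are 5 characters per repetition, but q{1234 } has
# half as many separators, hence half as many printouts.
printf "matches: %d vs %d\n",
    count_matches(q{a b c} x 1000),
    count_matches(q{1234 } x 1000);
```

This prints 5000 GiB (about 4.9 TiB) for the 1E6 case, and 2000 versus 1000 matches for the two token strings, which fits the halving mentioned above.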

In between these printouts there is always the same output:

Matching REx "\p{IsSpace}" against "1234 1234 1234 1234 1234 1234 1234 "
Matching stclass "ANYOF[{unicode}\11-\15 \302\205\302\240...+utf8::IsSpace]" against "1234 1234 1234 1234 1234 1234 1234 "
  Setting an EVAL scope, savestack=6
 49969 < 1234> < 1234 1> | 1: ANYOF[{unicode}\11-\15 \302\205\302\240...+utf8::IsSpace]
 49970 <1234 > <1234 12> | 13: END
Match successful!
Matching REx "\p{IsSpace}" against "1234 1234 1234 1234 1234 1234 "
Matching stclass "ANYOF[{unicode}\11-\15 \302\205\302\240...+utf8::IsSpace]" against "1234 1234 1234 1234 1234 1234 "
  Setting an EVAL scope, savestack=6
 49974 < 1234> < 1234 1> | 1: ANYOF[{unicode}\11-\15 \302\205\302\240...+utf8::IsSpace]
 49975 <1234 > <1234 12> | 13: END
Match successful!
Matching REx "\p{IsSpace}" against "1234 1234 1234 1234 1234 "
Matching stclass "ANYOF[{unicode}\11-\15 \302\205\302\240...+utf8::IsSpace]" against "1234 1234 1234 1234 1234 "
  Setting an EVAL scope, savestack=6
 49979 < 1234> < 1234 1> | 1: ANYOF[{unicode}\11-\15 \302\205\302\240...+utf8::IsSpace]
 49980 <1234 > <1234 12> | 13: END
Match successful!
Matching REx "\p{IsSpace}" against "1234 1234 1234 1234 "

(this is taken from near the end of the debugger output to keep the size of the data sections small)

So unfortunately I do not see much in this output that could give me a hint about the additional memory consumption. Except perhaps the "savestack=6", but I guess that is the same on every other perl interpreter. I'll try to compile Perl conservatively with an old GCC and a generic CPU architecture (maybe the new gcc does some wasteful alignment for the Core2 architecture).
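To make such a comparison between builds easier, the settings a given perl was compiled with can be dumped via the core Config module. This is only a sketch; the choice of keys is my guess at what might matter here, and build_summary is a name made up for the example.

```perl
use strict;
use warnings;
use Config;    # core module exposing the settings this perl was built with

# Collect the build settings that seem relevant when comparing
# compilers and architectures (the key selection is an assumption).
sub build_summary {
    my @keys = qw(archname cc gccversion optimize ccflags usemymalloc ptrsize ivsize);
    return map { $_ => (defined $Config{$_} ? $Config{$_} : '(undef)') } @keys;
}

my %summary = build_summary();
printf "%-12s %s\n", $_, $summary{$_} for sort keys %summary;
```

Running this on both the suspect perl and the conservatively compiled one shows at a glance whether compiler, optimisation flags, malloc or pointer size differ.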

Bye
 PetaMem
    All Perl:   MT, NLP, NLU


In reply to Re^2: Tokenising a 10MB file trashes a 2GB machine by PetaMem
in thread Tokenising a 10MB file trashes a 2GB machine by PetaMem
