Hello Perl Monks!

Writing Perl feels like riding a vintage VW bus. Things don’t work the way you expect, but you can always feel the love. (Learned from article)

I have a program that parses big strings -- 30MB of data. It intensively uses "\G" to continue parsing from the poit it has matched previously. Normally that program runs 6 (six) seconds. But every few days it is running 4 hours (yes... four hours) and consumes all computation power on server.

My program reads $big_string from file, encloses that string in parentheses (creating a new string), then passes reference to that newly created string to function "list_extr" which does parsing and returns deserialized data structure.

Function "list_extr" gets reference to the big string it should parse. I have found that when called like this (interpolate, take reference):

list_extr(\"($big_string)")

or like this (combine with dot, get reference to whole combination):

list_extr(  \( "(" . $big_string . ")" )      )

or like this (interpolate, save in new variable, take ref to variable):

my $s = "($big_string)"; list_extr(\$s)

it is sometimes very, very slow.

To solve the problem I have to pass that through spritnf:

list_extr(\sprintf("%s", $big_string))

This makes function "list_extr" work very fast (six seconds instead of a few hours).

My goal is to get $big_string, add parentheses at the begining and end of that string, pass reference of newly created string (enclosed in parentheses) to function list_extr. I hope that's clear.

I think the problem is with some string optimalizations in perl. When using string interpolated by perl it doesn't create new string, but somehow computes positions of parsing (pos $big_string, \G in regex) -- this takes a lot of computations. When using sprintf perl doesn't do optimalization, but creates new, plain, non interpolated, non combined, simple string. I think that optimalization is sometimes done, sometimes not -- this is why the problem occurs only once a few days. Below are parsing functions.

I have found that this solution sometimes is fast, sometimes slow:

list_extr(\( "(" . $big_string . ")" ));

and this solution is ALWAYS slow:

my $ttt = "(" . $big_string . ")"; list_extr(\$ttt);

# \G(?:\s|#.*$)* -- means start from last position \G, # skip spaces and comments # till the # end of line # ([[:alpha:]](?:_?[[:alnum:]])*) -- my identifier # restrictions; start with letter, then # letters, underscores, digits; but # two underscores in a row not allowed, # underscore at the end not allowed sub list_extr { my ($a) = @_; ref $a eq 'SCALAR' or croak "wrong ref"; my @l; $$a =~ /\G(?:\s|#.*$)*\(/mgc or croak "parse err"; while ($$a =~ /\G(?:\s|#.*$)*([[:alpha:]](?:_?[[:alnum:]])*)(? +:\s|#.*$)*/mgc) { push @l, {'name' => $1, 'parm' => parm_extr($a)}; } $$a =~ /\G(?:\s|#.*$)*\)(?:\s|#.*$)*/mgc or croak "parse err"; return \@l; } sub parm_extr { my ($a) = @_; ref $a eq 'SCALAR' or croak "wrong ref"; my %p; $$a =~ /\G(?:\s|#.*$)*\(/mgc or croak "parse err"; while ($$a =~ /\G(?:\s|#.*$)*([[:alpha:]](?:_?[[:alnum:]])*)(? +:\s|#.*$)*/mgc) { my $n = $1; if ($$a =~ /\G([[:alpha:]](?:_?[[:alnum:]])*|"(?:[^\\" +[:cntrl:]]+|\\[\\"nt])*")/mgc) { $p{$n} = $1; } elsif ($$a =~ /\G(?=[-+.\d])/mgc) { $p{$n} = numb_extr($a); } elsif ($$a =~ /\G(?=\()/mgc) { $p{$n} = parm_extr($a); } else { croak "parse err"; } } $$a =~ /\G(?:\s|#.*$)*\)(?:\s|#.*$)*/mgc or croak "parse err"; return \%p; } sub numb_extr { my ($a) = @_; ref $a eq 'SCALAR' or croak "wrong ref"; $$a =~ /\G(?:\s|#.*$)*([-+]?\d*(\.\d*)?)/mgc or croak "parse e +rr"; my $n = $1; $n eq '0.0' and return 0; $n =~ /\A[-+](?!0.0\z)(?=[1-9]|0\.)\d+\.\d+(?<=[.\d][1-9]|\.0) +\z/ or croak "parse err"; length $n <= 15 + 2 or croak "numb too long"; $n = 0 + $n; # 1234567890.12345 abs $n > 99999999999999.9 and croak "numb out of range"; return 0 + $n; }


In reply to Hypothesis: some magic string optimalization in perl kills my server from time to time by leszekdubiel

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.