leszekdubiel has asked for the wisdom of the Perl Monks concerning the following question:

Hello Perl Monks!

Writing Perl feels like riding a vintage VW bus. Things don’t work the way you expect, but you can always feel the love. (Learned from article)

I have a program that parses big strings -- 30MB of data. It makes heavy use of "\G" to continue parsing from the point where the previous match ended. Normally the program runs for 6 (six) seconds. But every few days it runs for 4 hours (yes... four hours) and consumes all the computing power on the server.

My program reads $big_string from a file, encloses that string in parentheses (creating a new string), then passes a reference to the newly created string to the function "list_extr", which does the parsing and returns the deserialized data structure.
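A minimal sketch of the "\G" technique itself, assuming a toy grammar of words and integers. The sub "tokens" and its sample input are hypothetical and are not taken from the real program; they only show how each match resumes at pos() and how the /c flag keeps pos() intact when a match fails.

```perl
use strict;
use warnings;

# Hypothetical toy tokenizer: splits a string into words and integers.
# \G anchors each match at pos($$ref); the /c flag keeps pos() unchanged
# when a match fails, so the next alternative is tried from the same place.
sub tokens {
    my ($ref) = @_;
    my @out;
    pos($$ref) = 0;
    while (1) {
        $$ref =~ /\G\s+/gc;    # skip whitespace, if any
        if    ($$ref =~ /\G([[:alpha:]]+)/gc) { push @out, "word:$1" }
        elsif ($$ref =~ /\G(\d+)/gc)          { push @out, "int:$1" }
        else                                  { last }
    }
    return (\@out, pos($$ref));
}

my $text = "abc 42 de";
my ($toks, $end) = tokens(\$text);
print join(",", @$toks), " stopped at $end\n";
```

The parser is handed a reference, so pos() is tracked on the caller's string rather than on a copy.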

The function "list_extr" gets a reference to the big string it should parse. I have found that when it is called like this (interpolate, take reference):

list_extr(\"($big_string)")

or like this (combine with dot, get reference to whole combination):

list_extr(  \( "(" . $big_string . ")" )      )

or like this (interpolate, save in new variable, take ref to variable):

my $s = "($big_string)"; list_extr(\$s)

it is sometimes very, very slow.

To solve the problem I have to pass that through sprintf:

list_extr(\sprintf("%s", $big_string))

This makes function "list_extr" work very fast (six seconds instead of a few hours).

My goal is to take $big_string, add parentheses at the beginning and end of that string, and pass a reference to the newly created string (enclosed in parentheses) to the function list_extr. I hope that's clear.

I think the problem is with some string optimizations in perl. When using a string interpolated by perl, it doesn't create a new string, but somehow computes the parsing positions (pos $big_string, \G in regex) -- this takes a lot of computation. When using sprintf, perl doesn't do the optimization, but creates a new, plain, non-interpolated, non-combined, simple string. I think that the optimization is sometimes done, sometimes not -- this is why the problem occurs only once every few days. Below are the parsing functions.

I have found that this solution sometimes is fast, sometimes slow:

list_extr(\( "(" . $big_string . ")" ));

and this solution is ALWAYS slow:

my $ttt = "(" . $big_string . ")"; list_extr(\$ttt);
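One way to compare the calling styles outside the real program is a small timing harness like the sketch below. Everything in it is an assumption for illustration: the generated $big_string stands in for the real 30MB file, the scan loop is a trivial "\G" walk rather than list_extr, and the sprintf variant uses "(%s)" so that all three strings are identical. It also prints the internal UTF-8 flag of each string, since that is one property that can differ between strings and that affects how character offsets such as pos() map to byte offsets.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Hypothetical stand-in for the real input; kept small here.
my $big_string = join "\n", map { "item$_ (x 1.5) # comment" } 1 .. 20_000;

# Three ways of building the parenthesised copy.
my %variant = (
    interpolated => sub { my $s = "($big_string)";              \$s },
    concatenated => sub { my $s = "(" . $big_string . ")";      \$s },
    sprintf      => sub { my $s = sprintf("(%s)", $big_string); \$s },
);

# Walk the whole string with a \G loop; return elapsed seconds and
# the number of chunks matched.
sub scan_time {
    my ($ref) = @_;
    pos($$ref) = 0;
    my $t0 = time;
    my $n  = 0;
    $n++ while $$ref =~ /\G(?:\s+|\S+)/gc;
    return (time - $t0, $n);
}

for my $name (sort keys %variant) {
    my $ref = $variant{$name}->();
    my ($secs, $count) = scan_time($ref);
    printf "%-13s utf8=%d length=%d chunks=%d %.3fs\n",
        $name, utf8::is_utf8($$ref) ? 1 : 0, length $$ref, $count, $secs;
}
```

If the timings differ only with the real data, feeding the real file into $big_string and swapping scan_time for list_extr would be the next step.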

# \G(?:\s|#.*$)*  -- means start from last position \G,
#                    skip spaces and comments
#                    till the
#                    end of line
# ([[:alpha:]](?:_?[[:alnum:]])*) -- my identifier
#                    restrictions; start with letter, then
#                    letters, underscores, digits; but
#                    two underscores in a row not allowed,
#                    underscore at the end not allowed

sub list_extr {
    my ($a) = @_;
    ref $a eq 'SCALAR' or croak "wrong ref";
    my @l;
    $$a =~ /\G(?:\s|#.*$)*\(/mgc or croak "parse err";
    while ($$a =~ /\G(?:\s|#.*$)*([[:alpha:]](?:_?[[:alnum:]])*)(?:\s|#.*$)*/mgc) {
        push @l, {'name' => $1, 'parm' => parm_extr($a)};
    }
    $$a =~ /\G(?:\s|#.*$)*\)(?:\s|#.*$)*/mgc or croak "parse err";
    return \@l;
}

sub parm_extr {
    my ($a) = @_;
    ref $a eq 'SCALAR' or croak "wrong ref";
    my %p;
    $$a =~ /\G(?:\s|#.*$)*\(/mgc or croak "parse err";
    while ($$a =~ /\G(?:\s|#.*$)*([[:alpha:]](?:_?[[:alnum:]])*)(?:\s|#.*$)*/mgc) {
        my $n = $1;
        if ($$a =~ /\G([[:alpha:]](?:_?[[:alnum:]])*|"(?:[^\\"[:cntrl:]]+|\\[\\"nt])*")/mgc) {
            $p{$n} = $1;
        } elsif ($$a =~ /\G(?=[-+.\d])/mgc) {
            $p{$n} = numb_extr($a);
        } elsif ($$a =~ /\G(?=\()/mgc) {
            $p{$n} = parm_extr($a);
        } else {
            croak "parse err";
        }
    }
    $$a =~ /\G(?:\s|#.*$)*\)(?:\s|#.*$)*/mgc or croak "parse err";
    return \%p;
}

sub numb_extr {
    my ($a) = @_;
    ref $a eq 'SCALAR' or croak "wrong ref";
    $$a =~ /\G(?:\s|#.*$)*([-+]?\d*(\.\d*)?)/mgc or croak "parse err";
    my $n = $1;
    $n eq '0.0' and return 0;
    $n =~ /\A[-+](?!0.0\z)(?=[1-9]|0\.)\d+\.\d+(?<=[.\d][1-9]|\.0)\z/ or croak "parse err";
    length $n <= 15 + 2 or croak "numb too long";
    $n = 0 + $n;
    # 1234567890.12345
    abs $n > 99999999999999.9 and croak "numb out of range";
    return 0 + $n;
}


Re: Hypothesis: some magic string optimalization in perl kills my server from time to time
by tybalt89 (Monsignor) on Sep 30, 2016 at 13:37 UTC

    I had what I think is a similar problem. When I changed my whitespace eater to

    my $ws = qr/(?:#.*|\s+)*+/; # white space

    the problem went away. Try it and let us know. Also something like

    my $ws = qr/(?:#.*+|\s++)*/; # white space

    should work, or

    my $ws = qr/(?>(?:#.*|\s+)*)/; # white space

    anything that will not give back after matching.
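A minimal sketch of what "not giving back" means, using made-up inputs rather than the real data. The first pair of patterns shows a whitespace eater in greedy and possessive form on an input that cannot match; both fail, but the possessive one does so without retrying shorter matches. The second pair shows a case where the difference is visible in the result itself.

```perl
use strict;
use warnings;

# Hypothetical input: a comment, some blank space, then no "X" anywhere.
my $input = "# comment\n   \n(";

# Greedy: after failing to find "X", the engine may retry with shorter
# matches of \s+ and .* before giving up.
my $greedy = qr/\G(?:#.*|\s+)*X/;

# Possessive: whatever the group consumed is never handed back.
my $possessive = qr/\G(?:#.*+|\s++)*+X/;

# Both patterns reject this input; only the amount of retrying differs.
my $greedy_ok     = $input =~ $greedy     ? 1 : 0;
my $possessive_ok = $input =~ $possessive ? 1 : 0;

# Here the difference changes the outcome:
my $plain_match      = "aaa" =~ /\Aa+a\z/  ? 1 : 0;    # a+ gives back one "a"
my $possessive_match = "aaa" =~ /\Aa++a\z/ ? 1 : 0;    # a++ keeps all three

print "greedy=$greedy_ok possessive=$possessive_ok ",
      "a+a=$plain_match a++a=$possessive_match\n";
```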

      I have changed to the possessive version -- unfortunately it didn't help... Still the same effect

      $$a =~ /\G(?:\s++|#.*+)*+\(/gc or croak "parenthesis expected"

        And do you have a small input case where the slowness happens (so I can do testing at home :) ?

        Which regex does it stop at?