
Hello Perl Monks!

Writing Perl feels like riding a vintage VW bus. Things don’t work the way you expect, but you can always feel the love. (Learned from article)

I have a program that parses big strings -- 30MB of data. It makes heavy use of "\G" to continue parsing from the point where the previous match ended. Normally the program runs for 6 (six) seconds. But every few days it runs for 4 hours (yes... four hours) and consumes all the computing power on the server.
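For readers less familiar with "\G", here is a minimal sketch of the technique, not the real parser. The sub name `tokens` and the sample text are made up for illustration. Each match with /gc resumes at pos() of the string, and the /c flag keeps pos() unchanged when a match fails, so the next alternative can be tried at the same place.

```perl
use strict;
use warnings;

# Hypothetical mini-tokenizer (illustration only, not the parser from this post).
# Every match is anchored with \G at the position where the previous match ended.
sub tokens {
    my ($ref) = @_;
    my @out;
    while (1) {
        if    ($$ref =~ /\G\s+/gc)      { next }                   # skip whitespace
        elsif ($$ref =~ /\G([a-z]+)/gc) { push @out, "word:$1" }   # lowercase word
        elsif ($$ref =~ /\G(\d+)/gc)    { push @out, "num:$1" }    # integer
        else                            { last }                   # nothing matches here
    }
    # pos() is still defined because /c preserved it across the failed matches
    return (\@out, pos($$ref));
}

my $text = "abc 42 def";
my ($tok, $end) = tokens(\$text);
print join(",", @$tok), " stopped at $end\n";
```

The string is passed by reference, as in the real program, so pos() lives on the one shared scalar and every sub continues where the previous one stopped.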

My program reads $big_string from a file, encloses that string in parentheses (creating a new string), then passes a reference to the newly created string to the function "list_extr", which does the parsing and returns the deserialized data structure.

The function "list_extr" gets a reference to the big string it should parse. I have found that when it is called like this (interpolate, take reference):

list_extr(\"($big_string)")

or like this (combine with dot, get reference to whole combination):

list_extr( \( "(" . $big_string . ")" ) )

or like this (interpolate, save in new variable, take ref to variable):

my $s = "($big_string)"; 
list_extr(\$s)

it is sometimes very, very slow.

To work around the problem I have to pass the string through sprintf:

list_extr(\sprintf("(%s)", $big_string))

This makes the function "list_extr" run very fast (six seconds instead of a few hours).

My goal is to take $big_string, add parentheses at the beginning and end of that string, and pass a reference to the newly created string (enclosed in parentheses) to the function list_extr. I hope that's clear.

I think the problem is with some string optimization in perl. When the string is built by interpolation, perl perhaps does not create a new plain string, and computing the parse positions (pos $big_string, \G in regex) then takes a lot of computation. When using sprintf, perl perhaps skips that optimization and creates a new, plain, non-interpolated, non-combined, simple string. I think the optimization is applied only sometimes, which would explain why the problem occurs only once every few days. The parsing functions are below.
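One way to test this hypothesis is to compare what perl reports about the differently built strings. The sketch below is only a diagnostic idea and does not confirm any cause. The helper `describe_string` and the sample data are made up for illustration. It reports the UTF-8 flag and the character and byte lengths. Those matter because pos() counts characters while the buffer holds bytes, so a UTF-8 flagged string needs offset conversions that a plain byte string does not.

```perl
use strict;
use warnings;

# Hypothetical helper: report a few properties of the string behind a reference.
sub describe_string {
    my ($ref) = @_;
    my $bytes = do { use bytes; length $$ref };   # length of the internal buffer in bytes
    return {
        utf8_flag => utf8::is_utf8($$ref) ? 1 : 0, # is the internal UTF-8 flag set?
        chars     => length($$ref),                # length in characters
        bytes     => $bytes,
    };
}

# Made-up sample data; the real input is the 30MB string read from the file.
my $big_string = "alpha (x 1.5)";
my $interp     = "($big_string)";
my $via_spf    = sprintf("(%s)", $big_string);

for my $pair ([interpolated => \$interp], [sprintf => \$via_spf]) {
    my ($label, $ref) = @$pair;
    my $d = describe_string($ref);
    printf "%-12s utf8=%d chars=%d bytes=%d\n",
        $label, @$d{qw(utf8_flag chars bytes)};
}
```

If the fast and slow strings differ in any of these values, that narrows the search. If they do not, the core module Devel::Peek can dump the full scalar internals for a closer look.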

I have found that this solution sometimes is fast, sometimes slow:

list_extr(\( "(" . $big_string . ")" ));

and this solution is ALWAYS slow:

my $ttt = "(" . $big_string . ")";
list_extr(\$ttt);
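To compare the call forms side by side, a small timing harness along these lines could be used. This is a sketch under assumptions: `time_call` and `count_idents` are made-up names, `count_idents` is only a stand-in for list_extr, and the sample string is far too small to reproduce the slowdown. The real test would substitute list_extr and the 30MB input.

```perl
use strict;
use warnings;
use Time::HiRes ();

# Hypothetical helper: run a code ref once, return (elapsed seconds, result).
sub time_call {
    my ($code) = @_;
    my $t0     = Time::HiRes::time();
    my $result = $code->();
    return (Time::HiRes::time() - $t0, $result);
}

# Stand-in for list_extr: counts words using \G-anchored matches with /gc.
sub count_idents {
    my ($ref) = @_;
    my $n = 0;
    $n++ while $$ref =~ /\G\W*(\w+)/gc;
    return $n;
}

# Made-up sample data instead of the real file contents.
my $big_string = join " ", ("name") x 1000;

my %variants = (
    interpolate => sub { count_idents(\"($big_string)") },
    concat      => sub { count_idents(\("(" . $big_string . ")")) },
    variable    => sub { my $s = "($big_string)"; count_idents(\$s) },
    sprintf     => sub { count_idents(\sprintf("(%s)", $big_string)) },
);

for my $name (sort keys %variants) {
    my ($secs, $count) = time_call($variants{$name});
    printf "%-12s %6d idents in %.4f s\n", $name, $count, $secs;
}
```

Running each variant several times on the same input would also show whether the slowness depends on the call form alone or varies from run to run.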

 
# \G(?:\s|#.*$)* -- means start from last position \G, 
#                   skip spaces and comments # till the 
#                   end of line
# ([[:alpha:]](?:_?[[:alnum:]])*) -- my identifier 
#           restrictions; start with letter, then 
#           letters, underscores, digits; but 
#           two underscores in a row not allowed, 
#           underscore at the end not allowed

use Carp qw(croak);   # croak is used by all three subs below

sub list_extr {
        my ($a) = @_;
        ref $a eq 'SCALAR' or croak "wrong ref";
        my @l;
        $$a =~ /\G(?:\s|#.*$)*\(/mgc or croak "parse err";
        while ($$a =~ /\G(?:\s|#.*$)*([[:alpha:]](?:_?[[:alnum:]])*)(?:\s|#.*$)*/mgc) {
                push @l, {'name' => $1, 'parm' => parm_extr($a)};
        }
        $$a =~ /\G(?:\s|#.*$)*\)(?:\s|#.*$)*/mgc or croak "parse err";
        return \@l;
}

sub parm_extr {
        my ($a) = @_;
        ref $a eq 'SCALAR' or croak "wrong ref";
        my %p;
        $$a =~ /\G(?:\s|#.*$)*\(/mgc or croak "parse err";
        while ($$a =~ /\G(?:\s|#.*$)*([[:alpha:]](?:_?[[:alnum:]])*)(?:\s|#.*$)*/mgc) {
                my $n = $1;
                if ($$a =~ /\G([[:alpha:]](?:_?[[:alnum:]])*|"(?:[^\\"[:cntrl:]]+|\\[\\"nt])*")/mgc) {
                        $p{$n} = $1;
                } elsif ($$a =~ /\G(?=[-+.\d])/mgc) {
                        $p{$n} = numb_extr($a);
                } elsif ($$a =~ /\G(?=\()/mgc) {
                        $p{$n} = parm_extr($a);
                } else {
                        croak "parse err"; 
                }
        }
        $$a =~ /\G(?:\s|#.*$)*\)(?:\s|#.*$)*/mgc or croak "parse err";
        return \%p;
}

sub numb_extr {
        my ($a) = @_;
        ref $a eq 'SCALAR' or croak "wrong ref";
        $$a =~ /\G(?:\s|#.*$)*([-+]?\d*(\.\d*)?)/mgc or croak "parse err";
        my $n = $1;
        $n eq '0.0' and return 0;
        $n =~ /\A[-+](?!0.0\z)(?=[1-9]|0\.)\d+\.\d+(?<=[.\d][1-9]|\.0)\z/ or croak "parse err";
        length $n <= 15 + 2 or croak "numb too long"; 
        $n = 0 + $n;
        #        1234567890.12345
        abs $n > 99999999999999.9 and croak "numb out of range"; 
        return 0 + $n;
}
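
The identifier rule described in the comments above can be checked in isolation. This is a small sketch with made-up test names. It uses the same pattern as the parser, but anchored with \A and \z so that whole strings are tested. The helper `is_ident` is hypothetical and not part of the program.

```perl
use strict;
use warnings;

# Same identifier pattern as in the parser, anchored to test whole strings.
my $ident = qr/\A[[:alpha:]](?:_?[[:alnum:]])*\z/;

# Hypothetical helper: 1 if the string is a valid identifier, 0 otherwise.
sub is_ident {
    my ($s) = @_;
    return $s =~ $ident ? 1 : 0;
}

# Expected: abc, a_b and a1_2 are accepted; a__b (double underscore),
# ab_ (trailing underscore), _ab and 1ab (not starting with a letter) are rejected.
for my $s (qw(abc a_b a1_2 a__b ab_ _ab 1ab)) {
    printf "%-5s %s\n", $s, is_ident($s) ? "ok" : "rejected";
}
```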

In reply to Hypothesis: some magic string optimalization in perl kills my server from time to time by leszekdubiel
