mtsachev has asked for the wisdom of the Perl Monks concerning the following question:

I have a custom template engine (it's too late to switch to a generic one from CPAN or elsewhere). It has to be able to run marked-up content through a function that processes the contents of a tag and/or the params of that tag. The tags are HTML-like. There are some issues with this code, though: the part that selects the HTML-like attributes, i.e. the first (.*?) block, doesn't work on all versions of perl. If I split it into two matches, first one without that block and then one with it, it works, but it's about twice as slow. Anyway, even when it's not split it performs miserably on some boxes. Any ideas about optimizing this regex?
my $re = qr/$this->{mask_start}\Q$key\E(.*?)$this->{mask_end}(.*?)$this->{mask_start_close}\Q$key\E$this->{mask_end_close}/is;
while ($this->{template} =~ /$re/) {
    my $params = $this->mask_block_params($1, $2);
    my $html = &$callback($key, $params);
    $this->{template} = $` . $html . $';
}
Using s/$&/$html/; on the last line instead of $` . $html . $' doesn't make much difference.

Re: Regex speed issue
by tlm (Prior) on Aug 30, 2005 at 12:15 UTC

    How about

    $this->{ template } =~ s/$re/$this->replacement( $callback, $key, $1, $2 )/eg;
    where
    sub replacement {
        my $this = shift;
        my ( $cb, $key, $first, $second ) = @_;
        return $cb->( $key, $this->mask_block_params( $first, $second ) );
    }
    I don't know if this is any faster than what you have, but at least it avoids the use of evil variables.

    the lowliest monk

      This executes in approximately the same time.
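For anyone who wants to try the s///eg approach in isolation, here is a minimal, self-contained sketch. The tag syntax follows the <mas:limit_string> example given elsewhere in this thread. The limit_string sub and the sample template are hypothetical stand-ins, not the poster's actual callback or data.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the poster's delimiters and tag name.
my $key = 'limit_string';
my $re  = qr{<mas:\Q$key\E(.*?)>(.*?)</mas:\Q$key\E>}is;

# Hypothetical callback: truncate the tag body to the length in max="N".
sub limit_string {
    my ( $attrs, $body ) = @_;    # copy $1/$2 before running another regex
    my ($max) = $attrs =~ /\bmax="(\d+)"/;
    return defined $max ? substr( $body, 0, $max ) : $body;
}

my $template = 'a <mas:limit_string max="4">test test</mas:limit_string> b '
             . '<mas:limit_string>keep</mas:limit_string> c';

# /g continues after each replacement instead of rescanning from the start
# of the string; /e evaluates the replacement side as Perl code.
( my $out = $template ) =~ s/$re/limit_string( $1, $2 )/eg;

print "$out\n";
```

Unlike the while loop in the original post, this never rebuilds the whole template string for every tag, although as noted above the matching itself may still dominate the run time.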
Re: Regex speed issue
by GrandFather (Saint) on Aug 30, 2005 at 11:26 UTC

    Could you add some data to your test code so we can see what sort of stuff gives it grief?

    Some monks will be inclined to benchmark various solutions and a representative data set will help them too.


    Perl is Huffman encoded by design.

      Well, the way I'm using it is like <mas:format_date><mas:foo_date></mas:format_date>: first foo_date is replaced by its value, and then the callback function takes the contents and processes them according to preferences.

      Another usage is <mas:limit_string max="50">test test</mas:limit_string> it will limit the string length to 50 characters.

      I'm having trouble using a single regex to match both, though. For some reason a (.*?) before $this->{mask_end} (in this case >) will not match the first example, i.e. when there are no params to pass to the function.
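One possible way to cover both cases with a single pattern is to capture the attribute part with a negated character class instead of (.*?). This is only a sketch under assumptions: the four delimiters below are guessed from the two examples above, tag_re is a hypothetical helper, and attribute values are assumed never to contain a literal >.

```perl
use strict;
use warnings;

# Assumed delimiters, guessed from the two examples in this thread.
my ( $start, $end, $start_close, $end_close ) = ( '<mas:', '>', '</mas:', '>' );

# Hypothetical helper: build the pattern for one tag name.
# ([^>]*) matches zero or more attribute characters and cannot run past
# the end of the opening tag, so it also matches when there are no params.
sub tag_re {
    my ($key) = @_;
    return qr/\Q$start\E\Q$key\E([^>]*)\Q$end\E(.*?)\Q$start_close\E\Q$key\E\Q$end_close\E/is;
}

my @found;
for my $case (
    # The inner <mas:foo_date> is shown already replaced by a sample value.
    [ format_date  => '<mas:format_date>2005-08-30</mas:format_date>' ],
    [ limit_string => '<mas:limit_string max="50">test test</mas:limit_string>' ],
) {
    my ( $key, $text ) = @$case;
    my $re = tag_re($key);
    if ( $text =~ $re ) {
        push @found, [ $key, $1, $2 ];
        print "$key: params=[$1] body=[$2]\n";
    }
}
```

The delimiters are wrapped in \Q...\E here so that they are always treated as literal text, which the original pattern does not do.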

Re: Regex speed issue
by RMGir (Prior) on Aug 30, 2005 at 11:57 UTC
    (Edit: Or you could just do it tlm's way. Much cleaner to use s///eg.)

    $', $`, and $& are all equally bad. Using any of them anywhere in your program will penalize every regex you do.

    I'd suggest trying something like this. As GrandFather mentioned, it would be easier to test with some test data, so this is just a complete WAG.

    my $result = "";
    my $re = qr/
        (.*?)                        # Everything before the first tag
        $this->{mask_start}
        \Q$key\E(.*?)
        $this->{mask_end}
        (.*?)
        $this->{mask_start_close}
        (?=(.*))                     # capture the rest, to get end of template
    /x;
    my $trailingText;
    while ($this->{template} =~ /$re/g) {
        $result .= $1;    # add the stuff between tags to results
        my $params = $this->mask_block_params($2, $3);
        my $html = &$callback($key, $params);
        $result .= $html;
        $trailingText = $4;
    }
    $result .= $trailingText;
    # $result should be what you're looking for. I think :)

    Mike
Re: Regex speed issue
by diotalevi (Canon) on Aug 30, 2005 at 13:24 UTC

    Your code is likely buggy. You've passed read-only values in to $this->mask_block_params(...), and you have no guarantee that $` and $' are still valid by the time you use them. Here's a reformulation which also avoids incurring the $`, $&, $' speed penalties. If this is your only use of $`, $&, and $', you may find a nice speed improvement.

    while ($this->{template} =~ /$re/) {
        # Uses substr and @-/@+ instead of $` and $'. Also is sure to capture
        # those values before executing other potentially regex-using code.
        my ( $prior, $start, $close, $after ) = (
            substr( $this->{template}, 0, $-[0] ),
            $1,
            $2,
            substr( $this->{template}, $+[0] )
        );
        my $params = $this->mask_block_params( $start, $close );
        my $html = &$callback($key, $params);
        $this->{template} = $prior . $html . $after;
    }
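The @-/@+ technique used in that loop can be checked on its own with a small standalone sketch. The sample string and pattern here are made up for illustration and have nothing to do with the poster's templates.

```perl
use strict;
use warnings;

# Made-up sample text, for illustration only.
my $text = 'before <b>bold</b> after';

my ( $prior, $inner, $after );
if ( $text =~ m{<b>(.*?)</b>}s ) {
    # $-[0] is the offset where the whole match starts,
    # $+[0] is the offset just past where it ends.
    $prior = substr( $text, 0, $-[0] );    # same text $` would hold
    $inner = $1;
    $after = substr( $text, $+[0] );       # same text $' would hold
}

print "[$prior] [$inner] [$after]\n";
```

Copying the values into lexicals right after the match is what protects them from any later regex run by a callback.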

      It doesn't hurt to pass ro variables to mask_block_params.

      All the methods mentioned here take the same CPU time. That reminded me that the replacement is not the slow part; the matching is. So I tried getting rid of the HTML-like attributes, which gave a pretty good speed improvement: from 12s to 1s.

      Looks like the second .*? is pretty expensive.

        Replaced the regex with HTML::PullParser, takes 0.3s now.