
When you run $str =~ /(\w+)/, the first thing Perl does (sometimes) to make $1 available is to stash away a copy of the value of $str. $1 then just points to a substring of that copy. This is so that $1's value doesn't get corrupted just because you later modify $str:

$str = "===test===";
if( $str =~ /(\w+)/ ) {
    print "[$str] ($1)\n";
    substr( $str, 5, 2, 'mp' );
    print "[$str] ($1)\n";
}
__END__
[===test===] (test)
[===temp===] (test)

This isn't a big deal in this case (nor in most cases). And there are cases where it is a design choice that makes things more efficient. But it can have a significant impact on performance when you have a 1GB document in a string and then pull out some small part of it with a regex, especially if you do it over and over again:

use strict;
use Benchmark qw< cmpthese >;
my $doc = ' "a string",' x 1_024_000;
cmpthese( -1, {
    copy => sub {
        return $1
            if  $doc =~ /(['"])/;
    },
    substr => sub {
        return substr( $doc, pos($doc)-1, 1 )
            if  $doc =~ /['"]/g;
    },
} );
__END__
           Rate    copy  substr
copy     38.6/s      --   -100%
substr 382480/s 991549%      --

Re^3: regexp - repeatedly delete words expressed in alternation from end of string (speed) notes that this penalty (in modern versions of Perl) never applies when you use /g on your regex. So this shouldn't matter at all for the cases we are talking about here. So thanks for prompting me to dig up the details again!
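One way to check the /g claim on your own build of Perl is to add a /g variant of the capturing case to the benchmark. This is only a sketch: the names %cases and capture_g are made up here, no results are implied, and the numbers will depend heavily on your Perl version.

```perl
use strict;
use warnings;
use Benchmark qw< cmpthese >;

# Same test document as the benchmark above.
my $doc = ' "a string",' x 1_024_000;

# %cases and its keys are illustrative names, not from the thread.
my %cases = (
    # Capturing match without /g: the case where the copy may happen.
    capture => sub {
        return $1
            if  $doc =~ /(['"])/;
    },
    # Same capture with /g in scalar context: each call resumes at
    # pos($doc), and pos($doc) resets after a failed match.
    capture_g => sub {
        return $1
            if  $doc =~ /(['"])/g;
    },
);

cmpthese( -1, \%cases );
```

If the two rates come out close to each other, the copy penalty isn't what is costing you on that Perl.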

I had done some incremental changes to JSON::Tiny and some benchmarks. And I saw that removing some grouping parens did actually make JSON::Tiny a little bit faster. But I was wrong about the underlying reason.

This also means that capturing parens isn't the reason that Parse::RecDescent has been considered significantly slower than it should be. I recall having heard that Parse::RecDescent was significantly slower than it could be because it ends up copying the whole document being parsed over and over. But that memory might be wrong. Another way that parsers can waste a lot of time copying text over and over is the classic mistake of repeatedly removing matched tokens from the front of the document string:

    if( $doc =~ s/^\s*(['"])// ) {
        parse_string($1);
    }

But I don't know whether Parse::RecDescent suffers from that particular problem much or not. And it might not suffer from any significant performance problems at all at this point.
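For contrast with the s/^...// pattern, here is a minimal sketch of the usual alternative: leave the document string alone and walk it with \G and /gc, so that pos() tracks where you are. The tokenize() function and its token names are hypothetical, made up for illustration; they are not from JSON::Tiny or Parse::RecDescent, and the string handling ignores escapes.

```perl
use strict;
use warnings;

# Hypothetical tokenizer: scans $doc in place via pos() rather than
# deleting each matched token from the front of the string.
sub tokenize {
    my( $doc ) = @_;    # one copy up front, none per token
    my @tokens;
    pos($doc) = 0;
    while( pos($doc) < length($doc) ) {
        # /c keeps pos() where it was when an alternative fails to match
        if(      $doc =~ /\G\s+/gc ) {
            next;                               # skip whitespace
        } elsif( $doc =~ /\G(['"])(.*?)\1/gcs ) {
            push @tokens, [ string => $2 ];     # no escape handling
        } elsif( $doc =~ /\G,/gc ) {
            push @tokens, [ comma => ',' ];
        } else {
            die "Unexpected character at offset ", pos($doc), "\n";
        }
    }
    return @tokens;
}

print "$_->[0]: $_->[1]\n"
    for tokenize( q{ "a string", 'another',} );
```

Every match here uses /g, so it should also stay clear of the copy penalty described above, and no match ever forces the remainder of the document to be shifted or rebuilt.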

- tye

In reply to Re^3: How would you parse this? (oops) by tye
in thread How would you parse this? by BrowserUk
