Hello, fine Monks. I have been playing around with parsing some code, and would like to see if anyone has any comments. First, let's pretend I am parsing a language suspiciously similar to Perl. We all know that only perl can parse Perl, so I'm not actually parsing Perl. Honestly.

I'm trying to wrap an extra block around any subroutine definition. The actual problem is far less interesting to me than trying to learn how to solve it more elegantly. I have two solutions that I've come up with. One uses Text::Balanced, the other doesn't. I'm heavily leaning towards using the T::B one, for all the obvious reasons, but I was curious if anyone has any suggestions for either.

Update: I should probably mention that comments and strings will be handled before this code runs, and as such I can safely assume they do not exist.

Here's the Text::Balanced one:

use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

my $code = "sub blah {\n {}\n {}\n {}\n {}\n}
and some other stuff
sub another {\n}";

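# Insert from the last offset backwards: each insertion shifts everything
# after it, so going forwards would invalidate the remaining offsets.
for my $end (reverse find_ends($code)) {
    $code = substr($code, 0, $end)
          . "}"
          . substr($code, $end);
}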
$code =~ s/sub/{sub/g;

print "$code\n";

sub find_ends {
    local $_ = $_[0];

    my @ends;
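    # The lookahead leaves pos() just before the opening brace.
    while (/(sub\s+\S+\s*)(?=\{)/g) {
        # List context: $_ is left intact and pos() moves past the block.
        () = extract_bracketed($_, '{}');
        push @ends, pos;
    }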

    return @ends;
}

I think the find_ends subroutine is ok, but the concatenation with two substrs kind of bothers me. Is there a nicer way to do that? I thought about using substr as an lvalue, but I don't know of any way to insert new text without overwriting existing text.
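
For what it's worth, substr can insert without overwriting when the replaced length is zero, either through its four-argument form or as an lvalue. A minimal sketch, assuming a hypothetical helper named insert_at that is not part of the code above; it applies the offsets from last to first so the earlier ones stay valid:

```perl
use strict;
use warnings;

# Hypothetical helper (name invented for illustration): insert $insert
# into $text at each of @offsets, which refer to the original string.
sub insert_at {
    my ($text, $insert, @offsets) = @_;

    # Highest offset first, so earlier offsets are not shifted.
    for my $off (sort { $b <=> $a } @offsets) {
        # Four-argument substr: replace zero characters at $off.
        substr($text, $off, 0, $insert);
        # Equivalent lvalue form: substr($text, $off, 0) = $insert;
    }

    return $text;
}

print insert_at("sub a {\n}\nsub b {\n}", "}", 9, 19), "\n";
```

With something like this, the loop over find_ends would shrink to a single call, at the cost of one more small subroutine.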

Now, here's my homebrew version. All the code is the same as above, except the find_ends subroutine:

sub find_ends {
    local $_ = $_[0];

    my @ends;
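    # Grab the header and everything up to the first closing brace.
    while (/(sub\s+\S+\s*\{[^}]*\})/g) {
        my $sub = $1;

        my $end;
        while ($sub) {
            # tr/// in scalar context counts the matching characters.
            my $open  = $sub =~ tr/{/{/;
            my $close = $sub =~ tr/}/}/;

            # Still unbalanced: extend to the next closing brace.
            # /c keeps pos() on failure so the outer loop cannot restart.
            if ($open > $close and /\G([^}]*\})/gc) {
                $sub .= $1;
                $end = $+[0];
            } else {
                $end = $+[0];
                last;
            }
        }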

        push @ends, $end;
    }

    return @ends;
}

There's a lot I don't like about this code. It's much longer, and seems kind of clunky. I have a feeling some things could be cleaned up, but I'm just not in the right frame of mind to do so. There is one thing I do like about it, though: it's roughly five times as fast as the Text::Balanced version (about 29,900 versus 5,600 runs per second in my benchmark). I suppose that's to be expected, and it doesn't really matter much (the T::B version is still pretty fast), but at least it's something.

For anyone who's interested, here are the benchmark results I got (rd is my homebrew version, tb is the Text::Balanced one):

Benchmark: running rd, tb for at least 10 CPU seconds...
        rd:  9 wallclock secs (10.36 usr +  0.01 sys = 10.37 CPU) @ 29908.68/s (n=310153)
        tb: 11 wallclock secs (10.55 usr +  0.01 sys = 10.56 CPU) @ 5602.84/s (n=59166)
      Rate   tb   rd
tb  5603/s   -- -81%
rd 29909/s 434%   --
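
A possible tidier take on the homebrew approach, as a sketch only: find each sub header, then step from brace to brace while tracking the nesting depth. The name find_ends_by_depth is made up for illustration, it has not been benchmarked against the versions above, and it relies on the same assumption that comments and strings are already gone:

```perl
use strict;
use warnings;

# Hypothetical alternative to find_ends: one scanner, one depth counter.
sub find_ends_by_depth {
    my ($text) = @_;

    my @ends;
    # The lookahead leaves pos($text) just before the opening brace.
    while ($text =~ /\bsub\s+\S+\s*(?=\{)/g) {
        my $depth = 0;

        # Step to the next brace; /c keeps pos() if no brace is left.
        while ($text =~ /\G[^{}]*([{}])/gc) {
            $depth += $1 eq '{' ? 1 : -1;
            if ($depth == 0) {
                push @ends, pos($text);   # offset just past the closing brace
                last;
            }
        }

        # Unbalanced input: give up instead of scanning again.
        last if $depth != 0;
    }

    return @ends;
}

print join(",", find_ends_by_depth("sub a { {} }\nx\nsub b {\n}")), "\n";
```

The offsets it returns have the same meaning as those from find_ends, so the rest of the code would not need to change.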

If anyone has any suggestions, I'm all ears. I'm open to using a different algorithm altogether, if it would be better. Perhaps someone can surprise me with a single regex solution (even though I probably wouldn't use it, I would still be pleasantly surprised).
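
On the single-regex idea, here is a hedged sketch using a recursive named group, which needs Perl 5.10 or later. The names $sub_re, block, and wrap_subs are invented for this example, and it leans on the same assumption that comments and strings have already been dealt with:

```perl
use strict;
use warnings;

# Hypothetical pattern: a named sub followed by one balanced brace block.
my $sub_re = qr/
    \b sub \s+ \S+ \s*          # the keyword, a name, optional whitespace
    (?<block>                   # one balanced block
        \{
        (?: [^{}]++             # a run of non-brace characters
          | (?&block)           # or a nested block, matched recursively
        )*
        \}
    )
/x;

# Wrap every matched definition in an extra pair of braces.
sub wrap_subs {
    my ($code) = @_;
    $code =~ s/($sub_re)/{$1}/g;
    return $code;
}

print wrap_subs("sub blah {\n {}\n}\nand some other stuff\nsub another {\n}"), "\n";
```

This does the finding and the wrapping in one substitution, so no offsets need to be tracked at all. Whether it is easier to maintain than the Text::Balanced version is a matter of taste.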

Thanks!

In reply to parsing code, finding block boundaries by revdiablo
