parsing code, finding block boundaries

Hello, fine Monks. I have been playing around with parsing some code, and would like to see if anyone has any comments. First, let's pretend I am parsing a language suspiciously similar to Perl. We all know that only perl can parse Perl, so I'm not actually parsing Perl. Honestly.

I'm trying to wrap an extra block around any subroutine definition. The actual problem is far less interesting to me than trying to learn how to solve it more elegantly. I have two solutions that I've come up with. One uses Text::Balanced, the other doesn't. I'm heavily leaning towards using the T::B one, for all the obvious reasons, but I was curious if anyone has any suggestions for either.

Update: I should probably mention that comments and strings will be handled before this code runs, and as such I can safely assume they do not exist.

Here's the Text::Balanced one:

use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

my $code = "sub blah {\n {}\n {}\n {}\n {}\n}
and some other stuff
sub another {\n}";

for my $end (find_ends($code)) {
    $code = substr($code, 0, $end)
          . "}"
          . substr($code, $end);
}
$code =~ s/sub/{sub/g;

print "$code\n";

sub find_ends {
    local $_ = $_[0];

    my @ends;
    while (/(sub\s+\S+\s*)(?={)/g) {
        () = extract_bracketed($_, '{}');
        push @ends, pos;
    }

    return @ends;
}
[download]

I think the find_ends subroutine is ok, but the concatenation with two substrs kind of bothers me. Is there a nicer way to do that? I thought about using substr as an lvalue, but I don't know of any way to insert new text without overwriting existing text.

Now, here's my homebrew version. All the code is the same as above, except the find_ends subroutine:

sub find_ends {
    local $_ = $_[0];

    my @ends;
    while (/(sub\s+\S+\s*{[^}]*})/g) {
        my $sub = $1;

        my $end;
        while ($sub) {
            my $open  = $sub =~ tr/{/{/;
            my $close = $sub =~ tr/}/}/;

            if ($open > $close and /\G([^}]*})/g) {
                $sub .= $1;
                $end = $+[0];
            } else {
                $end = $+[0];
                last;
            }
        }

        push @ends, $end;
    }

    return @ends;
}
[download]

There's a lot I don't like about this code. It's much longer, and seems kind of clunky. I have a feeling there are some things that could be cleaned up, but I'm just not in the right frame of mind to do so. There is one thing I do like about this code, though; it's far faster than the version using Text::Balanced. I suppose that's to be expected, and doesn't really matter too much (because the T::B version is still pretty fast), but at least it's something.

For anyone who's interested, here are the benchmark results I got:

Benchmark: running rd, tb for at least 10 CPU seconds...
        rd:  9 wallclock secs (10.36 usr +  0.01 sys = 10.37 CPU) @ 29
+908.68/s (n=310153)
        tb: 11 wallclock secs (10.55 usr +  0.01 sys = 10.56 CPU) @ 56
+02.84/s (n=59166)
      Rate   tb   rd
tb  5603/s   -- -81%
rd 29909/s 434%   --
[download]

If anyone has any suggestions, I'm all ears. I'm open to using a different algorithm altogether, if it would be better. Perhaps someone can surprise me with a single regex solution (even though I probably wouldn't use it, I would still be pleasantly surprised).

Thanks!

Comment on parsing code, finding block boundaries Select or Download Code

Replies are listed 'Best First'.
Re: parsing code, finding block boundaries by jryan (Vicar) on Aug 05, 2004 at 21:36 UTC
As far as dealing with extracting code, both of your routines are broken; neither deal with end-curlies in quotes. For instance, both routines will bomb on: `sub blah { return '}'; }` [download] You can use `Text::Balanced::extract_codeblock($code, '{}');` instead of `extract_bracketed($code, '{}') to handle this, but even that will stil +l bomb on curlies in comments: </p> <code> sub blah { # } is such a neat character return '}'; }` [download] If you want to handle this case, but don't want the complexity of writing a real parser, then you can try this hack instead: $code = blockize($code); sub blockize { my ($code) = @_; # describe a block: use re 'eval'; my $comment = qr/#[^\n]/; my $single_quoted = qr/' [^\\']+ (?: \\. [^\\']+ ) '/; my $double_quoted = qr/" [^\\"]+ (?: \\. [^\\"]+ )* "/; my $block; $block = qr/ { (?: (?> [^#'"{}]+ ) \| $comment \| $single_quoted \| $double_quoted \| (??{$block}) )* } /x; # now that we have a block definition set up, its just a simple su +bstitution $code =~ s/sub \s+ (\w+) \s+ ($block)/ format_code("$1", "$2") /ge +x; return $code; sub format_code { my ($name, $block) = @_; $block =~ s/\n/\n /g; return "{\n sub $name $block\n}"; } } [download] It will even format the added block nicely :) Hopefully your perl-like language doesn't have pod; that's a whole 'nother beast. Read more... (1190 Bytes)	[reply] [d/l] [select]
Re^2: parsing code, finding block boundaries by revdiablo (Prior) on Aug 05, 2004 at 22:20 UTC
I appreciate the response. I'll go over it in more detail later, but I should have mentioned earlier that comments and quotes are not a problem. They will be taken care of before this code runs.	[reply]
Re: parsing code, finding block boundaries by mirod (Canon) on Aug 05, 2004 at 20:32 UTC
You seem to be more interested in playing with regexps than in really solving the problem, but may I interest you in an alternate way, a fun way in its own right, where the parsing has already been done. A way that might even work reliably? Look for modules that work on the parsed version, in the `B::` namespace, such as B::Deparse. And of course Hook::LexWrap will probably just do what you want...	[reply]
•Re: parsing code, finding block boundaries by merlyn (Sage) on Aug 05, 2004 at 22:38 UTC
To parse languages that are very similar to Perl, you might want to look at the rapidly evolving PPI, the pseudoperl parser. -- Randal L. Schwartz, Perl hacker Be sure to read my standard disclaimer if this is a reply.	[reply]
Re: parsing code, finding block boundaries by duff (Parson) on Aug 05, 2004 at 21:48 UTC
I thought about using substr as an lvalue, but I don't know of any way to insert new text without overwriting existing text. The obvious way works ... `substr($code, $end, 0) = "}"; # -OR- substr($code, $end, 0, "}");` [download] duff	[reply] [d/l]