PerlMonks  

Nesting regexen

by colink (Novice)
on Jul 11, 2005 at 04:55 UTC ( [id://473821]=perlquestion )

colink has asked for the wisdom of the Perl Monks concerning the following question:

I needed to reformat a bunch of Perl code, so I began writing this as an exercise to try and learn regular expressions better, but I've gotten stuck and I'm not sure why.

It's a very lightweight Perl parser that handles just a subset of Perl.

Here's the code:

#!/usr/bin/perl -w
use strict;

my $bareword   = qr/(\w+)/;
my $quotelike  = qr/((['"]).+?\2)/;
my $subscript  = qr/([\[{]\w+[\]\}])/;
my $variable   = qr/(\$\w+($subscript)*)/;
my $sub_arg    = qr/($quotelike|$variable|$bareword)/;
my $sub_args   = qr/($sub_arg,)*($sub_arg)/;
my $subroutine = qr/((\w+::)*\w+\($sub_args)/;

while (<>) {
    my @args = ();
    my @labels = ();
    my ($spacer, $obj, $method) = m/^(\s+)(\$\w+->)(\w+)\(/gc;
    LOOP: {
        push(@args, $1), redo LOOP if m/\G$quotelike,?\s*/gc;
        push(@args, $1), redo LOOP if m/\G$variable,?\s*/gc;
        push(@args, $1), redo LOOP if m/\G$subroutine,?\s*/gc;
        push(@args, $1), redo LOOP if m/\G$bareword,?\s*/gc;
    }
    @labels = ($method eq "hidden")
        ? qw(name value)
        : ##integer, text
          qw(name label value maxlength extras subtext size uiLevel defaultValue hoverHelp)
        ;
    print join '', $spacer, $obj, $method, "(\n";
    for (my $index = 0; $index < @args; ++$index) {
        print join '', "\t\t-", $labels[$index], ' => ', $args[$index], ",\n";
    }
    print join '', $spacer, ");\n";
}

This is the line I'm trying to parse:

$f->readOnly($session{form}{cid},WebGUI::International::get(469,"WebGUIProfile"));

And this is the formatted output, incorrect:

$f->readOnly(
        -name => $session{form}{cid},
        -label => WebGUI::International::get(469,
        -value => "WebGUIProfile",
);

The $sub_args regex is only matching the second set of parentheses, and I have no idea why. Can anyone clue me in?

Replies are listed 'Best First'.
Re: Nesting regexen
by nothingmuch (Priest) on Jul 11, 2005 at 07:25 UTC
    Regular expressions are not parsers.

    A regular expression can refer to itself, but that is a kludgy hack. To do that, use the (??{ ... }) construct inside a regex.
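For illustration, here is a minimal sketch of that self-referencing form, following the balanced-parentheses pattern described in perlre. The variable names and the sample line are assumptions made for this example, not code from the thread, and the pattern deliberately ignores parentheses that appear inside quoted strings.

```perl
use strict;
use warnings;

# Matches one parenthesised group, including any groups nested inside it.
# $balanced must be declared before the assignment so the code block can see it.
my $balanced;
$balanced = qr{
    \(                      # opening paren
    (?:
        (?> [^()]+ )        # a run of non-paren characters, without backtracking
      |
        (??{ $balanced })   # a nested group: re-enter this same pattern
    )*
    \)                      # the matching closing paren
}x;

# Sample input with calls nested two levels deep (hypothetical).
my $line = 'WebGUI::International::get(469,foo("WebGUIProfile"))';

# Capture a (possibly package-qualified) name plus its balanced argument list.
my ($call) = $line =~ /(\w+(?:::\w+)*$balanced)/;
print "$call\n";
```

Because the pattern re-enters itself for every nested group, the depth is not fixed in advance, which is what the hand-nested qr// approach in the question cannot express.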

    There are better solutions though.

    If you want to clean up your perl code, perltidy helps. If you want to do custom transformations, PPI is well suited for the job.

    Larry Wall is working on the perl 5 to perl 6 converter, and he's using the parsing code from perl 5 itself to emit a canonical format... When he makes a release that might be useful.

    If you want to parse it on your own, try looking at Parse::RecDescent.

    Back to your problem - since perl's grammar is recursive: (... ( sub expression ) ...) you have to keep track of parenthesis balancing. You need to use some kind of stack structure (an explicit one or the call stack) to nibble paren tokens, and construct nested subexpressions. Once you can build that much you need to reserialize your structures back. Since you are only going two levels deep, this will be a problem soon.
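The explicit-stack idea can be sketched roughly as follows. This is only an illustration under stated assumptions: split_top_level_args and %closer_for are hypothetical names invented here, the tokens are found with \G and /gc as in the original post, and only (), [], {} and simple quoted strings are tracked. It is nowhere near a real Perl parser.

```perl
use strict;
use warnings;

my %closer_for = ('(' => ')', '[' => ']', '{' => '}');

# Hypothetical helper: split an argument string on commas that are not
# inside brackets or quoted strings. A stack of expected closing brackets
# records how deeply nested the current position is.
sub split_top_level_args {
    my ($text) = @_;
    my @args = ('');
    my @stack;

    while (1) {
        if ($text =~ /\G("(?:[^"\\]|\\.)*"|'(?:[^'\\]|\\.)*')/gc) {
            $args[-1] .= $1;                      # whole quoted string
        }
        elsif ($text =~ /\G([(\[{])/gc) {
            push @stack, $closer_for{$1};         # remember the expected closer
            $args[-1] .= $1;
        }
        elsif ($text =~ /\G([)\]}])/gc) {
            my $got = $1;
            die "unbalanced '$got'\n" unless @stack && pop(@stack) eq $got;
            $args[-1] .= $got;
        }
        elsif ($text =~ /\G,/gc) {
            if (@stack) { $args[-1] .= ',' }      # comma inside a nested call
            else        { push @args, '' }        # top-level comma: next argument
        }
        elsif ($text =~ /\G([^"'()\[\]{},]+)/gc) {
            $args[-1] .= $1;                      # ordinary text
        }
        elsif ($text =~ /\G(.)/gcs) {
            $args[-1] .= $1;                      # stray character, e.g. a lone quote
        }
        else { last }                             # end of input
    }
    die "unclosed bracket\n" if @stack;
    s/^\s+|\s+$//g for @args;                     # trim surrounding whitespace
    return @args;
}

my @args = split_top_level_args(
    '$session{form}{cid}, WebGUI::International::get(469,"WebGUIProfile")'
);
print "$_\n" for @args;
```

The regexes only recognise tokens; the nesting state lives in @stack, so the comma inside get(...) stays with its argument no matter how deep the call is.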

    You can use regexes to find tokens, but the notion of state must be maintained, and this is not taken into account in your code.

    Your subroutine regex is the problem there, btw. It doesn't match a closing paren, so the collected string ends in "WebGUIProfile", and stops there. A hard coded paren for the method call is printed, and that's where it ends.

    -nuffin
    zz zZ Z Z #!perl
Re: Nesting regexen
by djp (Hermit) on Jul 11, 2005 at 05:48 UTC
    Without looking at your actual problem, and recognizing that you're doing this as an exercise, can I nonetheless recommend Perl::Tidy to reformat your code, leaving you free for other more useful things? Parsing Perl is notoriously difficult, most say impossible.
      I should have better described what the code needs to do. It changes subroutine calls from positional parameters to hash-based (named) parameters. Perl::Tidy won't take care of that.
Re: Nesting regexen
by Daedalus207 (Novice) on Jul 11, 2005 at 08:05 UTC
    I'm not sure exactly what you want for output, but the problem that you identified, having to do with $sub_args, seems to be associated with $quotelike. Try using
    qr/(["'])($bareword)*(["'])/
    instead of
    qr/((['"]).+?\2)/
    The output with that change is:
    $f->readOnly(
            -name => $session{form}{cid},
            -label => WebGUI::International::get(469,"WebGUIProfile",
    );
    That way, the quoted portion is recognized as part of $subroutine.

      I see (after staring at your post for 20 minutes). The problem is with the backreference \2. It only refers to capture group 2 when the pattern isn't embedded inside another regex.
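The renumbering can be seen in a small experiment. The sketch below reuses $quotelike from the question; the other variable names are made up for this example, and the relative backreference \g{-1} shown as one possible workaround assumes Perl 5.10 or later (it is not something proposed in the thread).

```perl
use strict;
use warnings;

# The quote pattern from the question: \2 is meant to be the opening quote.
my $quotelike = qr/((['"]).+?\2)/;
my $string    = q{"WebGUIProfile"};

# Used on its own, group 2 really is the opening quote, so this matches.
my $alone = $string =~ /^$quotelike$/ ? 1 : 0;

# Wrapped in one more pair of capturing parens, every group inside shifts
# by one: the quote character is now group 3, and \2 points at a different
# group (here the still-open group around the quoted string), so it fails.
my $embedded = $string =~ /^($quotelike)$/ ? 1 : 0;

# A relative backreference counts backwards from where it appears, so it
# keeps pointing at the quote character wherever the pattern is embedded.
my $relative = qr/((['"]).+?\g{-1})/;
my $fixed    = $string =~ /^($relative)$/ ? 1 : 0;

print "alone=$alone embedded=$embedded fixed=$fixed\n";
```

Capture groups are numbered by their opening parenthesis across the whole compiled pattern, so any absolute \N inside a qr// is only correct for one particular embedding.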

      Thank you very much!

Re: Nesting regexen
by tphyahoo (Vicar) on Jul 11, 2005 at 08:54 UTC
    Strongly agree with nothingmuch. This is tilting at windmills, and nothing really useful is likely to come of it, probably not even with Parse::RecDescent.

    As merlyn explained, even perl6's "regexes-on-steroids" rules (based largely on P::RD) won't be able to parse perl: Re: Perl not BNF-able??.

Re: Nesting regexen
by BrowserUk (Patriarch) on Jul 11, 2005 at 08:32 UTC

    In addition to Daedalus207's fix, you are also missing a close paren literal in your subroutine regex:

    my $subroutine = qr/((\w+::)*\w+\($sub_args\))/;
    #                                          ^^

    You might also want to add some optional whitespace at strategic points unless you are stripping this first.

    Your regex would be easier to read and maintain if you used /x.

    my $subroutine = qr/ ( (\w+::)* \w+ \( $sub_args \) ) /x;
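Putting the suggestions from this thread together, one possible shape for the patterns is sketched below. It is an assumption-laden illustration rather than the poster's final code: the inner groups are made non-capturing, the quote check uses a relative backreference (\g{-1}, Perl 5.10+) in place of \2, and optional whitespace is allowed around commas and parentheses. It still handles only one level of call, so an argument cannot itself be a subroutine call.

```perl
use strict;
use warnings;

my $bareword   = qr/ \w+ /x;
my $quotelike  = qr/ (['"]) .+? \g{-1} /x;            # quote, body, same quote
my $subscript  = qr/ [\[{] \w+ [\]}] /x;
my $variable   = qr/ \$ \w+ (?: $subscript )* /x;
my $sub_arg    = qr/ $quotelike | $variable | $bareword /x;
my $sub_args   = qr/ (?: $sub_arg \s* , \s* )* $sub_arg /x;
my $subroutine = qr/ (?: \w+ :: )* \w+ \( \s* $sub_args \s* \) /x;

# The nested call from the question, plus a hypothetical variant with spaces.
for my $call (
    'WebGUI::International::get(469,"WebGUIProfile")',
    q{WebGUI::International::get( 469, 'WebGUIProfile' )},
) {
    my $ok = $call =~ /^($subroutine)$/ ? 'matches' : 'does not match';
    print "$call: $ok\n";
}
```

With /x each piece can sit on its own line with a comment, and because only the quote character is captured, embedding these patterns in a larger one does not disturb the backreference.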

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
      The code that I posted is hacked up as the result of about two hours of debugging. The original used /x and matched the closing paren.
Re: Nesting regexen
by bronson (Initiate) on Jul 11, 2005 at 06:13 UTC
    You're more likely to get an answer if you ask a more manageable question. As it is, it sounds like you're asking somebody else to do your debugging for you.
