Parse RecDescent Nesting (Followup)

tphyahoo has asked for the wisdom of the Perl Monks concerning the following question:

Monks, this is a follow-up to Parse::RecDescent and nesting, where b10m wanted to parse a bit of text with nested, sort-of HTML-like tags. I came up with a partial solution at Parse RecDescent nesting solved (95%). On second thought, though, I'm not sure my 95% was justified, because getting the last 5% is tripping me up. (Note: all code snippets that follow were copied from my post.)

My input is

__DATA__
<open> tag 1 <close> 
notatag 1
<open> tag 2 <close>
notatag 2
<open> tag 3 a <open> tag 3 b <close> tag 3 c <close>
notatag 3
<open> 
    tag 4 a 
    <open> tag 4 b <close> 
    <open> 
        tag 4 c 
        <open> tag 4 d <close>
        tag 4 e 
    <close> tag 4 f
<close>
notatag 4

Output is

#tag: <open> tag 1  <close>
        #tag: <open> tag 2  <close>
        #tag: <open> tag 3 b  <close>
        #tag: <open> tag 3 a  ARRAY(0x1a55ad8) tag 3 c  <close>
        #tag: <open> tag 4 b  <close>
        #tag: <open> tag 4 d  <close>
        #tag: <open> tag 4 c 
        #         ARRAY(0x1a8f664) tag 4 e 
        #     <close>
        #tag: <open> tag 4 a 
        #     ARRAY(0x1a8f670) tag 4 f
        # <close>

That first ugly hexref should be "<open> tag 3 b <close>", the second should be

        <open> tag 4 d <close>
        tag 4 e

and so on. Oh, my grammar was

q(    start: chunk(s)
        chunk: tag | raw
        tag  : stag raw etag
        { 
        print "tag: $item[1] $item[2] $item[3]\n";
        #$return = "$item{stag} $item{raw} $item{etag}" 
        $return = "$item[1] $item[2] $item[3]";
        }
        | stag raw tag(s) raw etag
        { 
        print "tag: $item[1] $item[2] $item[3] $item[4] $item[5]\n";
        #$return = "$item{stag} $item{raw} $item{etag}" 
        $return = "$item[1] $item[2] $item[3] $item[4] $item[5]"; 
        }

Short rant. I know the "right" way to parse HTML (which is what I really want to do, and, I suspect, also what b10m really wants) is to learn HTML::TokeParser/HTML::TreeBuilder, or some combination of the other HTML:: modules. But I am hesitant to do that, or at any rate to do *only* that, for more or less the reasons laid out by BrowserUk at Being a heretic and going against the party line, which was a followup to Parsing HTML tags with regex.

Maybe I'm the only monk that feels this way, but my gut tells me that parsing HTML should *feel* similar to the usual parsing I do with regexes: repeats should be done with * or +, groupings should be extracted with (), etc. I don't *want* to learn a whole 'nother parsing notation that only works in one particular context (HTML). And, with Perl 6 "super-regexes" (aka rules), this may actually be feasible. That comes sometime in the future, but what we have now, regexes plus Parse::RecDescent, should feel similar to what we ultimately wind up with when Perl 6 finally goes live (Damian Conway, in charge of Perl 6 rules, adopted a lot of stuff from the existing Parse::RecDescent).

Googling wasn't all that helpful.

So... can some Parse::RecDescent master show me where I've been going wrong?

Thanks!

thomas


Replies are listed 'Best First'.
Re: Parse RecDescent Nesting (Followup) by merlyn (Sage) on Jan 21, 2005 at 13:23 UTC
I'm doing this quickly, as I'm rushing off to the airport for a cross-US trip, but it looks like your grammar is something like:

    document: chunk(s?) /\z/
    chunk: open chunk close | word(s)
    open: "<open>"
    close: "<close>"
    word: /\w+/

That should be enough to hang the proper return values from.

-- Randal L. Schwartz, Perl hacker
Be sure to read my standard disclaimer if this is a reply.

update:

    document: stuff /\z/ { $return = $item{stuff} }
    stuff: chunk(s?) { $return = [ map {@$_} @{$item[1]} ] } # flatten adjacent arrayrefs
    chunk: open stuff close { $return = $item{stuff} }
         | word(s?) # returns arrayref
    open: "<open>"
    close: "<close>"
    word: /\w+/
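A runnable sketch of the same idea, assuming Parse::RecDescent is installed. It is not merlyn's grammar verbatim: the rule names `stag` and `etag` are borrowed from the original post, each open-close pair is kept as its own nested array reference instead of being flattened, and the helper `dump_tree` is made up for illustration. The actions assign to `$return`, which is the variable Parse::RecDescent reads to get a rule's value.

```perl
use strict;
use warnings;
use Parse::RecDescent;

# Sketch only: rule names and tree shape are illustrative choices.
my $grammar = q{
    document : chunk(s?) /\z/        { $return = $item[1] }
    chunk    : stag chunk(s?) etag   { $return = $item[2] }
             | word
    stag     : '<open>'
    etag     : '<close>'
    word     : /\w+/
};

my $parser = Parse::RecDescent->new($grammar)
    or die "Bad grammar\n";

# Each <open>...<close> pair becomes one nested array reference.
my $tree = $parser->document('<open> a <open> b <close> c <close> d');
defined $tree or die "Input did not parse\n";

# Hypothetical helper: print words indented by nesting depth.
sub dump_tree {
    my ($node, $depth) = @_;
    for my $child (@$node) {
        if (ref $child) { dump_tree($child, $depth + 1) }
        else            { print '  ' x $depth, $child, "\n" }
    }
}
dump_tree($tree, 0);
```

Because `chunk` refers to itself through `chunk(s?)`, text and nested tags can be mixed freely between an opening and a closing tag.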
Re^2: Parse RecDescent Nesting (Followup) by tphyahoo (Vicar) on Jan 21, 2005 at 15:53 UTC
Thanks again, Randal (for the update). The stuff to flatten the arrayrefs looks promising, and I am trying to integrate that into what I have. The updated grammar seemed dodgy to me though, because it had a circular reference -- "stuff" has "chunks", and "chunks" have "stuff". Or are circular references allowed/desirable/cool in a grammar specification? Trying to understand what this was doing made my brain swim. Best, thomas. Note: this was my second reply to Randal.
Re^3: Parse RecDescent Nesting (Followup) by merlyn (Sage) on Jan 21, 2005 at 23:55 UTC
Circular references are what permit nested open-close pairs. I wrote it the way I did so that the grammar was never ambiguous, and yet permits empty open-close pairs, or empty adjacent opens or closes. Allowing "empty" can get you into trouble in a grammar. And that fixes the problem from before.
Re^3: Parse RecDescent Nesting (Followup) by bart (Canon) on Jan 22, 2005 at 13:07 UTC
The "circular references", or, as they're more commonly known, "recursive definitions", are precisely what differentiates a grammar from a flat, regular pattern, and the thing Perl's regex engine can't properly handle, making a parser like Parse::RecDescent useful.
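To make the recursion concrete without any module, here is a minimal hand-written recursive-descent sketch in core Perl. The sub names `tokenize`, `parse_stuff`, `parse_chunk`, and `parse_document` are hypothetical, chosen to mirror the `stuff` and `chunk` rules in merlyn's update; `parse_stuff` calls `parse_chunk`, which calls `parse_stuff` again, and that mutual call is the "circular reference" in question. (As an aside, Perl 5.10 and later do offer recursive regex constructs such as `(?R)`, which postdate this thread.)

```perl
use strict;
use warnings;

# Split the input into <open>, <close>, and word tokens.
sub tokenize {
    my ($text) = @_;
    return [ $text =~ /(<open>|<close>|\w+)/g ];
}

# stuff: chunk(s?)  -- collect chunks until <close> or end of input.
sub parse_stuff {
    my ($tokens) = @_;
    my @nodes;
    while (@$tokens && $tokens->[0] ne '<close>') {
        push @nodes, parse_chunk($tokens);
    }
    return \@nodes;
}

# chunk: <open> stuff <close> | word
sub parse_chunk {
    my ($tokens) = @_;
    my $tok = shift @$tokens;
    return $tok if $tok ne '<open>';      # a plain word
    my $inner = parse_stuff($tokens);     # the recursion happens here
    my $close = shift @$tokens;
    die "Expected <close>\n"
        unless defined $close && $close eq '<close>';
    return $inner;
}

# document: stuff, followed by end of input.
sub parse_document {
    my ($text) = @_;
    my $tokens = tokenize($text);
    my $tree   = parse_stuff($tokens);
    die "Unexpected <close>\n" if @$tokens;
    return $tree;
}

my $tree = parse_document('<open> a <open> b <close> c <close> d');
print scalar(@$tree), " top-level nodes\n";
```

Each nesting level of the input corresponds to one extra level of sub calls, which is why arbitrary depth needs no special handling.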
Re: Parse RecDescent Nesting (Followup) by tphyahoo (Vicar) on Jan 21, 2005 at 13:48 UTC
Thanks, Randal. I think what you suggested doesn't allow for something like

    <open>stuff <open>stuff <close><close>

although it would allow

    <open><open>stuff <close><close>

Or maybe I misunderstood. At any rate, it's food for thought and I will ponder on it (and report back if you were right and I was wrong...). Have a nice trip!

thomas.

Note for clarity: this was my first reply to Randal (before his update), though it appears here lower in the thread.
Re: Parse RecDescent Nesting (Followup) by tphyahoo (Vicar) on Jan 25, 2005 at 10:19 UTC
Just found this. From the Parse::RecDescent FAQ:

CAPTURING MATCHES

Hey! I'm getting back ARRAY(0x355300) instead of what I set $return to!

Here's a prime example of when this mistake is made:

    QuotedText: DoubleQuote TextChar(s?) DoubleQuote
        { my $chars = scalar(@item) - 1;
          $return = join ('', @item[2..$chars]) }

This rule is incorrectly written. The author thinks that @item will have one TextChar from position 2 until all TextChars are matched. However, the true structure of @item is:

position one: the string matched by rule DoubleQuote
position two: array reference representing parse tree for TextChar(s?)
position three: the string matched by rule DoubleQuote

Note that position two is an array reference. So the rule must be rewritten in this way.

    QuotedText: DoubleQuote TextChar(s?) DoubleQuote
        { $return = join ( '', @{$item[2]} ) }

Further recommended reading. When to use Text::Balanced or Regexp::Common::balanced rather than P::RD: Parse::RecDescent
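Applied to the grammar in the original post, the same mistake explains the ARRAY(0x...) output: `tag(s)` in the second production comes back as one array reference in `$item[3]`, and interpolating it into a string prints the reference itself. A minimal core-Perl sketch of the effect, with a hand-built `@item` standing in for what the parser would supply (the values are made up to match the "tag 3" line of the input):

```perl
use strict;
use warnings;

# Hand-built stand-in for @item in the production
#   stag raw tag(s) raw etag
# $item[0] holds the rule name; tag(s) arrives as ONE array reference.
my @item = (
    'tag',
    '<open>',
    'tag 3 a',
    [ '<open> tag 3 b <close>' ],   # what tag(s) hands back
    'tag 3 c',
    '<close>',
);

# Interpolating the reference directly stringifies it as ARRAY(0x...).
my $broken = "$item[1] $item[2] $item[3] $item[4] $item[5]";

# Dereferencing first yields the text of the nested tags instead.
my $nested = join ' ', @{ $item[3] };
my $fixed  = "$item[1] $item[2] $nested $item[4] $item[5]";

print "$broken\n";
print "$fixed\n";
```

In the real grammar the equivalent change would be to use `join ' ', @{ $item[3] }` inside the action wherever `$item[3]` is currently interpolated.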