tphyahoo has asked for the wisdom of the Perl Monks concerning the following question:

Monks, this is a follow-up to Parse::RecDescent and nesting, where b10m wanted to parse a bit of text with nested, sort-of HTML-like tags. I came up with a partial solution at Parse RecDescent nesting solved (95%). On second thought, though, I'm not sure my 95% was justified, because getting the last 5% is tripping me up. (Note: all code snippets that follow were copied from my post.)

My input is

__DATA__ <open> tag 1 <close> notatag 1 <open> tag 2 <close> notatag 2 <open> tag 3 a <open> tag 3 b <close> tag 3 c <close> notatag 3 <open> tag 4 a <open> tag 4 b <close> <open> tag 4 c <open> tag 4 d <close> tag 4 e <close> tag 4 f <close> notatag 4
Output is
#tag: <open> tag 1 <close>
#tag: <open> tag 2 <close>
#tag: <open> tag 3 b <close>
#tag: <open> tag 3 a ARRAY(0x1a55ad8) tag 3 c <close>
#tag: <open> tag 4 b <close>
#tag: <open> tag 4 d <close>
#tag: <open> tag 4 c
# ARRAY(0x1a8f664) tag 4 e
# <close>
#tag: <open> tag 4 a
# ARRAY(0x1a8f670) tag 4 f
# <close>
That first ugly hexref should be "<open> tag 3 b <close>", the second should be
<open> tag 4 d <close> tag 4 e
and so on. Oh, my grammar was
q(
    start: chunk(s)
    chunk: tag | raw
    tag : stag raw etag
          { print "tag: $item[1] $item[2] $item[3]\n";
            #$return = "$item{stag} $item{raw} $item{etag}"
            $return = "$item[1] $item[2] $item[3]"; }
        | stag raw tag(s) raw etag
          { print "tag: $item[1] $item[2] $item[3] $item[4] $item[5]\n";
            #$return = "$item{stag} $item{raw} $item{etag}"
            $return = "$item[1] $item[2] $item[3] $item[4] $item[5]"; }
Short rant. I know the "right" way to parse HTML (which is what I really want to do, and, I suspect, also what b10m really wants) is to learn HTML::TokeParser/HTML::TreeBuilder, or some combination of the other HTML:: modules. But I am hesitant to do that, or at any rate to do *only* that, for more or less the reasons laid out by BrowserUk at Being a heretic and going against the party line, which was a follow-up to Parsing HTML tags with regex.

Maybe I'm the only monk that feels this way, but my gut tells me that parsing HTML should *feel* similar to the usual parsing I do with regexes... repeats should be done with * or +, groupings should be extracted with (), etc. I don't *want* to learn a whole 'nother parsing notation that only works in one particular context (HTML). And, with Perl 6 "super-regexes" (aka rules), this may actually be feasible. That comes sometime in the future, but what we have now, regexes plus Parse::RecDescent, should feel similar to what we ultimately wind up with when Perl 6 finally goes live (Damian Conway, in charge of Perl 6 rules, adopted a lot of stuff from the existing Parse::RecDescent).

Googling wasn't all that helpful.

So... Can some Parse::RecDescent master show me where I've been going wrong?

Thanks!

thomas

Replies are listed 'Best First'.
Re: Parse RecDescent Nesting (Followup)
by merlyn (Sage) on Jan 21, 2005 at 13:23 UTC
    I'm doing this quickly, as I'm rushing off to the airport for a cross-US trip, but it looks like your grammar is something like:
    document: chunk(s?) /\z/
    chunk: open chunk close | word(s)
    open: "<open>"
    close: "<close>"
    word: /\w+/
    That should be enough to hang the proper return values from.

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.


    update:
    document: stuff /\z/ { return $item{stuff} }
    stuff: chunk(s?) { return [map {@$_} @{$item[1]} ] } # flatten adjacent arrayrefs
    chunk: open stuff close { return $item{stuff} }
         | word(s?) # returns arrayref
    open: "<open>"
    close: "<close>"
    word: /\w+/
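The flattening idiom in the "stuff" rule can be tried on its own, outside any grammar. This is only a sketch in plain Perl: the sample data and the helper name flatten are made up for illustration, on the assumption that chunk(s?) hands back one arrayref per matched chunk.

```perl
use strict;
use warnings;

# Hypothetical stand-in for what chunk(s?) would put in $item[1]:
# an arrayref holding one arrayref of words per matched chunk.
my @chunks = ( [qw(tag 3 a)], [qw(tag 3 b)], [qw(tag 3 c)] );

# Same expression as in the "stuff" rule: dereference each inner
# arrayref and collect all the words into one flat arrayref.
sub flatten { return [ map { @$_ } @{ $_[0] } ] }

my $flat = flatten(\@chunks);
print scalar(@$flat), " words: @$flat\n";
```

An empty inner arrayref simply contributes nothing, so chunks that matched no words would disappear from the flat list.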
      Thanks again Randal (for the update). The stuff to flatten the arrayrefs looks promising, and I am trying to integrate that into what I have.

      The updated grammar seemed dodgy to me though, because it had a circular reference -- "stuff" has "chunks", and "chunks" have "stuff". Or are circular references allowed/desirable/cool in a grammar specification? Trying to understand what this was doing made my brain swim.

      Best, thomas.

      Note: this was second reply to Randal.

        Circular references are what permit nested open-close pairs. I wrote it the way I did so that the grammar is never ambiguous, and yet permits empty open-close pairs, or empty adjacent opens or closes. Allowing "empty" can get you into trouble in a grammar. And that fixes the problem from before.

        -- Randal L. Schwartz, Perl hacker
        Be sure to read my standard disclaimer if this is a reply.

        The "circular references", or, as they're more commonly known, "recursive definitions", are precisely what differentiates a grammar from a flat, regular pattern, and the thing Perl's regex-engine can't properly handle, making a parser like Parse::RecDescent useful.
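To see why the recursion is what buys the nesting, here is a hand-rolled recursive-descent sketch in plain Perl that mirrors the shape of the updated grammar (document, stuff, chunk). It is not what Parse::RecDescent generates; the sub names and the tokenizer are assumptions made for illustration, and like the grammar above it flattens the words rather than keeping the nesting.

```perl
use strict;
use warnings;

# Split the input into <open>, <close>, and word tokens.
sub tokenize { return [ $_[0] =~ /<open>|<close>|\w+/g ] }

# stuff: chunk(s?)  -- returns a flat arrayref of words
sub parse_stuff {
    my ($tokens) = @_;
    my @words;
    while (@$tokens && $tokens->[0] ne '<close>') {
        push @words, @{ parse_chunk($tokens) };
    }
    return \@words;
}

# chunk: open stuff close | word
sub parse_chunk {
    my ($tokens) = @_;
    my $tok = shift @$tokens;
    return [$tok] unless $tok eq '<open>';
    # The recursion: chunk calls stuff, which calls chunk again.
    my $inner = parse_stuff($tokens);
    die "missing <close>\n" unless @$tokens && shift(@$tokens) eq '<close>';
    return $inner;
}

# document: stuff /\z/  -- all input must be consumed
sub parse_document {
    my ($text) = @_;
    my $tokens = tokenize($text);
    my $words  = parse_stuff($tokens);
    die "unexpected <close>\n" if @$tokens;
    return $words;
}

my $words = parse_document('<open> tag 3 a <open> tag 3 b <close> tag 3 c <close>');
print "@$words\n";
```

Each <open> pushes one more call to parse_stuff onto the stack, so the depth of nesting the parser accepts is bounded only by the input, which is exactly what the mutual reference between "stuff" and "chunk" expresses in the grammar.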
Re: Parse RecDescent Nesting (Followup)
by tphyahoo (Vicar) on Jan 21, 2005 at 13:48 UTC
    Thanks, Randal. I think what you suggested doesn't allow for something like

    <open>stuff <open>stuff <close><close>

    although it would allow

    <open><open>stuff <close><close>

    Or maybe I misunderstood. At any rate, it's food for thought and I will ponder on it (and report back if you were right and I was wrong...).

    Have a nice trip!

    thomas. Note for clarity: this was first reply to Randal (before his update), though it appears here lower in the string

Re: Parse RecDescent Nesting (Followup)
by tphyahoo (Vicar) on Jan 25, 2005 at 10:19 UTC
    Just found this.
    From the Parse::RecDescent FAQ:

    CAPTURING MATCHES
    Hey! I'm getting back ARRAY(0x355300) instead of what I set $return to!

    Here's a prime example of when this mistake is made:
    QuotedText: DoubleQuote TextChar(s?) DoubleQuote
                { my $chars = scalar(@item) - 1;
                  $return = join ('', @item[2..$chars]) }
    This rule is incorrectly written. The author thinks that @item will have one TextChar from position 2 until all TextChars are matched. However, the true structure of @item is:

    position one: the string matched by rule DoubleQuote
    position two: array reference representing parse tree for TextChar(s?)
    position three: the string matched by rule DoubleQuote

    Note that position two is an array reference. So the rule must be rewritten in this way.
    QuotedText: DoubleQuote TextChar(s?) DoubleQuote { $return = join ( '', @{$item[2]} ) }
    ************************
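The same mistake can be reproduced without a parser. The sketch below fakes an @item array in plain Perl; the contents are made up to resemble the second "tag" production from the question, assuming tag(s) delivers an arrayref in position three.

```perl
use strict;
use warnings;

# Hypothetical stand-in for @item in "stag raw tag(s) raw etag".
# Index 0 is the rule name; index 3 is the arrayref from tag(s).
my @item = ('tag', '<open>', 'tag 3 a', ['<open> tag 3 b <close>'], 'tag 3 c', '<close>');

# Interpolating the reference directly prints its address.
my $wrong = "$item[1] $item[2] $item[3] $item[4] $item[5]";

# Dereferencing and joining first gives the nested text.
my $right = join ' ', $item[1], $item[2], @{ $item[3] }, $item[4], $item[5];

print "wrong: $wrong\n";
print "right: $right\n";
```

The "wrong" line shows an ARRAY(0x...) address in the middle, as in the output above, while the "right" line reads as the full nested tag.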
    Further recommended reading.

    When to use Text::Balanced or Regexp::Common::balanced rather than P::RD: Parse::RecDescent