Monks, this is a follow-up to Parse::RecDescent and nesting, where b10m wanted to parse a bit of text with nested, sort-of HTML-like tags. I came up with a partial solution at Parse RecDescent nesting solved (95%). On second thought, though, I'm not sure my 95% was justified, because getting the last 5% is tripping me up. (Note: all code snippets that follow were copied from that post.)

My input is

__DATA__
<open> tag 1 <close>
notatag 1
<open> tag 2 <close>
notatag 2
<open> tag 3 a <open> tag 3 b <close> tag 3 c <close>
notatag 3
<open> tag 4 a <open> tag 4 b <close> <open> tag 4 c <open> tag 4 d <close> tag 4 e <close> tag 4 f <close>
notatag 4

Output is
#tag: <open> tag 1 <close>
#tag: <open> tag 2 <close>
#tag: <open> tag 3 b <close>
#tag: <open> tag 3 a ARRAY(0x1a55ad8) tag 3 c <close>
#tag: <open> tag 4 b <close>
#tag: <open> tag 4 d <close>
#tag: <open> tag 4 c
# ARRAY(0x1a8f664) tag 4 e
# <close>
#tag: <open> tag 4 a
# ARRAY(0x1a8f670) tag 4 f
# <close>

That first ugly hexref should be "<open> tag 3 b <close>"; the second should be
<open> tag 4 d <close> tag 4 e
and so on. Oh, my grammar was
q(
start: chunk(s)

chunk: tag | raw

tag : stag raw etag
      { print "tag: $item[1] $item[2] $item[3]\n";
        #$return = "$item{stag} $item{raw} $item{etag}"
        $return = "$item[1] $item[2] $item[3]";
      }
    | stag raw tag(s) raw etag
      { print "tag: $item[1] $item[2] $item[3] $item[4] $item[5]\n";
        #$return = "$item{stag} $item{raw} $item{etag}"
        $return = "$item[1] $item[2] $item[3] $item[4] $item[5]";
      }

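As an aside, here is a minimal plain-Perl sketch (no Parse::RecDescent involved) of one way a string like ARRAY(0x1a55ad8) can end up in the output: interpolating an array reference into a double-quoted string. It rests on the assumption that $item[3] in the second production holds an array reference because tag(s) is a repetition. The helper name flatten and the sample value are hypothetical, made up for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Parse::RecDescent: turn a value that is
# either a plain string or a (possibly nested) array reference into one string.
sub flatten {
    my ($value) = @_;
    return ref($value) eq 'ARRAY'
        ? join(' ', map { flatten($_) } @$value)
        : $value;
}

# Stand-in for what a repetition such as tag(s) might hand back (assumption).
my $nested = [ '<open> tag 3 b <close>' ];

# Interpolating the reference itself prints its address, e.g. ARRAY(0x...).
print "interpolated: $nested\n";

# Flattening first prints the text the reference points at.
print "flattened:    ", flatten($nested), "\n";
```

If that assumption holds, something like "@{$item[3]}" in place of "$item[3]" inside the action would interpolate the elements rather than the reference; I have not verified that against the full grammar.
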
Short rant. I know the "right" way to parse HTML (which is what I really want to do, and, I suspect, also what b10m really wants) is to learn HTML::TokeParser or HTML::TreeBuilder, or some combination of the other HTML:: modules. But I am hesitant to do that, or at any rate to do *only* that, for more or less the reasons laid out by BrowserUk at Being a heretic and going against the party line, which was a follow-up to Parsing HTML tags with regex.

Maybe I'm the only monk who feels this way, but my gut tells me that parsing HTML should *feel* similar to the usual parsing I do with regexes: repeats should be done with * or +, groupings should be extracted with (), etc. I don't *want* to learn a whole 'nother parsing notation that only works in one particular context (HTML). And with Perl 6 "super-regexes" (aka rules), this may actually be feasible. That comes sometime in the future, but what we have now, regexes plus Parse::RecDescent, should feel similar to what we ultimately wind up with when Perl 6 finally goes live (Damian Conway, in charge of Perl 6 rules, adopted a lot of stuff from the existing Parse::RecDescent).

Googling wasn't all that helpful.

So... can some Parse::RecDescent master show me where I've been going wrong?

Thanks!

thomas


In reply to Parse RecDescent Nesting (Followup) by tphyahoo
