
Monks, this is a follow-up to Parse::RecDescent and nesting, where b10m wanted to parse a bit of text with nested, sort-of HTML-like tags. I came up with a partial solution at Parse RecDescent nesting solved (95%). On second thought, though, I'm not sure my 95% was justified, because getting the last 5% is tripping me up. (Note: all code snippets that follow were copied from my post.)

My input is

__DATA__
<open> tag 1 <close> 
notatag 1
<open> tag 2 <close>
notatag 2
<open> tag 3 a <open> tag 3 b <close> tag 3 c <close>
notatag 3
<open> 
    tag 4 a 
    <open> tag 4 b <close> 
    <open> 
        tag 4 c 
        <open> tag 4 d <close>
        tag 4 e 
    <close> tag 4 f
<close>
notatag 4

Output is

#tag: <open> tag 1  <close>
#tag: <open> tag 2  <close>
#tag: <open> tag 3 b  <close>
#tag: <open> tag 3 a  ARRAY(0x1a55ad8) tag 3 c  <close>
#tag: <open> tag 4 b  <close>
#tag: <open> tag 4 d  <close>
#tag: <open> tag 4 c
#         ARRAY(0x1a8f664) tag 4 e
#     <close>
#tag: <open> tag 4 a
#     ARRAY(0x1a8f670) tag 4 f
# <close>

That first ugly ARRAY(0x...) reference should be "<open> tag 3 b <close>", the second should be

        <open> tag 4 d <close>
        tag 4 e

and so on. Oh, my grammar was

q(    start: chunk(s)
        chunk: tag | raw
        tag  : stag raw etag
        { 
        print "tag: $item[1] $item[2] $item[3]\n";
        #$return = "$item{stag} $item{raw} $item{etag}" 
        $return = "$item[1] $item[2] $item[3]";
        }
        | stag raw tag(s) raw etag
        { 
        print "tag: $item[1] $item[2] $item[3] $item[4] $item[5]\n";
        #$return = "$item{stag} $item{raw} $item{etag}" 
        $return = "$item[1] $item[2] $item[3] $item[4] $item[5]"; 
        }
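For anyone who wants to reproduce the symptom without the full grammar, here is a minimal core-Perl sketch of what I think those ARRAY(0x...) strings are. It assumes that a repetition such as tag(s) hands the action a reference to an array of the sub-results (that is my reading of the Parse::RecDescent docs, not something I have confirmed against this grammar). The helper name flatten_item and the sample data are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a value that may be a (possibly nested)
# array reference into a single space-joined string.
sub flatten_item {
    my ($value) = @_;
    return $value unless ref($value) eq 'ARRAY';
    return join ' ', map { flatten_item($_) } @$value;
}

# Stand-in for what a repeated subrule is assumed to return:
# a reference to an array of already-built tag strings.
my $subtags = [ '<open> tag 3 b <close>' ];

# Interpolating the reference itself produces the ARRAY(0x...) form.
my $naive = "<open> tag 3 a $subtags tag 3 c <close>";

# Flattening it first produces the text of the nested tag instead.
my $joined = '<open> tag 3 a ' . flatten_item($subtags) . ' tag 3 c <close>';

print "naive:  $naive\n";
print "joined: $joined\n";
```

If the assumption holds, the same idea would apply to $item[3] in the second production of the grammar above, though I have not verified that this is the whole story.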

Short rant. I know the "right" way to parse HTML (which is what I really want to do, and, I suspect, also what b10m really wants) is to learn HTML::TokeParser or HTML::TreeBuilder, or some combination of the other HTML:: modules. But I am hesitant to do that, or at any rate to do *only* that, for more or less the reasons laid out by BrowserUk at Being a heretic and going against the party line, which was a follow-up to Parsing HTML tags with regex.

Maybe I'm the only monk who feels this way, but my gut tells me that parsing HTML should *feel* similar to the usual parsing I do with regexes: repeats should be done with * or +, groupings should be extracted with (), etc. I don't *want* to learn a whole 'nother parsing notation that only works in one particular context (HTML). And with Perl 6 "super-regexes" (aka rules), this may actually be feasible. That comes sometime in the future, but what we have now, regexes plus Parse::RecDescent, should feel similar to what we ultimately wind up with when Perl 6 finally goes live (Damian Conway, in charge of Perl 6 rules, adopted a lot of stuff from the existing Parse::RecDescent).

Googling wasn't all that helpful.

So... can some Parse::RecDescent master show me where I've been going wrong?

Thanks!

thomas

In reply to Parse RecDescent Nesting (Followup) by tphyahoo
