tphyahoo has asked for the wisdom of the Perl Monks concerning the following question:
My input is

    __DATA__
    <open> tag 1 <close> notatag 1 <open> tag 2 <close> notatag 2 <open> tag 3 a <open> tag 3 b <close> tag 3 c <close> notatag 3 <open> tag 4 a <open> tag 4 b <close> <open> tag 4 c <open> tag 4 d <close> tag 4 e <close> tag 4 f <close> notatag 4

Output is

    #tag: <open> tag 1 <close>
    #tag: <open> tag 2 <close>
    #tag: <open> tag 3 b <close>
    #tag: <open> tag 3 a ARRAY(0x1a55ad8) tag 3 c <close>
    #tag: <open> tag 4 b <close>
    #tag: <open> tag 4 d <close>
    #tag: <open> tag 4 c # ARRAY(0x1a8f664) tag 4 e # <close>
    #tag: <open> tag 4 a # ARRAY(0x1a8f670) tag 4 f # <close>

That first ugly hexref should be "<open> tag 3 b <close>", the second should be

    <open> tag 4 d <close> tag 4 e

and so on. Oh, my grammar was

    q(
    start: chunk(s)
    chunk: tag | raw
    tag : stag raw etag
            { print "tag: $item[1] $item[2] $item[3]\n";
              #$return = "$item{stag} $item{raw} $item{etag}"
              $return = "$item[1] $item[2] $item[3]";
            }
        | stag raw tag(s) raw etag
            { print "tag: $item[1] $item[2] $item[3] $item[4] $item[5]\n";
              #$return = "$item{stag} $item{raw} $item{etag}"
              $return = "$item[1] $item[2] $item[3] $item[4] $item[5]";
            }

Short rant. I know the "right" way to parse HTML (which is what I really want to do, and, I suspect, also what b10m really wants) is to learn HTML::TokeParser or HTML::TreeBuilder, or some combination of the other HTML:: modules. But I am hesitant to do that, or at any rate to do *only* that, for more or less the reasons laid out by BrowserUk at Being a heretic and going against the party line, which was a followup to Parsing HTML tags with regex.
Maybe I'm the only monk that feels this way, but my gut tells me that parsing HTML should *feel* similar to the usual parsing I do with regexes: repeats should be done with * or +, groupings should be extracted with (), etc. I don't *want* to learn a whole 'nother parsing notation that only works in one particular context (HTML). And, with Perl 6 "super-regexes" (aka rules), this may actually be feasible. That comes sometime in the future, but what we have now, regexes plus Parse::RecDescent, should feel similar to what we ultimately wind up with when Perl 6 finally goes live (Damian Conway, in charge of Perl 6 rules, adopted a lot of stuff from the existing Parse::RecDescent).
Googling wasn't all that helpful.
So... can some Parse::RecDescent master show me where I've been going wrong?
Thanks!
thomas
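A note on the ARRAY(0x...) strings in the output above: that is what Perl prints whenever an array reference is interpolated into a string, and in Parse::RecDescent a repetition such as tag(s) hands the action a reference to an array of the subrule values. The sketch below reproduces the effect in plain Perl, without Parse::RecDescent. The variable $tags is only a stand-in for what $item[3] might hold in the second production, and flatten_item is a hypothetical helper, not part of any module. It is an illustration under those assumptions, not a fix for the full grammar.

```perl
use strict;
use warnings;

# Stand-in for the value a repetition like tag(s) passes to an action:
# a reference to an array with one entry per matched subrule.
my $tags = [ '<open> tag 3 b <close>' ];

# Interpolating the reference directly produces the "hexref".
my $naive = "<open> tag 3 a $tags tag 3 c <close>";

# Hypothetical helper: join array refs (recursively), pass strings through.
sub flatten_item {
    my ($item) = @_;
    return ref($item) eq 'ARRAY'
        ? join( ' ', map { flatten_item($_) } @$item )
        : $item;
}

my $fixed = join ' ', '<open> tag 3 a', flatten_item($tags), 'tag 3 c <close>';

print "$naive\n";    # contains ARRAY(0x...)
print "$fixed\n";    # contains the nested tag text instead
```

Inside a grammar action the same idea could be written inline, for example by using join(' ', @{$item[3]}) in place of a bare $item[3] when building $return.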
Replies are listed 'Best First'.

Re: Parse RecDescent Nesting (Followup)
by merlyn (Sage) on Jan 21, 2005 at 13:23 UTC
by tphyahoo (Vicar) on Jan 21, 2005 at 15:53 UTC
by merlyn (Sage) on Jan 21, 2005 at 23:55 UTC
by bart (Canon) on Jan 22, 2005 at 13:07 UTC

Re: Parse RecDescent Nesting (Followup)
by tphyahoo (Vicar) on Jan 21, 2005 at 13:48 UTC

Re: Parse RecDescent Nesting (Followup)
by tphyahoo (Vicar) on Jan 25, 2005 at 10:19 UTC