in reply to Parse::RecDescent Woes

This answer assumes that you stick with ECD as the file format. With < allowed in raw text, the trick is determining whether a given < starts a tag or is just part of the raw text. I think the rule you want is a lookahead like this one, which matches < if and only if it's the start of a tag, endtag, or unitag:

    m{<(?=(?:/[a-zA-Z]+|[a-zA-Z]+/?)>)}
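To see which < characters that lookahead picks out, here is a minimal sketch. The sample string is made up for illustration and is not from the original thread; it only assumes that tag names are plain ASCII letters, as in the regex above.

```perl
use strict;
use warnings;

# Lookahead from the post: matches '<' only when a tag, endtag, or unitag follows.
my $tag_start = qr{<(?=(?:/[a-zA-Z]+|[a-zA-Z]+/?)>)};

# Hypothetical sample input (an assumption, not real ECD data).
my $text = 'if a < b then <b>bold</b> and <br/> but not <3 or </>';

my @offsets;
while ($text =~ /$tag_start/g) {
    push @offsets, $-[0];    # $-[0] is the offset where the matched '<' begins
}

# The lone '<', the '<3', and the '</>' are skipped; only real tags are reported.
print "tag starts at offsets: @offsets\n";    # prints: tag starts at offsets: 14 21 30
```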

In that case, I think a regex like this would work as the definition for rawtext:

    m{(?:[^<]|<(?!(?:/[a-zA-Z]+|[a-zA-Z]+/?)>))+}

I did a few quick tests of this regex with demo_simpleXML.pl, and it worked as intended.
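For anyone who wants to try the rawtext pattern outside the grammar, a small standalone check might look like the following. The input string is invented for the example, and this runs the regex on its own rather than through Parse::RecDescent, so it says nothing about how whitespace skipping behaves in your actual grammar.

```perl
use strict;
use warnings;

# rawtext pattern from the post: any run of non-'<' characters,
# plus any '<' that is NOT followed by something that looks like a tag.
my $rawtext = qr{(?:[^<]|<(?!(?:/[a-zA-Z]+|[a-zA-Z]+/?)>))+};

# Hypothetical input: '<' as an operator, '<2>' (not a tag, since it holds a digit),
# then a genuine <b> tag that should end the raw text.
my $input = 'x < y and 1 <2> stays raw<b>bold</b>';

my ($raw) = $input =~ /\A($rawtext)/;
print "rawtext: [$raw]\n";    # prints: rawtext: [x < y and 1 <2> stays raw]
```

The match stops right before <b>, so the next token the parser sees would be the tag itself.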

(The redundant [a-zA-Z]+ could be eliminated using the (?(condition)...) regex feature, added in Perl 5.005:

    m{(?:[^<]|<(?!(/)?[a-zA-Z]+(?(1)|/?)>))+}

If the (/) matches, then (?(1)|/?) matches the empty string; if the (/) does not match, then (?(1)|/?) matches /?. So the / can be at the beginning or the end, but not both.)
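A quick way to convince yourself that the two forms agree is to run both over a handful of strings and compare what each one matches. This is only a sketch: the sample strings are my own, and agreeing on them is not a proof of equivalence. One caveat it illustrates: (?(1)...) refers to capture group 1 by number, so the conditional version must not be wrapped in an extra capturing group, which is why the code reads $& instead of capturing.

```perl
use strict;
use warnings;

my $long  = qr{(?:[^<]|<(?!(?:/[a-zA-Z]+|[a-zA-Z]+/?)>))+};
my $short = qr{(?:[^<]|<(?!(/)?[a-zA-Z]+(?(1)|/?)>))+};

# Hypothetical test strings covering tags, endtags, unitags, and near misses.
my @samples = ('a < b', 'text<b>', 'text</b>', 'text<br/>',
               'text</br/>more', '<3 <> </>', '<b/>x');

my $mismatches = 0;
for my $s (@samples) {
    # Use $& rather than an outer capture so that (/) stays group 1 in $short.
    my $x = $s =~ /\A$long/  ? $& : '';
    my $y = $s =~ /\A$short/ ? $& : '';
    $mismatches++ if $x ne $y;
    printf "%-16s long=[%s] short=[%s]\n", $s, $x, $y;
}
print "mismatches: $mismatches\n";
```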

Re: Re: Parse::RecDescent Woes
by beppu (Hermit) on Jan 05, 2001 at 06:58 UTC
    chipmunk, you are a master. Thanks so much for that regular expression.

    I predict that you will defeat Ovid in the Iron Perl Monks battle. Nothing against Ovid, of course. I'm sure he's a nice guy.