in reply to HTML::TreeBuilder, nesting with header vs font, different behavior

The short answer is: probably not with HTML::TreeBuilder, but maybe with a more general SGML parser.

The longer answer is that what you're parsing is not, strictly speaking, valid HTML. Header tags have not been allowed to be nested in *any* version of HTML, *ever*, not even in the ridiculous horrible terrible awful unparseable messy HTML of the Netscape 3-4 era, and *certainly* not in any vaguely recent W3C specification. Consequently, an HTML parser is very unlikely to preserve such a construct. Frankly, if it did, I would call that a bug.

It *is* possible to rig up a parser that *can* preserve such things, but it would probably have to be based on a general SGML parser rather than something HTML-specific, since, as noted, what you're parsing isn't technically HTML. And it raises the question of why you would *want* to preserve nested header tags. If it were me, I would want that sort of thing to go away, fast.
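
For what it's worth, here is a rough sketch of what "rigging up" such a parser might look like. It is not an SGML parser and it is not HTML::TreeBuilder: it just takes the raw start/end/text events from a tokenizer and stacks them up exactly as written, without applying any HTML content rules. I've used Python's standard html.parser purely for illustration (Perl's HTML::Parser hands you analogous start, end, and text events); the names NestingPreserver and parse_preserving_nesting are made up for this example.

```python
from html.parser import HTMLParser


class NestingPreserver(HTMLParser):
    """Build a naive tree that mirrors the tags exactly as written.

    No HTML content model is applied, so a header inside a header
    stays nested instead of being closed implicitly.
    Limitation: void elements written without a slash (e.g. <br>)
    are treated as open containers in this sketch.
    """

    def __init__(self):
        super().__init__()
        self.root = {"tag": "#root", "children": []}
        self.stack = [self.root]  # currently open elements, innermost last

    def handle_starttag(self, tag, attrs):
        # Attach the new element to whatever is currently open, then open it.
        node = {"tag": tag, "children": []}
        self.stack[-1]["children"].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest open element with this name;
        # a stray end tag with no matching open element is ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Keep text chunks that are not pure whitespace.
        if data.strip():
            self.stack[-1]["children"].append(data)


def parse_preserving_nesting(markup):
    parser = NestingPreserver()
    parser.feed(markup)
    parser.close()
    return parser.root


tree = parse_preserving_nesting("<h1>Outer <h2>Inner</h2> tail</h1>")
h1 = tree["children"][0]
h2 = h1["children"][1]
print(h1["tag"], "contains", h2["tag"])
```

Note that this only preserves what was typed. It does nothing to make the result meaningful as HTML, which is rather the point of the objection above.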


"In adjectives, with the addition of inflectional endings, a changeable long vowel (Qamets or Tsere) in an open, propretonic syllable will reduce to Vocal Shewa. This type of change occurs when the open, pretonic syllable of the masculine singular adjective becomes propretonic with the addition of inflectional endings."  — Pratico & Van Pelt, BBHG, p68