in reply to HTML::TreeBuilder, nesting with header vs font, different behavior
The short answer is: probably not with HTML::TreeBuilder, but maybe with a more general SGML parser.
The longer answer is that what you're parsing is not, strictly speaking, valid HTML. Header tags have not been allowed to be nested in *any* version of HTML, *ever*, not even in the ridiculous horrible terrible awful unparseable messy HTML of the Netscape 3-4 era, and *certainly* not in any vaguely recent W3C specification. Consequently, an HTML parser is very unlikely to preserve such a construct. Frankly, if it did, I would call that a bug.
It *is* possible to rig up a parser that *can* preserve such things, but it would probably have to be based on a general SGML parser rather than something HTML-specific, since, as noted, what you're parsing isn't technically HTML. And it raises the question of why you would *want* to preserve nested header tags. If it were me, I would want that sort of thing to go away, fast.
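If you only need to *see* where the nesting occurs, rather than build a tree that keeps it, a rough sketch along these lines might do. Note that this is not the SGML-parser approach mentioned above: it swaps in the event-based HTML::Parser, which hands you start and end tags as they appear in the source and does not try to repair the nesting. The sample markup and the variable names (@open, @nested) are made up for illustration.

```perl
use strict;
use warnings;
use HTML::Parser ();

my @open;      # stack of header tags currently open
my @nested;    # one entry per header found inside another header

my $p = HTML::Parser->new(
    api_version => 3,
    # 'tagname' passes the lowercased tag name to each handler
    start_h => [ sub {
        my ($tag) = @_;
        return unless $tag =~ /^h[1-6]$/;
        push @nested, "$tag inside $open[-1]" if @open;
        push @open, $tag;
    }, 'tagname' ],
    end_h => [ sub {
        my ($tag) = @_;
        return unless $tag =~ /^h[1-6]$/;
        # only pop when the close matches the innermost open header
        pop @open if @open && $open[-1] eq $tag;
    }, 'tagname' ],
);

# Hypothetical input, not taken from the original question
$p->parse('<h1>Outer <h2>Inner</h2> text</h1>');
$p->eof;

print "$_\n" for @nested;
```

This only reports the nesting as it appears in the source. Turning that into a tree that preserves it is a separate job, and you would have to write that part yourself.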