tphyahoo has asked for the wisdom of the Perl Monks concerning the following question:

Monks, I am cleaning up thousands of messy HTML files, many with ugly nesting, using HTML::TreeBuilder. The following two bits of messily nested HTML are structured the same, except that one nests with a header tag and the other nests with a font tag. As you can see in the output, only the nesting with font is preserved. I would like both strings to parse identically. Is there some way I can get TreeBuilder to do that?

use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::Element;

my $html_bold = <<END;
<b><h2>
Nested once.
<b><h2>Nested twice.</h2></b>
Nested once.
</h2>
</b>
END

my $tree_bold = HTML::TreeBuilder->new_from_content($html_bold);
print "Bold:\n\n";
$tree_bold->dump();
print "\n";

#The last "nested once" is actually not nested at all, as can be seen from the indenting.
#And the nested twice isn't quite right either.
#
#<html> @0 (IMPLICIT)
#  <head> @0.0 (IMPLICIT)
#  <body> @0.1 (IMPLICIT)
#    <b> @0.1.0
#      <h2> @0.1.0.0
#        " Nested once. "
#        <b> @0.1.0.0.1
#      <h2> @0.1.0.1
#        "Nested twice."
#    " Nested once. "

my $html_font = <<END;
<b><font color="red">
Nested once.
<b><font color="red"><h2>Nested twice.</font></b>
Nested once.
</font>
</b>
END

my $tree_font = HTML::TreeBuilder->new_from_content($html_font);
print "Font:\n\n";
$tree_font->dump();
print "\n";

#Everything is nested like in the original.
#<html> @0 (IMPLICIT)
#  <head> @0.0 (IMPLICIT)
#  <body> @0.1 (IMPLICIT)
#    <b> @0.1.0
#      <font color="red"> @0.1.0.0
#        " Nested once. "
#        <b> @0.1.0.0.1
#          <font color="red"> @0.1.0.0.1.0
#            <h2> @0.1.0.0.1.0.0
#              "Nested twice."
#        " Nested once. "
#      " "

Replies are listed 'Best First'.
Re: HTML::TreeBuilder, nesting with header vs font, different behavior
by jonadab (Parson) on Jun 28, 2005 at 10:54 UTC

    The short answer is: probably not with HTML::TreeBuilder, but maybe with a more general SGML parser.

    The longer answer is that what you're parsing is not, strictly speaking, valid HTML. Header tags have not been allowed to be nested in *any* version of HTML, *ever*, not even in the ridiculous, horrible, terrible, awful, unparseable, messy HTML of the Netscape 3-4 era, and *certainly* not in any vaguely recent W3C specification. Consequently, an HTML parser is very unlikely to preserve such a construct. Frankly, if it did, I would call that a bug.

    It *is* possible to rig up a parser that *can* preserve such things, but it would probably have to be based on a general SGML parser rather than something HTML-specific since, as noted, what you're parsing isn't technically HTML. And it raises the question of why you would *want* to preserve nested header tags. If it were me, I would want that sort of thing to go away, fast.
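
    One way to make nested header tags "go away" before the markup ever reaches a parser is a small pre-processing pass over the raw text. The following is only a rough sketch in core Perl, not anything HTML::TreeBuilder provides: strip_nested_headings is a hypothetical helper name, and the regex-based tag matching assumes there are no heading-like strings inside comments, scripts, or attribute values. It keeps the outermost heading's start and end tags and drops any heading tags opened inside it.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption, not part of any module):
# removes heading tags that are opened while another heading is
# still open, keeping only the outermost start/end pair.
# $depth counts open heading tags; $outer remembers the outermost
# heading's name so its end tag is emitted with the matching level.
# Stray heading end tags (seen at depth 0) are dropped.
sub strip_nested_headings {
    my ($html) = @_;
    my $depth = 0;
    my $outer = '';
    $html =~ s{(<(/?)(h[1-6])\b[^>]*>)}{
        my ( $tag, $slash, $name ) = ( $1, $2, lc $3 );
        my $out = '';
        if ( $slash eq '/' ) {
            $out = "</$outer>" if $depth == 1;
            $depth-- if $depth > 0;
        }
        else {
            if ( $depth == 0 ) {
                $outer = $name;
                $out   = $tag;
            }
            $depth++;
        }
        $out;
    }gie;
    return $html;
}

# The first snippet from the question, flattened onto one line.
my $html_bold = '<b><h2> Nested once. <b><h2>Nested twice.</h2></b> Nested once. </h2> </b>';
print strip_nested_headings($html_bold), "\n";
# prints: <b><h2> Nested once. <b>Nested twice.</b> Nested once. </h2> </b>
```

    The cleaned string could then be handed to HTML::TreeBuilder->new_from_content as in the original post. Whether the resulting tree matches the font case exactly would still need checking against real files, since the parser applies its own rules for block elements inside inline ones.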


    "In adjectives, with the addition of inflectional endings, a changeable long vowel (Qamets or Tsere) in an open, propretonic syllable will reduce to Vocal Shewa. This type of change occurs when the open, pretonic syllable of the masculine singular adjective becomes propretonic with the addition of inflectional endings."  — Pratico & Van Pelt, BBHG, p68
Re: HTML::TreeBuilder, nesting with header vs font, different behavior
by metaperl (Curate) on Jun 29, 2005 at 17:27 UTC
    HTML::PrettyPrinter is based on TreeBuilder and reformats files for readability. If I were doing this, I would use HTML-Tidy, a C program, and manually fix whatever it reported as wrong. It sounds like the HTML is seriously b0rk3d.