in reply to Ensuring HTML is "balanced"

HTML::TreeBuilder is a good answer. It is pretty tolerant of missing close tags and can generate nice HTML output if you ask it nicely. You may also be interested in HTML::Lint, which parses HTML and generates an error report.

use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::Lint;

my $html = do {local $/; (<DATA>)};

my $lint = HTML::Lint->new (only_types => HTML::Lint::Error::STRUCTURE);
$lint->parse ($html);
$lint->eof ();
print "HTML::Lint report:\n";
print join "\n", map {$_->as_string ()} $lint->errors ();

my $tree = HTML::TreeBuilder->new ();
$tree->parse ($html);
$tree->eof ();
print "\n\nTreeBuilder cleaned up HTML\n";
print $tree->as_HTML ();

__DATA__
<p><b><i>test</b></p>

Prints:

HTML::Lint report:
(1:14) <i> at (1:7) is never closed
(1:18) <body> tag is required
(1:18) <head> tag is required
(1:18) <html> tag is required
(1:18) <title> tag is required

TreeBuilder cleaned up HTML
<html><head></head><body><p><b><i>test</i></b></body></html>

DWIM is Perl's answer to Gödel

Re^2: Ensuring HTML is "balanced"
by Anonymous Monk on Mar 25, 2008 at 19:03 UTC
    ...So since the cleaned-up HTML in fact has a broken p tag, are we free to assume that the Lint report *and* TreeBuilder handle P tags in an amusing manner?

      Actually, HTML doesn't require that some tags (including p tags) be closed. In particular, section 9.3.1 of the HTML 4.01 specification says:

      Paragraphs: the P element
      Start tag: required, End tag: optional

      so strictly speaking the p tag is not broken.
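
The missing </p> in the TreeBuilder output follows from the same rule: by default as_HTML leaves out certain optional end tags, </p> among them. Here is a minimal sketch, assuming the documented third argument of HTML::Element's as_HTML (a hash ref naming the tags whose end tags may be omitted). Passing an empty hash ref should make it write every end tag out. The helper name html_with_all_end_tags is made up for this example.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: parse a fragment and serialise it with every end tag
# written out, including the ones HTML 4.01 lets you omit.
sub html_with_all_end_tags {
    my ($html) = @_;

    my $tree = HTML::TreeBuilder->new ();
    $tree->parse ($html);
    $tree->eof ();

    # as_HTML ($entities, $indent_char, \%optional_end_tags):
    # undef keeps the defaults for the first two arguments,
    # {} means "treat no end tag as optional".
    my $out = $tree->as_HTML (undef, undef, {});
    $tree->delete ();    # release the tree (matters on older HTML::Tree versions)
    return $out;
}

print html_with_all_end_tags ('<p><b><i>test</b></p>');
```

With the sample fragment from the original post, the printed HTML should now carry a closing </p> right after </b>.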


      Perl is environmentally friendly - it saves trees
      This is a common misconception (and one which I think reflects what the standard *SHOULD* be). Even though the closing </p> tag is mandatory in certain other standards, </p> is optional in HTML 4.01, per http://www.w3.org/TR/html401/struct/text.htm and other W3C references:
      9.3.1 Paragraphs: the P element
      ...
      Start tag: required, End tag: optional

      Whether or not this stands in the forthcoming HTML 5.0 standard is unknown.
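
To make the "end tag: optional" rule concrete, here is a rough, core-Perl-only sketch of a balance check that forgives unclosed elements whose end tag is optional. Everything in it is an assumption made for illustration: the helper name unbalanced_tags, the tiny tag tables, and the regex-based tag scan (which will trip over attributes containing ">", scripts, CDATA and so on). It is not how HTML::Lint or HTML::TreeBuilder work internally, and it is no replacement for either.

```perl
use strict;
use warnings;

# Assumption: a deliberately small subset of HTML 4.01, for illustration only.
my %END_TAG_OPTIONAL = map { $_ => 1 } qw(p li dt dd);
my %EMPTY_ELEMENT    = map { $_ => 1 } qw(br hr img input meta link);

# Hypothetical helper: returns a list of complaints about unbalanced tags.
sub unbalanced_tags {
    my ($html) = @_;
    my (@open, @problems);

    # Crude tag scan: captures an optional "/" and the tag name.
    while ($html =~ m{<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>}g) {
        my ($closing, $tag) = ($1, lc $2);
        next if $EMPTY_ELEMENT{$tag};

        if (!$closing) {
            push @open, $tag;
            next;
        }

        # Index of the nearest open element with the same name, if any.
        my ($match) = grep { $open[$_] eq $tag } reverse 0 .. $#open;
        if (!defined $match) {
            push @problems, "</$tag> has no matching start tag";
            next;
        }

        # Anything opened after the match is being closed implicitly.
        # That is only acceptable when its end tag is optional.
        for my $skipped (reverse @open[$match + 1 .. $#open]) {
            push @problems, "<$skipped> is never closed"
                unless $END_TAG_OPTIONAL{$skipped};
        }
        splice @open, $match;
    }

    # Elements still open at the end of input.
    push @problems, map { "<$_> is never closed" }
        grep { !$END_TAG_OPTIONAL{$_} } reverse @open;
    return @problems;
}

# The fragment from the original post: <i> is the only real problem.
print "$_\n" for unbalanced_tags ('<p><b><i>test</b></p>');

# The TreeBuilder output: the unclosed <p> is not reported.
my @left = unbalanced_tags (
    '<html><head></head><body><p><b><i>test</i></b></body></html>');
print "cleaned-up HTML: ", scalar @left, " problem(s)\n";
```

The point of the sketch is the %END_TAG_OPTIONAL lookup: an element that is closed implicitly only counts as unbalanced when its end tag is required, which is why the unclosed <p> in the cleaned-up HTML is not an error under HTML 4.01.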