GrandFather has asked for the wisdom of the Perl Monks concerning the following question:

The code below should round trip a small HTML document. However, the output version has moved the declaration and comment lines from the start of the file to after the body. Is this expected behaviour (if so, why?) or a bug in TreeBuilder?

use strict;
use warnings;
use HTML::TreeBuilder;

my $data = do {local $/ = ""; <DATA>};
my $tree = HTML::TreeBuilder->new;

$tree->store_comments(1);
$tree->store_declarations(1);
$tree->parse ($data);
$tree->eof ();
print $tree->as_HTML(undef, ' ');

__DATA__
<!DOCTYPE html PUBLIC>
<!-- saved from url -->
<html lang="en">
<head>
</head>
<body>
</body>
</html>

The output generated is:

<html lang="en">
<head>
</head>
<body>
</body><!DOCTYPE html PUBLIC><!-- saved from url -->
</html>

Update:

I'm using HTML::TreeBuilder 1.01

Replacing the print line in the code above with the following code works around the problem (this would require something a little smarter in "real" code):

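my @prefix = @{$tree->{_content}}[2..3];             # the declaration and the comment
@{$tree->{_content}} = @{$tree->{_content}}[0..1];   # keep only head and body
print $prefix[0]->as_HTML(undef, ' ');
print $prefix[1]->as_HTML(undef, ' ');
print $tree->as_HTML(undef, ' ');                    # then the document itself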
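
A "little smarter" version might look something like the sketch below. It avoids the hard-coded indices by looking for the '~declaration' and '~comment' pseudo-elements among the root's direct children. The helper name as_html_preamble_first is made up for illustration and is not part of HTML::TreeBuilder. It assumes the tree layout described in this thread; other versions of the module may store these nodes differently.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper (not part of HTML::TreeBuilder): detach every
# declaration and comment that is a direct child of the root element
# and return the document with those nodes serialized first.
sub as_html_preamble_first {
    my ($tree, @as_html_args) = @_;

    # Declarations and comments are stored as pseudo-elements whose
    # tag is '~declaration' or '~comment'.
    my @preamble = grep {
        ref($_) && $_->tag =~ /^~(?:declaration|comment)$/
    } $tree->content_list;

    # Remove them from the tree so they are not emitted twice.
    $_->detach for @preamble;

    return join '',
        (map { $_->as_HTML(@as_html_args) } @preamble),
        $tree->as_HTML(@as_html_args);
}

my $tree = HTML::TreeBuilder->new;
$tree->store_comments(1);
$tree->store_declarations(1);
$tree->parse(<<'HTML');
<!DOCTYPE html PUBLIC>
<!-- saved from url -->
<html lang="en">
<head>
</head>
<body>
</body>
</html>
HTML
$tree->eof;

my $html = as_html_preamble_first($tree, undef, ' ');
print $html;
$tree->delete;
```

Note that this hoists every top-level comment, including any that originally followed the body, so it is only suitable when all such nodes belong in the preamble.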

Perl is Huffman encoded by design.

Re: HTML::TreeBuilder bug or feature?
by gam3 (Curate) on Sep 07, 2005 at 03:23 UTC
    From the HTML::TreeBuilder man page:
      $root->store_declarations(value)
          This determines whether TreeBuilder will normally
          store markup declarations found while parsing content
          into $root.  Currently, this is off by default.
    
          It is somewhat of a known bug (to be fixed one of 
          these days, if anyone needs it?) that declarations in 
          the preamble (before the "html" start-tag) end up
          actually under the "html" element.
    
    So that makes it a bug.
    -- gam3
    A picture is worth a thousand words, but takes 200K.
Re: HTML::TreeBuilder bug or feature?
by pg (Canon) on Sep 07, 2005 at 03:58 UTC

    Update: I saw GrandFather's reply. There was some misunderstanding, which I guess was my fault; I could have expressed myself more clearly. I have never doubted the merit of his code. The intention of this reply was not to question whether his original HTML was valid. What I am saying is that as_HTML converts a valid document into an invalid one, and that is a bug. To prove that, I needed an HTML document that is valid in a stricter sense and can pass the validation service, and that is the only reason I modified the original code. Otherwise, the HTML both before and after conversion fails validation, and I cannot prove my point. No worries, GrandFather ;-)

    =========================================

    By the HTML 4.01 specification, this is a bug. You do not need to be familiar with the specification; we can use the W3C validation service to check the HTML documents in this reply.

    I modified your code a little bit to contain a valid HTML document. The HTML document passed W3C validation as tentatively valid.

    use strict;
    use warnings;
    use HTML::TreeBuilder;

    my $data = do {local $/ = ""; <DATA>};
    my $tree = HTML::TreeBuilder->new;

    $tree->store_comments(1);
    $tree->store_declarations(1);
    $tree->parse ($data);
    $tree->eof ();
    print $tree->as_HTML();

    __DATA__
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
    <HTML>
    <HEAD>
    <TITLE>My first HTML document</TITLE>
    </HEAD>
    <BODY>
    <P>Hello world!
    </BODY>
    </HTML>

    Running this program generates:

    <html><head><title>My first HTML document</title></head><body><p>Hello world! </body><!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"></html>

    This generated HTML does not pass the validation service, which complains that the DOCTYPE cannot be found and that it is misplaced.
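
For a quick local check that does not need the validator, something like the following sketch could be used. It only tests whether a DOCTYPE is the first markup in the document (allowing leading whitespace and comments), so it is no substitute for real validation. The helper name doctype_comes_first is hypothetical, and the two sample strings are shortened stand-ins for the documents above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the first markup in $html is a
# DOCTYPE declaration (leading whitespace and comments are allowed),
# and 0 otherwise.
sub doctype_comes_first {
    my ($html) = @_;
    return $html =~ /\A(?:\s|<!--.*?-->)*<!DOCTYPE\b/is ? 1 : 0;
}

# Shortened stand-ins for the input document and the generated output.
my $doctype   = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
              . '"http://www.w3.org/TR/html4/strict.dtd">';
my $original  = "$doctype<HTML><HEAD><TITLE>t</TITLE></HEAD><BODY><P>x</BODY></HTML>";
my $generated = "<html><head><title>t</title></head><body><p>x</body>$doctype</html>";

for my $case ([original => $original], [generated => $generated]) {
    my ($name, $html) = @$case;
    printf "%-10s %s\n", "$name:",
        doctype_comes_first($html) ? 'DOCTYPE first'
                                   : 'DOCTYPE missing or misplaced';
}
```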

      I trimmed the !DOCTYPE tag contents down because I didn't need more than that to demonstrate the problem, which is that the declaration tag and the comment tag move to after the body when the HTML is generated. Actually, that is the way they are stored in $tree. Moving the two problematic entries in $tree->{_content} to the start of the array "fixes" the problem.

      The sample "HTML" is not intended to be valid beyond the extent needed to demonstrate the problem.
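
To see how the nodes are stored, a sketch along these lines lists the tags of the root element's direct children in their stored order, using the public content_list and tag methods instead of reaching into $tree->{_content}. The exact list printed is an assumption that depends on the installed version of the module, so no expected output is shown.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->store_comments(1);
$tree->store_declarations(1);
$tree->parse(<<'HTML');
<!DOCTYPE html PUBLIC>
<!-- saved from url -->
<html lang="en">
<head>
</head>
<body>
</body>
</html>
HTML
$tree->eof;

# Tag of each direct child of the root, in stored order. Declarations
# and comments appear as '~declaration' and '~comment'; plain text
# children are not objects, so they are labelled '(text)'.
my @child_tags = map { ref($_) ? $_->tag : '(text)' } $tree->content_list;
print join(', ', @child_tags), "\n";

$tree->delete;
```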


      Perl is Huffman encoded by design.

        I guess that it was my fault. Please see my update above. I understand perfectly that we all cherish our programs and don't want others to modify them as if there were something wrong with them.

        I really don't want you to take it the wrong way, and I am sorry if I made you feel bad; that was not my intention. My intention was definitely not what you thought it was, and I was not commenting on your code. I was merely trying to prove that it was a bug, but from an HTML specification point of view.

        Put another way, had the converted HTML with the DOCTYPE at its end also passed validation, as the original did, I would probably have said that it was not a bug.