mldvx4 has asked for the wisdom of the Perl Monks concerning the following question:
Greetings.
Perhaps this is a bug in HTML Tidy. One of my scripts was fed some HTML today which had passed through HTML Tidy but nonetheless crashed HTML::TreeBuilder::XPath. Below is a stripped-down sample. The following Perl script produces the error message "Can't locate object method "as_XML_indented" via package " trololo " (perhaps you forgot to load " trololo "?) at ./script.pl line 12." and does not proceed through the rest of the script.
#!/usr/bin/perl

use HTML::TreeBuilder::XPath;

use strict;
use warnings;

my $tree = HTML::TreeBuilder::XPath->new_from_file(\*DATA);

for my $body ($tree->findnodes('//body')) {
    for my $element ($body->detach_content) {
        print $element->as_XML_indented;
    }
}

print "\n";
print "OK\n";

exit(0);

__DATA__
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content=
"HTML Tidy for HTML5 for Linux version 5.6.0" />
<title></title>
</head>
<body>
<p>foo</p>
<p>bar</p>
trololo
</body>
</html>
Since the HTML seems to be valid, having just passed through HTML Tidy, I would have expected as_XML_indented to have just plowed through it, either rendering it as XML or at least not stopping. A work-around has been to wrap it in an eval,
#!/usr/bin/perl

use HTML::TreeBuilder::XPath;

use strict;
use warnings;

my $tree = HTML::TreeBuilder::XPath->new_from_file(\*DATA);

for my $body ($tree->findnodes('//body')) {
    for my $element ($body->detach_content) {
        # Trap the failed method call so the loop can continue
        eval {
            print $element->as_XML_indented;
        };
        if ($@) {
            print STDERR qq(\n),$@,qq(\n);
            print STDERR qq(Failed HTML.\n);
        }
    }
}

print "\n";
print "OK\n";

exit(0);

__DATA__
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content=
"HTML Tidy for HTML5 for Linux version 5.6.0" />
<title></title>
</head>
<body>
<p>foo</p>
<p>bar</p>
trololo
</body>
</html>
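For comparison, here is a minimal sketch of an alternative to the eval. It rests on an assumption suggested by the error message: the content list appears to mix element objects with bare text strings, and the loose text " trololo " arrives as a plain Perl string, which is why Perl looks for a package of that name. The helper render_node() is a hypothetical name introduced only for this illustration, and the sketch uses new_from_content with an inline string instead of the DATA handle.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# render_node() is a hypothetical helper, not part of any module.
# Assumption: items in the content list are either element objects
# or plain strings (loose text), so ref() is tested before any
# method call is attempted.
sub render_node {
    my ($node) = @_;
    return ref $node
        ? $node->as_XML_indented    # element object: serialize as XML
        : "$node\n";                # plain string: pass the text through
}

my $tree = HTML::TreeBuilder::XPath->new_from_content(<<'HTML');
<html><head><title></title></head>
<body> <p>foo</p> <p>bar</p> trololo </body></html>
HTML

for my $body ($tree->findnodes('//body')) {
    print render_node($_) for $body->detach_content;
}

print "OK\n";
$tree->delete;
```

Note that the text branch prints the string as-is; if the output has to be well-formed XML, characters such as < and & in loose text would still need escaping.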
I'm not sure how to interpret the HTML5 spec. However, the HTML4 spec seems to indicate that the loose text ought to have been wrapped in a block element of some kind.
So if I may tap your collective wisdom,
General comments and advice also welcome.
Replies are listed 'Best First'.
Re: Seemingly Valid HTML which crashes HTML::TreeBuilder::XPath
by choroba (Cardinal) on Nov 10, 2023 at 11:40 UTC
by mldvx4 (Friar) on Nov 10, 2023 at 13:13 UTC
by hippo (Archbishop) on Nov 10, 2023 at 14:40 UTC
Re: Seemingly Valid HTML which crashes HTML::TreeBuilder::XPath
by Corion (Patriarch) on Nov 10, 2023 at 11:15 UTC
by mldvx4 (Friar) on Nov 10, 2023 at 11:38 UTC