in reply to [Solved] Testing generated HTML

When I need to work with HTML tables, I usually reach for HTML::TableExtract. If you need to test the whole HTML document, I'd use XML::LibXML, which can load HTML as well as XML. It's less tolerant of poorly written HTML than other libraries, but as you generate the HTML yourself, I'd say that's an advantage.

my $html = 'XML::LibXML'->load_html(string => \$generated_html);
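
A minimal sketch of how the loaded document might then be checked with XPath, assuming the generated page is HTML 4 style markup that libxml2's HTML parser accepts. The sample markup and every variable name other than $generated_html are made up for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical stand-in for the page produced by the code under test.
my $generated_html = join "\n",
    '<html><head><title>Demo</title></head>',
    '<body><h1>Report</h1>',
    '<table><tr><td>a</td><td>b</td></tr></table>',
    '<p><a href="/next">Next</a></p></body></html>';

my $dom = XML::LibXML->load_html(string => \$generated_html);

# XPath queries pull out the pieces a test would assert on.
my $heading = $dom->findvalue('//h1');
my @cells   = map { $_->textContent } $dom->findnodes('//table//td');
my @links   = map { $_->getAttribute('href') } $dom->findnodes('//a[@href]');

print "heading: $heading\n";
print "cells: @cells\n";
print "links: @links\n";
```

The extracted values can be passed straight to is() or is_deeply() from Test::More.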

($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

Re^2: Testing generated HTML
by davies (Monsignor) on Feb 21, 2016 at 19:21 UTC

    As the page develops, it will have links and the like that I will want to test. I am already testing for things like headings, shown in my OP, so I have been trying the XML approach. I regret to report that I have made no further progress.

    The XML documentation mentions possible problems with HTML, especially with ampersands. The HTML I have so far contains none, but still failed (HTML parser error : Tag nav invalid <nav class="navbar navbar-inverse navbar-fixed-top">). This is something I have cargo culted in from the Bootstrap project. I saved my HTML to file and passed it through validator.w3.org, which reported no errors. I therefore set the "recover" parameter to 2 as suggested by the docs. This led to:

    use XML::LibXML;
    my $parser = XML::LibXML->new(recover => 2);
    my $xmltree = $parser->parse_html_string($html);
    my @nodes = $xmltree->getElementsByTagName('h1');

    Unfortunately, the @nodes array is empty, even though the tests I have working along the lines of the snippet in my OP are passing and the header is visible in the HTML. I then tried the "reader" module, thus:

    use XML::LibXML::Reader;
    my $reader = XML::LibXML::Reader->new(string => $html, recover => 2);
    while ($reader->read) {
        processNode($reader);
    }
    sub processNode {
        my $reader = shift;
        printf "%d %d %s %s\n", ($reader->depth, $reader->nodeType, $reader->name, $reader->value);
    }

    This starts off well enough, but crashes (I'm showing only the last printed info):

    7 8 #comment The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags
    Entity: line 21: parser error : Opening and ending tag mismatch: link line 20 and head
    </head>
    ^

    I promise you there is no mismatch on the head tag, although there are "meta" and "link" tags between the last reported line and the closing head tag. Again I am having problems with the documentation, as https://metacpan.org/pod/distribution/XML-LibXML/lib/XML/LibXML/Parser.pod gives no information that I can see on how to get data out of the object. I suspect that there are things in the HTML that are beyond the powers of the XML suite even though they are validated. But not being able to see how to check means that I am far from sure.

    Any suggestions would be most welcome.

    Regards,

    John Davies

      Unfortunately, libxml2's HTML Parser doesn't support HTML5. If you want to use XML::LibXML, you need to switch to XHTML.

      XML::LibXML::Reader is a pull parser. It's used to process large XML documents that don't fit into memory. It interpreted the document as XML and didn't find a closing tag for the link element (as it's not needed in HTML). The documentation doesn't mention how to tell it to process HTML instead of XML, but I guess it doesn't support HTML5, either.
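
      A small sketch of the difference, assuming two made-up head fragments: one with the unclosed link that HTML allows, one with the self-closed form that XML requires. The helper parses_as_xml is hypothetical and only wraps load_xml in an eval.

      ```perl
      use strict;
      use warnings;
      use XML::LibXML;

      # Hypothetical fragments: the first leaves <link> unclosed, as HTML allows.
      my $html_style  = '<head><link rel="stylesheet" href="a.css"></head>';
      my $xhtml_style = '<head><link rel="stylesheet" href="a.css"/></head>';

      # load_xml dies on markup that is not well-formed, so trap the error.
      sub parses_as_xml {
          my ($markup) = @_;
          return eval { XML::LibXML->load_xml(string => $markup); 1 } ? 1 : 0;
      }

      printf "HTML-style link:  %s\n", parses_as_xml($html_style)  ? 'ok' : 'rejected';
      printf "XHTML-style link: %s\n", parses_as_xml($xhtml_style) ? 'ok' : 'rejected';
      ```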

      See HTML::HTML5::Parser for an alternative (I haven't tried it myself).
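
      A rough sketch of what that could look like, going by the module's documented parse_string method, which returns an XML::LibXML document. The sample page is made up, and local-name() is used in the XPath so the query does not depend on which namespace the elements end up in.

      ```perl
      use strict;
      use warnings;
      use HTML::HTML5::Parser;

      # Hypothetical HTML5 page using the <nav> element that libxml2 rejected.
      my $html5 = '<!DOCTYPE html><html><head><title>Demo</title></head>'
                . '<body><nav class="navbar"><a href="/home">Home</a></nav>'
                . '<h1>Report</h1></body></html>';

      my $doc = HTML::HTML5::Parser->new->parse_string($html5);

      # Match on the local name so the query works with or without a namespace.
      my ($h1)  = $doc->findnodes('//*[local-name()="h1"]');
      my ($nav) = $doc->findnodes('//*[local-name()="nav"]');

      my $heading   = $h1  ? $h1->textContent          : undef;
      my $nav_class = $nav ? $nav->getAttribute('class') : undef;

      print "heading: $heading\n"     if defined $heading;
      print "nav class: $nav_class\n" if defined $nav_class;
      ```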

        Unfortunately, libxml2's HTML Parser doesn't support HTML5. If you want to use XML::LibXML, you need to switch to XHTML.

        Another solution might be to switch to Polyglot Markup. This is valid HTML5 which is also well-formed XML, so you get the best of both worlds. It was all the rage a few years back, but you don't seem to see it mentioned much nowadays.
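
        A minimal sketch of the idea, assuming a made-up polyglot page: every element is explicitly closed and the root declares the XHTML namespace, so the XML parser in XML::LibXML can load it, nav element included. The prefix x is an arbitrary name registered only for the XPath queries.

        ```perl
        use strict;
        use warnings;
        use XML::LibXML;
        use XML::LibXML::XPathContext;

        # Hypothetical polyglot page: HTML5 doctype, XHTML namespace, void
        # elements written in self-closing form.
        my $polyglot = '<!DOCTYPE html>'
            . '<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">'
            . '<head><meta charset="utf-8"/><title>Demo</title>'
            . '<link rel="stylesheet" href="style.css"/></head>'
            . '<body><nav class="navbar"><a href="/home">Home</a></nav>'
            . '<h1>Report</h1></body></html>';

        # Parsed as XML, so the HTML parser's lack of HTML5 support does not matter.
        my $dom = XML::LibXML->load_xml(string => $polyglot);

        # Elements live in the XHTML namespace; bind it to a prefix for XPath.
        my $xpc = XML::LibXML::XPathContext->new($dom);
        $xpc->registerNs(x => 'http://www.w3.org/1999/xhtml');

        my $heading = $xpc->findvalue('//x:h1');
        my @links   = map { $_->getAttribute('href') } $xpc->findnodes('//x:nav//x:a');

        print "heading: $heading\n";
        print "links: @links\n";
        ```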