As the page develops, it will have links and the like that I will want to test. I am already testing for things like headings, as shown in my OP, so I have been trying the XML approach. I regret to report that I have got no further.
The XML::LibXML documentation mentions possible problems with HTML, especially with ampersands. The HTML I have so far contains none, but parsing still failed (HTML parser error : Tag nav invalid <nav class="navbar navbar-inverse navbar-fixed-top">). The nav element is something I have cargo-culted in from the Bootstrap project. I saved my HTML to a file and passed it through validator.w3.org, which reported no errors. I therefore set the "recover" parameter to 2, as suggested by the docs. This led to:
use XML::LibXML;
my $parser = XML::LibXML->new(recover => 2);
my $xmltree = $parser->parse_html_string($html);
my @nodes = $xmltree->getElementsByTagName('h1');
Unfortunately, the @nodes array is empty, even though the heading is visible in the HTML and my existing tests, written along the lines of the snippet in my OP, are passing. I then tried the "reader" module, thus:
use XML::LibXML::Reader;
my $reader = XML::LibXML::Reader->new(string => $html, recover => 2);
while ($reader->read) {
    processNode($reader);
}
sub processNode {
    my $reader = shift;
    printf "%d %d %s %s\n", ($reader->depth,
                             $reader->nodeType,
                             $reader->name,
                             $reader->value);
}
This starts off well enough, but crashes (I'm showing only the last printed info):
7 8 #comment The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags
Entity: line 21: parser error : Opening and ending tag mismatch: link line 20 and head
</head>
^
I promise you there is no mismatch on the head tag, although there are "meta" and "link" tags between the last reported line and the closing head tag. Again I am having problems with the documentation, as https://metacpan.org/pod/distribution/XML-LibXML/lib/XML/LibXML/Parser.pod gives no information that I can see on how to get data out of the parsed object. I suspect that there are things in the HTML that are beyond the powers of the XML suite even though the page validates. But as I cannot see how to check this, I am far from sure.
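The kind of check I have in mind is to serialise whatever tree the parser recovers and then query it with XPath. What follows is only a minimal sketch under stated assumptions: the $html here is a made-up stand-in rather than my real page, and I am assuming that toStringHTML and findnodes are the right calls for inspecting the result.

```perl
use strict;
use warnings;
use XML::LibXML;

# Stand-in markup (an assumption, not my real page): HTML5 with a <nav> element.
my $html = <<'HTML';
<!DOCTYPE html>
<html>
<head><title>Demo</title></head>
<body>
<nav class="navbar navbar-inverse navbar-fixed-top"></nav>
<h1>Hello</h1>
</body>
</html>
HTML

# recover => 2 asks the parser to recover from errors without reporting them.
my $parser = XML::LibXML->new(recover => 2);
my $doc    = $parser->parse_html_string($html);

# Serialise the recovered tree to see what the parser actually built.
print $doc->toStringHTML;

# Query the tree with XPath and report what was found.
my @headings = $doc->findnodes('//h1');
printf "found %d h1 element(s)\n", scalar @headings;
print $_->textContent, "\n" for @headings;
```

If the serialised output is missing the heading, that would at least tell me the problem lies in parsing rather than in the way I am querying the tree.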
Any suggestions would be most welcome.
Regards,
John Davies