davies has asked for the wisdom of the Perl Monks concerning the following question:

X: I am writing something that uses Dancer2 and Template::Toolkit to produce HTML. I want to write tests along the lines of "is the fourth column of the third row of the second table what I expect?".

Y: I am using HTML::TreeBuilder. The documentation for this is sending me off on a yak shaving exercise that is getting me frustrated. So:

While you could access the content of a tree by writing code that says "access the 'src' attribute of the root's first child's seventh child's third child", you're more likely to have to scan the contents of a tree, looking for whatever nodes, or kinds of nodes, you want to do something with. The most straightforward way to look over a tree is to "traverse" it; an HTML::Element method ($h->traverse) is provided for this purpose
looks helpful. Until you read the relevant section, which says:
Lengthy discussion of HTML::Element's unnecessary and confusing traverse method has been moved to a separate file: HTML::Element::traverse

So I go there, to find this:

or you can just be simple and clear (and not have to understand the calling format for traverse) by writing a sub that traverses the tree by just calling itself:
    {
        my $counter = 'x0000';
        sub give_id {
            my $x = $_[0];
            $x->attr('id', $counter++) unless defined $x->attr('id');
            foreach my $c ($x->content_list) {
                give_id($c) if ref $c; # ignore text nodes
            }
        };
        give_id($start_node);
    }
See, isn't that nice and clear?

No, it's foul and opaque. I can't see what the purpose of it is, nor can I see how it achieves its purpose nor can I see how to hack this to give me table 2, row 3, column 4. I do have some working code that looks like:

    my $tree = HTML::TreeBuilder->new_from_content($html);
    $tree->elementify();
    my $tagmap = $tree->tagname_map();
    ok('Matrix' eq $$tagmap{'h2'}[1]{'_content'}[0], "Got correct title (Matrix)");

but every time I look at one of these elements, I get things like '_parent' => $VAR1->{'_parent'}{'_parent'}{'_parent'}{'_parent'}{'_parent'}, leading me to suspect that every element object contains a reworked copy of the entire HTML tree and getting me no closer to what I want.

As I have indicated, I've tried reading the docs but have found them unhelpful. I've looked for external tutorials without great success (although I've seen lots of suggestions for using other modules, investigation of which has also cost time without making progress). Am I using a sensible tool (and if not, what should I use)? How should I be using it to do something I did not expect to be difficult?

Regards,

John Davies

Update: a combination of the answers with some help from Berends has got me working. The CSS and other definitions I had brought in from Bootstrap used <meta ...> tags that don't play nicely with XML. However, a simple change to <meta .../> (and the same for link tags) was enough for the XML parser to play nicely with me. shmem's advice to use XPath also works well, so now my code looks like:
    func textfromxml($xp, $xpath) {
        my $nodeset = $xp->find($xpath);
        my ($text) = XML::XPath::XMLParser::as_string(($nodeset->get_nodelist)[0]) =~ />(.*)</;
        return $text;
    }

    use XML::XPath;
    use XML::XPath::XMLParser;

    my $xp = XML::XPath->new(xml => $html);
    my $text = textfromxml($xp, '/html/body/div/div/h2[2]');
    ok('Matrix' eq $text, "Got correct title (Matrix)");
(snippage hath occurred; I may have blundered) and all tests pass with indexing done in the XPath. Thanks all.
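
For the original HTML::TreeBuilder route, here is a minimal sketch of the "table 2, row 3, column 4" lookup using look_down rather than traverse or tagname_map. The sample HTML is hypothetical, and the sketch assumes the tables are not nested (look_down searches all descendants, so a nested table would also be counted). The '_parent' entries in a dump are each element's reference back to its parent, which is why Data::Dumper prints them as back-references rather than as copies of the tree.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample page: two tables, the second with three rows of four cells.
my $html = <<'HTML';
<html><body>
<table><tr><td>ignored</td></tr></table>
<table>
<tr><td>r1c1</td><td>r1c2</td><td>r1c3</td><td>r1c4</td></tr>
<tr><td>r2c1</td><td>r2c2</td><td>r2c3</td><td>r2c4</td></tr>
<tr><td>r3c1</td><td>r3c2</td><td>r3c3</td><td>r3c4</td></tr>
</table>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# In list context, look_down returns every matching element in document order.
my @tables = $tree->look_down(_tag => 'table');
my @rows   = $tables[1]->look_down(_tag => 'tr');        # second table
my @cells  = $rows[2]->look_down(_tag => qr/^t[dh]$/);   # third row, td or th cells

my $cell_text = $cells[3]->as_text;                       # fourth column
print "$cell_text\n";

$tree->delete;   # release the tree's parent/child references
```

In a test file the last step would become something like ok($cell_text eq 'r3c4', ...) in the style of the snippets above.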

Replies are listed 'Best First'.
Re: Testing generated HTML
by choroba (Cardinal) on Feb 21, 2016 at 17:16 UTC
    When I need to work with HTML tables, I usually reach for HTML::TableExtract. If you need to test the whole HTML, I'd use XML::LibXML which can load HTML as well as XML. It's less tolerant of poorly written HTML than other libraries, but as the HTML is generated by you, I'd say that's an advantage.

    my $html = 'XML::LibXML'->load_html(string => \$generated_html);
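
Once the document is loaded, the table cell can be addressed with an XPath expression. What follows is only a sketch under assumptions: the sample markup is hypothetical and deliberately uses only pre-HTML5 tags, and the XPath assumes td cells that sit directly inside their tr.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical generated page: two tables, the second with three rows of four cells.
my $generated_html = <<'HTML';
<html><body>
<table><tr><td>ignored</td></tr></table>
<table>
<tr><td>r1c1</td><td>r1c2</td><td>r1c3</td><td>r1c4</td></tr>
<tr><td>r2c1</td><td>r2c2</td><td>r2c3</td><td>r2c4</td></tr>
<tr><td>r3c1</td><td>r3c2</td><td>r3c3</td><td>r3c4</td></tr>
</table>
</body></html>
HTML

my $dom = XML::LibXML->load_html(string => $generated_html);

# (//table)[2] selects the second table in document order;
# tr[3] and td[4] are 1-based positions among sibling elements.
my $cell = $dom->findvalue('(//table)[2]//tr[3]/td[4]');
print "$cell\n";
```

findvalue returns the string value of the matched node, so the result can be compared directly in a test.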

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

      As the page develops, it will have links & the like that I will want to test. I am already testing for things like headings, shown in my OP, so I have been trying the XML approach. I regret to report that I've been getting no farther.

      The XML documentation mentions possible problems with HTML, especially with ampersands. The HTML I have so far contains none, but still failed (HTML parser error : Tag nav invalid <nav class="navbar navbar-inverse navbar-fixed-top">). This is something I have cargo culted in from the Bootstrap project. I saved my HTML to file and passed it through validator.w3.org, which reported no errors. I therefore set the "recover" parameter to 2 as suggested by the docs. This led to:

      use XML::LibXML;
      my $parser = XML::LibXML->new(recover => 2);
      my $xmltree = $parser->parse_html_string($html);
      my @nodes = $xmltree->getElementsByTagName('h1');

      Unfortunately, the @nodes array is empty, even though the tests I have working along the lines of the snippet in my OP are passing and the header is visible in the HTML. I then tried the "reader" module, thus:

      use XML::LibXML::Reader;
      my $reader = XML::LibXML::Reader->new(string => $html, recover => 2);
      while ($reader->read) {
          processNode($reader);
      }
      sub processNode {
          my $reader = shift;
          printf "%d %d %s %s\n", ($reader->depth, $reader->nodeType, $reader->name, $reader->value);
      }

      This starts off well enough, but crashes (I'm showing only the last printed info):

      7 8 #comment The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags
      Entity: line 21: parser error : Opening and ending tag mismatch: link line 20 and head
      </head>
      ^

      I promise you there is no mismatch on the head tag, although there are "meta" and "link" tags between the last reported line and the closing head tag. Again I am having problems with the documentation, as https://metacpan.org/pod/distribution/XML-LibXML/lib/XML/LibXML/Parser.pod gives no information that I can see on how to get data out of the object. I suspect that there are things in the HTML that are beyond the powers of the XML suite even though they are validated. But not being able to see how to check means that I am far from sure.

      Any suggestions would be most welcome.

      Regards,

      John Davies

        Unfortunately, libxml2's HTML Parser doesn't support HTML5. If you want to use XML::LibXML, you need to switch to XHTML.

        XML::LibXML::Reader is a pull parser. It's used to process large XML documents that don't fit into memory. It interpreted the document as XML and didn't find a closing tag for the link element (as it's not needed in HTML). The documentation doesn't mention how to tell it to process HTML instead of XML, but I guess it doesn't support HTML5, either.

        See HTML::HTML5::Parser for an alternative (I haven't tried it myself).

Re: Testing generated HTML
by shmem (Chancellor) on Feb 21, 2016 at 21:06 UTC
    X: I am writing something that uses Dancer2 and Template::Toolkit to produce HTML. I want to write tests along the lines of "is the fourth column of the third row of the second table what I expect?".

    I'd use XML::XPath for that. XPath expressions are straightforward: the element hierarchy is the XPath itself, and elements can be indexed.
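
A minimal sketch of that idea follows. The document is hypothetical and is assumed to be well-formed XML with no default namespace declared, since XML::XPath parses XML rather than HTML.

```perl
use strict;
use warnings;
use XML::XPath;

# Hypothetical well-formed page: two tables, the second with three rows of four cells.
my $xhtml = <<'XML';
<html><body>
<table><tr><td>ignored</td></tr></table>
<table>
<tr><td>r1c1</td><td>r1c2</td><td>r1c3</td><td>r1c4</td></tr>
<tr><td>r2c1</td><td>r2c2</td><td>r2c3</td><td>r2c4</td></tr>
<tr><td>r3c1</td><td>r3c2</td><td>r3c3</td><td>r3c4</td></tr>
</table>
</body></html>
XML

my $xp = XML::XPath->new(xml => $xhtml);

# The path mirrors the element hierarchy; [n] is a 1-based index among siblings.
my $cell = $xp->findvalue('/html/body/table[2]/tr[3]/td[4]');
print "$cell\n";
```

findvalue returns an object that stringifies to the node's text, so interpolating it (or comparing it with eq) is enough for a test.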

    perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'