saunderson has asked for the wisdom of the Perl Monks concerning the following question:

The version of HTML::TreeBuilder::XPath used is 0.11.

Hi,

I can't figure out why the following script doesn't return anything. The script builds an HTML::TreeBuilder::XPath object from the page http://docstore.mik.ua/orelly/perl4/cook/ch22_07.htm, which is fetched via LWP::UserAgent. It then tries to fetch a node using an absolute XPath, and fails.

After repeated analysis of http://docstore.mik.ua/orelly/perl4/cook/ch22_07.htm I'm sure that the script should output 22.6.1. Problem. Instead, the script dies at line 18. Maybe one of you experts can help me find the error in my reasoning?

Thanks a lot for your effort and let the wisdom be with you

best regards

s.
#!/usr/bin/perl

use strict;
use warnings;

use HTML::TreeBuilder::XPath;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new(agent => "Mozilla/5.0");
my $req = HTTP::Request->new(GET => 'http://docstore.mik.ua/orelly/perl4/cook/ch22_07.htm');
my $res = $ua->request($req);

die("error") unless $res->is_success;

my $xp = HTML::TreeBuilder::XPath->new_from_content($res->content);

my @node = $xp->findnodes_as_strings("/html/body/table[1]/tr/td[2]/div/a/h3");
die("node doesn't exist") if $#node == -1; # Line 18

print "$_\n" foreach (@node);

Re: can't extract node with HTML::TreeBuilder::XPath
by tobyink (Canon) on Jul 29, 2012 at 19:54 UTC

    Per spec, the xpath is wrong. Given the following HTML:

    <table> <tr> <td>Foo</td> </tr> </table>

    The correct xpath to select the table cell is along the lines of //table/tbody/tr/td. Yes, there's an invisible <tbody> element in there! A standards-compliant HTML parser will always insert the <tbody> tag for you if you miss it out.
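One way to sidestep the question of whether a given parser inserts <tbody> is to use the descendant axis (//) between the table and the cell. The following is a minimal sketch, assuming HTML::TreeBuilder::XPath is installed; it reuses the table markup from the example above, and the variable names are only illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder::XPath;

# Same markup as the example above: no explicit <tbody>.
my $html = '<table> <tr> <td>Foo</td> </tr> </table>';

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# '//' between table and td matches the cell whether or not the
# parser placed a <tbody> (or anything else) between them.
my @cells = map { $_->as_text } $tree->findnodes('//table//td');

print "$_\n" for @cells;

# Free the tree explicitly; HTML::Element trees contain circular references.
$tree->delete;
```

The trade-off is that //table//td also matches cells of nested tables, so on a real page the expression usually needs a predicate or a more specific starting point.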

    The /a/h3 part of the xpath is an interesting feature too. In HTML 4.x, <h3> is not a permitted child of <a>. What exactly to do when encountering such an element is undefined. Some parsers may close the <a> element early so that the <h3> ends up as a sibling of it rather than the child of it.

    But under HTML 5 rules, <h3> is a permitted child of <a>. What does HTML::TreeBuilder do? Who knows!? HTML::TreeBuilder's documentation is pretty vague.

    This is exactly the sort of reason I maintain HTML::HTML5::Parser, which is a fork of a third-party non-CPAN HTML5 parser, ported to run on top of XML::LibXML.

    FWIW, this works for me...

    use 5.010;
    use PerlX::MethodCallWithBlock;
    use Web::Magic -quotelike => 'web';

    my @headings = web <http://docstore.mik.ua/orelly/perl4/cook/ch22_07.htm>
        -> assert_success
        -> querySelectorAll("h3")
        -> map { $_->textContent };

    say $headings[0];
    perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'

      I was not aware that the HTML 4.x specs are so strict. I thought that Chrome shows me XPaths based on the HTML file as it is, and does not create something new that is not found in the data just to be compliant with the HTML standards.

      So my interim conclusion is that HTML::TreeBuilder doesn't care about any specs and just analyses the underlying HTML code. That is straightforward for me, as someone who doesn't care about any specs :) . But I get your point. Specs are essential to have a common basis, so an HTML parser written with the specs in mind is always preferable.

      Thanks for your detailed explanation and for the pointer to HTML::HTML5::Parser.

      What does HTML::TreeBuilder do? Who knows!?

      I KNOW! It tells you to read the source, how awful :)

      htmltreexpather.pl works rather well to spit out xpaths that TreeBuilder::XPath will like :)
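To see which paths exist in the tree that HTML::TreeBuilder actually built, one can walk from a node up to the root and join the tag names. The following is a minimal sketch of that idea, not the htmltreexpather.pl script itself; the markup and the helper name path_of are hypothetical and chosen only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder;

# Hypothetical markup: no <tbody>, and an <h3> nested inside an <a>.
my $html = '<table><tr><td><a name="x"><h3>Title</h3></a></td></tr></table>';

my $tree = HTML::TreeBuilder->new_from_content($html);

# Build a simple absolute path (tag names only, no positions) for a node,
# using the tree exactly as HTML::TreeBuilder constructed it.
sub path_of {
    my ($node) = @_;
    # lineage() returns the ancestors starting with the parent,
    # so reverse it to go from the root down to the node.
    my @chain = (reverse($node->lineage), $node);
    return '/' . join '/', map { $_->tag } @chain;
}

# look_down in list context returns every matching element.
my @paths = map { path_of($_) } $tree->look_down(_tag => 'h3');

print "$_\n" for @paths;

$tree->delete;
```

Comparing the printed path with the one a browser reports shows where the two trees differ, for example in the <tbody> level or in the nesting of <a> and <h3>.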

        What does HTML::TreeBuilder do? Who knows!?

        I KNOW! It tells you to read the source, how awful :)
        I second that. A spec-compatible HTML::TreeBuilder::XPath that works with the XPaths extracted with a common browser would definitely be a simplification....
Re: can't extract node with HTML::TreeBuilder::XPath
by ww (Archbishop) on Jul 30, 2012 at 01:07 UTC
    Legitimate question; questionable site.

    The whole O'Reilly CD catalog? and the publisher mis-spelled? Highly suspect.

      You're right. I stumbled over it rather by accident while searching for a Perl-specific XPath explanation, and thought: why not test my code with this site :)

      I hope that linking this site is not misinterpreted...
Re: can't extract node with HTML::TreeBuilder::XPath
by saunderson (Novice) on Jul 29, 2012 at 19:31 UTC
    Meanwhile I resolved the inconsistency. Further investigation yields the XPath expression //table[1]/tr/td[2]/a/div/h3[\@class='sect2'] (the \@ escapes the @ inside a double-quoted Perl string). The reason for the error was that I relied too heavily on the output of 'XPath Helper' (a Chrome extension). Is there any Linux-based browser that could help to extract the XPath of a website element correctly?

      I find CSS selectors or relative XPath expressions easier to work with. Anchoring the expression on a "nearby" element with an id attribute is usually sufficient to get a short XPath (or CSS) expression.
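A relative XPath anchored on an id could look like the following. This is a minimal sketch, assuming HTML::TreeBuilder::XPath is installed; the page fragment and the id values "nav" and "content" are made up for illustration and are not taken from the site discussed above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder::XPath;

# Hypothetical page fragment; the ids are invented for this example.
my $html = <<'HTML';
<html><body>
<div id="nav"><h3>Menu</h3></div>
<div id="content"><div class="wrapper"><h3 class="sect2">22.6.1. Problem</h3></div></div>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Anchor on the element with a known id, then use '//' so the exact
# nesting between the anchor and the heading does not matter.
my @headings = map { $_->as_text }
    $tree->findnodes('//div[@id="content"]//h3[@class="sect2"]');

print "$_\n" for @headings;

$tree->delete;
```

Because the expression does not spell out every level from /html downwards, it keeps working when the parser's tree differs from the browser's in the intermediate elements.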