way has asked for the wisdom of the Perl Monks concerning the following question:

Hello

I'm using HTML::TreeBuilder::XPath to extract data from an HTML page, but I don't fully understand how it works. Basically, I want to get the value inside each "<div class="here">", one outer div at a time. I wrote an example based on the documentation, but it doesn't work; see below:

use HTML::TreeBuilder::XPath;
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content(do { local($/); <DATA> });
for my $result ($tree->findnodes(q{/html/body/div})) {
    print $result->findvalue(q{//div[@class="here"]});
    print "<br>".("-" x 120)."<br>";
}
__DATA__
<html>
<body>
<div>
<div class="here">this's the value</div>
</div>
<div>
<div class="here">this's the value</div>
</div>
<div>
<div class="here">this's the value</div>
</div>
<div>
<div class="here">this's the value</div>
</div>
<div>
<div class="here">this's the value</div>
</div>
<div>
<div class="here">this's the value</div>
</div>
</body>
</html>

It prints this:

this's the valuethis's the valuethis's the valuethis's the valuethis's + the valuethis's the value ---------------------------------------------------------------------- +-------------------------------------------------- this's the valuethis's the valuethis's the valuethis's the valuethis's + the valuethis's the value ---------------------------------------------------------------------- +-------------------------------------------------- this's the valuethis's the valuethis's the valuethis's the valuethis's + the valuethis's the value ---------------------------------------------------------------------- +-------------------------------------------------- this's the valuethis's the valuethis's the valuethis's the valuethis's + the valuethis's the value ---------------------------------------------------------------------- +-------------------------------------------------- this's the valuethis's the valuethis's the valuethis's the valuethis's + the valuethis's the value ---------------------------------------------------------------------- +-------------------------------------------------- this's the valuethis's the valuethis's the valuethis's the valuethis's + the valuethis's the value ---------------------------------------------------------------------- +--------------------------------------------------

So my workaround was this:

use HTML::TreeBuilder::XPath;
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content(do { local($/); <DATA> });
for my $result ($tree->findnodes(q{/html/body/div})) {
    my $x = HTML::TreeBuilder::XPath->new;
    $x->parse($result->as_HTML);
    print $x->findvalue(q{//div[@class="here"]});
    print "<br>".("-" x 17)."<br>";
}
__DATA__
<html>
<body>
<div>
<div class="here">this's the value</div>
</div>
<div>
<div class="here">this's the value</div>
</div>
<div>
<div class="here">this's the value</div>
</div>
<div>
<div class="here">this's the value</div>
</div>
<div>
<div class="here">this's the value</div>
</div>
<div>
<div class="here">this's the value</div>
</div>
</body>
</html>

It prints this:

this's the value
-----------------
this's the value
-----------------
this's the value
-----------------
this's the value
-----------------
this's the value
-----------------
this's the value
-----------------

But I don't think this is pretty code. What's the correct way to do this, and what's wrong with the first example?

Thank you in advance

Replies are listed 'Best First'.
Re: HTML and Xpath
by mirod (Canon) on Nov 06, 2008 at 15:04 UTC

    There is no need for two loops; you can select the elements you want with a single XPath expression. The code should look like this:

    for my $result ($tree->findnodes(q{/html/body/div/div[@class="here"]})) {
        print $result->as_text;
        print "\n<br>".("-" x 120)."<br>\n";
    }

    Instead of as_text you may want to use as_HTML, or, if you want the inner HTML of the element (the HTML without the enclosing tag), something like print map { ref $_ ? $_->as_HTML : $_ } $result->content_list;

      Yes, I understand, but to keep the example simple I left out some details, so a single XPath expression would not work in my real case. Anyway, thank you very much; all the comments are always helpful.

        OK, so what you are looking for is findnodes_as_strings, which indeed doesn't look like it is documented. I have to fix that. It returns a list of strings, one per node returned by the query. Thanks.

        Update: duh! That method exists (and is documented) only in the development version of the module. I am uploading it to PAUSE right away; sorry for the inconvenience.
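As a rough illustration of the method mirod mentions, here is a minimal sketch. It assumes an installed version of HTML::TreeBuilder::XPath that ships findnodes_as_strings (per the update above, it was only in the development version at the time of the thread). The two-div document and the variable names are made up for the example.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample document, shortened from the one in the question.
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content(
    q{<html><body>}
  . q{<div><div class="here">first</div></div>}
  . q{<div><div class="here">second</div></div>}
  . q{</body></html>}
);

# One string per matching node, rather than a single concatenated value.
my @values = $tree->findnodes_as_strings(q{/html/body/div/div[@class="here"]});

print "$_\n" for @values;

$tree->delete;    # release the tree when done
```

Because each match comes back as a separate list element, the caller can add its own separators instead of getting one run-together string as with findvalue.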

Re: HTML and Xpath
by Anonymous Monk on Nov 06, 2008 at 15:02 UTC
    Maybe an oversight, maybe a conscious decision, but since each $result is an HTML::Element, you may need to detach it from the tree. This seems to work:
    #!/usr/bin/perl --
    use strict;
    use warnings;
    use HTML::TreeBuilder::XPath;
    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse_content(<<'__HTML__');
    <html>
    <body>
    <div>
    <div class="here">this's the value</div>
    </div>
    <div>
    <div class="here">this's the value</div>
    </div>
    <div>
    <div class="here">this's the value</div>
    </div>
    <div>
    <div class="here">this's the value</div>
    </div>
    <div>
    <div class="here">this's the value</div>
    </div>
    <div>
    <div class="here">this's the value</div>
    </div>
    </body>
    </html>
    __HTML__
    for my $result ($tree->findnodes(q{/html/body/div})) {
        print "$result ", $result->detach(),"\n<br>\n";
        print $result->findvalue(q{//div[@class="here"]});
        print "\n<br>\n\n";
    }
    __END__
    HTML::Element=HASH(0x1b7897c) HTML::Element=HASH(0x1b787e4)
    <br>
    this's the value
    <br>

    HTML::Element=HASH(0x1b78a48) HTML::Element=HASH(0x1b787e4)
    <br>
    this's the value
    <br>

    HTML::Element=HASH(0x1b78afc) HTML::Element=HASH(0x1b787e4)
    <br>
    this's the value
    <br>

    HTML::Element=HASH(0x1b78bb0) HTML::Element=HASH(0x1b787e4)
    <br>
    this's the value
    <br>

    HTML::Element=HASH(0x1b78c64) HTML::Element=HASH(0x1b787e4)
    <br>
    this's the value
    <br>

    HTML::Element=HASH(0x1b78d18) HTML::Element=HASH(0x1b787e4)
    <br>
    this's the value
    <br>
      Instead of detaching (which modifies the tree), you could clone:
      print $result->clone->findvalue(q{//div[@class="here"]});

        Yes, I tested it and both methods work. "clone" is really a good option too, thank you.

      Instead of detaching, you could also
      print $result->findvalue(q{./div[@class="here"]});
      or
      print $result->findvalue(q{div[@class="here"]});

      Thank you very much. As you said, detaching it from the tree works perfectly.

      Thank you again
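To make the difference between the two suggestions above concrete, here is a small sketch comparing clone and detach on a shortened, made-up two-div document. The counts of outer divs before and after are only there to show what each call does to the original tree; the variable names are illustrative.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample document, shortened from the one in the question.
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content(
    q{<html><body>}
  . q{<div><div class="here">first</div></div>}
  . q{<div><div class="here">second</div></div>}
  . q{</body></html>}
);

my @outer  = $tree->findnodes(q{/html/body/div});
my $before = @outer;    # number of outer divs found

# clone: each copy has no parent, so "//" only sees that copy.
my @cloned_values;
for my $div (@outer) {
    my $copy  = $div->clone;
    my $value = $copy->findvalue(q{//div[@class="here"]});
    push @cloned_values, "$value";
    $copy->delete;      # free the temporary copy
}
my $after_clone = () = $tree->findnodes(q{/html/body/div});

# detach: the nodes are removed from the original tree.
$_->detach for @outer;
my $after_detach = () = $tree->findnodes(q{/html/body/div});

print "values from clones: @cloned_values\n";
print "outer divs before: $before, after clone: $after_clone, ",
      "after detach: $after_detach\n";

$_->delete for @outer;
$tree->delete;
```

Cloning leaves the original tree searchable afterwards, at the cost of copying every subtree; detaching avoids the copy but empties the body of the original document.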

Re: HTML and Xpath
by ikegami (Patriarch) on Nov 06, 2008 at 20:38 UTC

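    //div[@class="here"]
    is short for
    /descendant-or-self::node()/child::div[@class="here"]
    which, with a non-positional predicate like this one, selects the same nodes as
    /descendant::div[@class="here"]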
    The leading "/" means the root. Since you want a relative path rather than an absolute path, what you want is
    descendant::div[@class="here"]

    In this case,
    child::div[@class="here"]
    would also do, and that can be abbreviated to
    div[@class="here"]

    detach and clone work because the detached/cloned element becomes the root of the detached/cloned tree. In that situation, there is no difference between a relative path and an absolute path. However, they are needlessly expensive, they prevent you from looking at the node's ancestors (since you removed them), and detach destroys the tree. I wouldn't use either of those solutions.

      Yes!!! It works perfectly, cheers!!
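The absolute versus relative distinction explained above can be sketched side by side. This is a minimal example on a shortened, made-up two-div document; the array names are illustrative, and the tree is left unmodified throughout.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample document, shortened from the one in the question.
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content(
    q{<html><body>}
  . q{<div><div class="here">first</div></div>}
  . q{<div><div class="here">second</div></div>}
  . q{</body></html>}
);

my (@absolute, @relative);
for my $outer ($tree->findnodes(q{/html/body/div})) {
    # Absolute path: the leading "/" starts at the document root, so every
    # outer div sees all matching inner divs in the whole document.
    my $abs = $outer->findvalue(q{//div[@class="here"]});
    push @absolute, "$abs";

    # Relative path: evaluated from $outer, so only its own descendants match.
    my $rel = $outer->findvalue(q{descendant::div[@class="here"]});
    push @relative, "$rel";
}

print "absolute: $_\n" for @absolute;
print "relative: $_\n" for @relative;

$tree->delete;
```

The relative form needs no clone or detach, and the context node keeps its parent, so its ancestors remain reachable from the same loop.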