HTML and Xpath

way has asked for the wisdom of the Perl Monks concerning the following question:

Hello

I'm using HTML::TreeBuilder::XPath to extract data from an HTML page, but I don't understand very well how it works. Basically, I want to get the value inside "<div class="here">", but row by row (one outer <div> at a time). I've made an example based on the documentation, but it doesn't work. See below:

use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content(do { local($/); <DATA>});

for my $result ($tree->findnodes(q{/html/body/div})) {

    print $result->findvalue(q{//div[@class="here"]});
    print "<br>".("-" x 120)."<br>";

}

__DATA__
<html>
    <body>
        <div>
            <div class="here">this's the value</div>
        </div>
        <div>
            <div class="here">this's the value</div>
        </div>
        <div>
            <div class="here">this's the value</div>
        </div>
        <div>
            <div class="here">this's the value</div>
        </div>
        <div>
            <div class="here">this's the value</div>
        </div>
        <div>
            <div class="here">this's the value</div>
        </div>
    </body>
</html>

It prints this:

this's the valuethis's the valuethis's the valuethis's the valuethis's
+ the valuethis's the value
----------------------------------------------------------------------
+--------------------------------------------------
this's the valuethis's the valuethis's the valuethis's the valuethis's
+ the valuethis's the value
----------------------------------------------------------------------
+--------------------------------------------------
this's the valuethis's the valuethis's the valuethis's the valuethis's
+ the valuethis's the value
----------------------------------------------------------------------
+--------------------------------------------------
this's the valuethis's the valuethis's the valuethis's the valuethis's
+ the valuethis's the value
----------------------------------------------------------------------
+--------------------------------------------------
this's the valuethis's the valuethis's the valuethis's the valuethis's
+ the valuethis's the value
----------------------------------------------------------------------
+--------------------------------------------------
this's the valuethis's the valuethis's the valuethis's the valuethis's
+ the valuethis's the value
----------------------------------------------------------------------
+--------------------------------------------------

So, my workaround was this:

use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content(do { local($/); <DATA>});

for my $result ($tree->findnodes(q{/html/body/div})) {

    my $x = HTML::TreeBuilder::XPath->new;
    $x->parse($result->as_HTML);
    print $x->findvalue(q{//div[@class="here"]});
    print "<br>".("-" x 17)."<br>";

}

__DATA__
<html>
    <body>
        <div>
            <div class="here">this's the value</div>
        </div>
        <div>
            <div class="here">this's the value</div>
        </div>
        <div>
            <div class="here">this's the value</div>
        </div>
        <div>
            <div class="here">this's the value</div>
        </div>
        <div>
            <div class="here">this's the value</div>
        </div>
        <div>
            <div class="here">this's the value</div>
        </div>
    </body>
</html>

It prints this:

this's the value
-----------------
this's the value
-----------------
this's the value
-----------------
this's the value
-----------------
this's the value
-----------------
this's the value
-----------------

But I don't think this is pretty code. What's the correct way to do this, and what's wrong with the first example?

Thank you in advance


Replies are listed 'Best First'.
Re: HTML and Xpath by mirod (Canon) on Nov 06, 2008 at 15:04 UTC
There is no need for 2 loops; you can select the elements you want with just a single XPath expression. The code should look like this:

for my $result ($tree->findnodes(q{/html/body/div/div[@class="here"]})) {
    print $result->as_text;
    print "\n<br>".("-" x 120)."<br>\n";
}

Instead of `as_text` you may want to use `as_HTML`, or, if you want the inner HTML of the element (the HTML without the enclosing tag), something like `print map { ref $_ ? $_->as_HTML : $_ } $result->content_list;`
Re^2: HTML and Xpath by way (Sexton) on Nov 06, 2008 at 16:05 UTC
Yes, I understand, but to avoid a complicated example I left out some details, so a single XPath expression would not apply to my real case. Anyway, thank you very much; all the comments are always helpful.
Re^3: HTML and Xpath by mirod (Canon) on Nov 06, 2008 at 16:43 UTC
OK, so what you are looking for is `findnodes_as_strings`, which indeed doesn't look like it is documented. I have to fix that. It returns a list of strings, one per node returned by the query. Thanks.

Update: duh! That method exists (and is documented) only in the development version of the module. I am uploading it to PAUSE right away; sorry for the inconvenience.
Re: HTML and Xpath by Anonymous Monk on Nov 06, 2008 at 15:02 UTC
Maybe an oversight, maybe a conscious decision, but since each $result is an HTML::Element, you may need to detach it from the tree. This seems to work:

#!/usr/bin/perl --
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content(<<'__HTML__');
<html>
    <body>
        <div>
            <div class="here">this's the value</div>
        </div>
        <div>
            <div class="here">this's the value</div>
        </div>
        <div>
            <div class="here">this's the value</div>
        </div>
        <div>
            <div class="here">this's the value</div>
        </div>
        <div>
            <div class="here">this's the value</div>
        </div>
        <div>
            <div class="here">this's the value</div>
        </div>
    </body>
</html>
__HTML__

for my $result ($tree->findnodes(q{/html/body/div})) {
    print "$result ", $result->detach(),"\n<br>\n";
    print $result->findvalue(q{//div[@class="here"]});
    print "\n<br>\n\n";
}
__END__
HTML::Element=HASH(0x1b7897c) HTML::Element=HASH(0x1b787e4)
<br>
this's the value
<br>

HTML::Element=HASH(0x1b78a48) HTML::Element=HASH(0x1b787e4)
<br>
this's the value
<br>

HTML::Element=HASH(0x1b78afc) HTML::Element=HASH(0x1b787e4)
<br>
this's the value
<br>

HTML::Element=HASH(0x1b78bb0) HTML::Element=HASH(0x1b787e4)
<br>
this's the value
<br>

HTML::Element=HASH(0x1b78c64) HTML::Element=HASH(0x1b787e4)
<br>
this's the value
<br>

HTML::Element=HASH(0x1b78d18) HTML::Element=HASH(0x1b787e4)
<br>
this's the value
<br>
Re^2: HTML and Xpath by Anonymous Monk on Nov 06, 2008 at 15:09 UTC
Instead of detaching (which modifies the tree), you could clone:

print $result->clone->findvalue(q{//div[@class="here"]});
Re^3: HTML and Xpath by way (Sexton) on Nov 06, 2008 at 15:56 UTC
Yes, I tested it and both methods work. "clone" is a really good option too, thank you.
Re^2: HTML and Xpath by Anonymous Monk on Aug 17, 2009 at 11:10 UTC
Instead of detaching, you could also use

print $result->findvalue(q{./div[@class="here"]});

or

print $result->findvalue(q{div[@class="here"]});
Re^2: HTML and Xpath by way (Sexton) on Nov 06, 2008 at 15:53 UTC
Thank you very much. As you said, detaching it from the tree works perfectly. Thank you again.
Re: HTML and Xpath by ikegami (Patriarch) on Nov 06, 2008 at 20:38 UTC
`//div[@class="here"]` is short for `/descendant::div[@class="here"]`. The leading "/" means the root. Since you want a relative path rather than an absolute path, what you want is `descendant::div[@class="here"]`.

In this case, `child::div[@class="here"]` would also do, and that can be abbreviated to `div[@class="here"]`.

`detach` and `clone` work because the detached/cloned element becomes the root of the detached/cloned tree. In that situation, there is no difference between a relative path and an absolute path. However, they are needlessly expensive, they prevent you from looking at the node's ancestors (since you removed them), and `detach` destroys the tree. I wouldn't use either of those solutions.
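
For readers who want to try the relative-path approach described in this reply, here is a minimal sketch. It assumes the same nesting as the markup in the question; the sample values ("first", "second", "third") and the variable names `$outer` and `@values` are made up for illustration and are not from the original post.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Sample markup modeled on the question; the text values are hypothetical.
my $html = <<'__HTML__';
<html>
    <body>
        <div><div class="here">first</div></div>
        <div><div class="here">second</div></div>
        <div><div class="here">third</div></div>
    </body>
</html>
__HTML__

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content($html);

my @values;
for my $outer ($tree->findnodes(q{/html/body/div})) {
    # No leading "/": the path is evaluated relative to $outer,
    # so only descendants of this particular outer div can match.
    push @values, $outer->findvalue(q{descendant::div[@class="here"]});
}

# One value per outer div, one per line.
print "$_\n" for @values;

# Release the tree once it is no longer needed.
$tree->delete;
```

Because the tree is neither detached nor cloned, each `$outer` still has its parent, so ancestor lookups should remain possible inside the loop.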
Re^2: HTML and Xpath by way (Sexton) on Nov 15, 2008 at 20:12 UTC
Yes!!! It works perfectly, cheers!!