initself has asked for the wisdom of the Perl Monks concerning the following question:

I am parsing a web page using HTML::TreeBuilder. I successfully traversed the page using look_down() to find all links with a 'class' attribute equal to "mailtext". Above each of these links in the tree is a link without any attributes. I'd like to retrieve the 'href' from that link by using look_up(). In other words, for each element I find that matches my criteria, I'd like to traverse up one link in the tree and retrieve the URL. However, when I print the new element using all_attr(), it appears to be the same element I originally found in the tree ("mailtext"), not the new element one up in the tree.
sub get_birthday {
    my $content = shift;
    my $tree = HTML::TreeBuilder->new;
    $tree->parse($content);
    my @elements = $tree->look_down('_tag' => 'a');

    for my $element (@elements) {
        my $class_tag = $element->attr_get_i('class');
        if ($class_tag eq "mailtext") {
            my $subject = $element->as_trimmed_text();
            my $subject_url = $element->attr_get_i('href');
            print "<a href='$subject_url'>$subject</a>\n";
            my $touchstone = $element->look_up('_tag' => 'a')->attr_get_i('href');
            print $element->look_up('_tag' => 'a')->all_attr() . "\n";
        }
    }
}

# Sample Code
<tr valign="top">
  <td width=15 bgcolor="E8F1FA"><input type="checkbox" name="checker120041225" value="120041225"></td>
  <td width=50 bgcolor="E8F1FA">
    <span class="text">Apr 27, 2006 11:29 PM</span>
  </td>
  <td width=150 bgcolor="E8F1FA">
    <table width="150" border="0" cellspacing="0" cellpadding="0" class="imagetable">
      <tr><td>
        <a href="http://profile.myspace.com/index.cfm?fuseaction=user.viewprofile&friendID=3847879">
          <img src="http://myspace-646.vo.llnwd.net/00242/64/63/242453646_s.jpg" align="absmiddle">
        </a>
        <span class="text">
          <a href="http://profile.myspace.com/index.cfm?fuseaction=user.viewprofile&friendID=3847879">JASON FEDDY</a>
        </span>
        <DIV style="width:80px;height:20px;" ID="UserDataNode2" CLASS="DataPoint=OnlineNow;UserID=3847879;"></div>
      </td></tr>
    </table>
  </td>
  <td width=30 bgcolor="E8F1FA">Replied&nbsp;</td>
  <td width=220 bgcolor="E8F1FA"><a class="mailtext"href="http://mail.myspace.com/index.cfm?fuseaction=mail.readmessage&messageID=120041225&type=inbox&status=new&Mytoken=8727234F-137A-9B6B-2CAEC273502E282F6393477">Hello mate</a>
  </td>
</tr>
The result of $touchstone should be:
http://profile.myspace.com/index.cfm?fuseaction=user.viewprofile&friendID=3847879
as this is the URL directly above $subject_url. Am I misusing the function? Perhaps once you are working with a single element, you no longer have access to the entire tree? Does $element then become the 'parent'?

Re: look_up() in HTML::Element Not Traversing As Expected
by GrandFather (Saint) on Apr 29, 2006 at 03:31 UTC

    If we clean up that HTML so that most of the cruft is removed we get:

    <tr valign="top">
      <td>
        <table class="imagetable">
          <tr>
            <td>
              <a href="URL1"><img src="IMG1"></a>
              <a href="URL2">JASON FEDDY</a>
            </td>
          </tr>
        </table>
      </td>
      <td>Replied&nbsp;</td>
      <td><a class="mailtext" href="URL3">Hello mate</a></td>
    </tr>

    Now, look at the HTML and notice that you are finding URL3, but want to navigate back to URL1. look_up can't do that. URL1 is not above URL3 in the element tree, it's in a completely different branch!

    You need to sit back and think about your rules for finding URL1 given URL3. There are a bunch of ways it could be done, but it depends a great deal on how the structure of the HTML can change. I don't think it is even worth giving a solution in this particular case because you really need to accommodate possible changes in the HTML and I have no idea what those may be. In the simplest case you can just navigate up using parent, then work your way back down indexing into the contents array. But that is mighty fragile!
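The "up using parent, then back down" idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not a robust solution: it assumes each message sits in its own outer `<tr>`, and that the wanted link is the last `<a>` in that row with an `href` but no `class`. The helper name `profile_url_for` is hypothetical; `look_up`, `look_down`, `attr` and `new_from_content` are the regular HTML::Element / HTML::TreeBuilder methods.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: given the "mailtext" link, climb to the table row that
# holds it, then search back down that row for the sender's profile link.
# Assumption: one message per outer <tr>, and the profile link is the last
# <a> in the row that has an href but no class attribute.
sub profile_url_for {
    my ($mail_link) = @_;

    # look_up() tests the starting element first, then each ancestor in turn.
    my $row = $mail_link->look_up(_tag => 'tr') or return;

    # look_down() scans the row and all of its descendants in document order.
    my @candidates = $row->look_down(
        _tag => 'a',
        sub { defined $_[0]->attr('href') && !defined $_[0]->attr('class') },
    );
    return @candidates ? $candidates[-1]->attr('href') : undef;
}

# The cleaned-up sample row from this reply, wrapped in a table.
my $html = <<'HTML';
<table><tr>
  <td><table class="imagetable"><tr><td>
    <a href="URL1"><img src="IMG1"></a>
    <a href="URL2">JASON FEDDY</a>
  </td></tr></table></td>
  <td>Replied</td>
  <td><a class="mailtext" href="URL3">Hello mate</a></td>
</tr></table>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);
for my $link ($tree->look_down(_tag => 'a', class => 'mailtext')) {
    my $url = profile_url_for($link);
    print defined $url ? "$url\n" : "no profile link found\n";
}
$tree->delete;
```

On the simplified row this picks URL2, the text link next to the image link. Matching on attributes rather than indexing into the content list is a little less brittle, but it still breaks if the row layout changes, so the selection rule should be adapted to whatever is stable in the real pages.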


    DWIM is Perl's answer to Gödel
      I want URL2, does that change anything? How can you tell where a branch ends and where a branch begins? I traversed the entire code to get all URL elements, so at one point they were all accessible to me using look_down().

        Nope. URL2 is still "up a few, over a couple, down a couple and over a couple". It's a sibling of URL1. Originally it was in a span element, but for illustration purposes that doesn't matter.

        I hope you see, BTW, the virtue of cleaning up the sample data to the point where we are talking about only the relevant structure and simple data? You ought to do this sort of thing pretty much whenever you have a problem to solve: remove the cruft and concentrate on the real problem.

        The real problem here is that the connection between the data you are matching and the data you want is rather tenuous, so you have to make sure you understand exactly what that relationship is before you can write code to implement it. I think it is useful to think in terms of parent/child and sibling relationships here.


Re: look_up() in HTML::Element Not Traversing As Expected
by GrandFather (Saint) on Apr 29, 2006 at 02:26 UTC

    It's hard to envisage what the HTML you are dealing with looks like from the code snippet. It would help us provide an answer if you could include sample HTML with the sample code. It would be even better if it were wrapped up like this:

    use strict;
    use warnings;
    use HTML::TreeBuilder;

    my $str = do {local $/; <DATA>};
    get_birthday ($str);

    sub get_birthday {
        my $content = shift;
        my $tree = HTML::TreeBuilder->new;
        $tree->parse($content);
        my @elements = $tree->look_down('_tag' => 'a');

        for my $element (@elements) {
            my $class_tag = $element->attr_get_i('class') || '';
            if ($class_tag eq "mailtext") {
                my $subject = $element->as_trimmed_text();
                my $subject_url = $element->attr_get_i('href');
                print "<a href='$subject_url'>$subject</a>\n";
                print $element->look_up('_tag' => 'a')->all_attr() . "\n";
            }
        }
    }

    __DATA__
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
    <html lang="en">
    <head>
    </head>
    <body>
    <p><a>Metalink<a class="mailtext" href='http://erewhon.com'>Erewhon</a></a></p>
    </body>
    </html>

    Interestingly this generates a 'Use of uninitialized value' warning in line 16 then prints:

    <a href='http://erewhon.com'>Erewhon</a>
    5/8

    which looks like a stringified hash to me. That may be a clue, but without something resembling your actual HTML it's a bit hard to say.
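Two things seem to be going on in that output, and a small sketch may make them easier to see. As documented for HTML::Element, look_up() considers the starting element itself before any of its ancestors, and all_attr() returns a key/value list (including internal attributes whose names start with an underscore), so concatenating it with a string forces scalar context. The markup and variable names below are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use Scalar::Util qw(refaddr);

# Made-up markup: two sibling links, the second one carrying class="mailtext".
my $tree = HTML::TreeBuilder->new_from_content(
    q{<p><a href="URL1">first</a> <a class="mailtext" href="URL3">second</a></p>}
);
my $link = $tree->look_down(_tag => 'a', class => 'mailtext');

# look_up() checks the starting element before any ancestor, so a search for
# an <a> that starts on an <a> hands back that very same element.
my $found = $link->look_up(_tag => 'a');
print refaddr($found) == refaddr($link) ? "same element\n" : "different element\n";

# To skip the starting element, begin the search from its parent instead.
# Here no ancestor is an <a>, so look_up() finds nothing.
my $above = $link->parent->look_up(_tag => 'a');
print defined $above ? "found an enclosing link\n" : "no enclosing link\n";

# all_attr() returns key/value pairs; assign them to a hash to inspect them,
# skipping the internal attributes whose names begin with "_".
my %attr = $link->all_attr;
print "$_ = $attr{$_}\n" for grep { !/^_/ } sort keys %attr;

$tree->delete;
```

This would explain why the original code appears to get the "mailtext" element back: the search for an `a` is satisfied by the element it started from.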


      I think your HTML would generate 'Use of uninitialized value' because you don't have any URLs above 'http://erewhon.com'.