johnwashburn has asked for the wisdom of the Perl Monks concerning the following question:

I am having trouble traversing an HTML tree created with HTML::TreeBuilder using the right() function of HTML::Element.

Why is the return value of HTML::Element->right() not a reference to an HTML::Element object, but is instead a scalar?

I was hoping to parse a set of web pages to extract some genealogical data and ran into the traversal problem using the right() function of HTML::Element.

I use the look_down() function to find the HTML nodes of the tree which are the section titles of the page. From these nodes I was hoping to "walk" to the right with the HTML::Element->right() function.

I expected the function, HTML::Element->right(), to return a reference to an HTML::Element object. Instead it returns a scalar.

Here is the code:
#!/usr/bin/perl -w
#
# ********************************************
use strict;
use Carp;
use Switch;
use Data::Dumper;
use Cwd;
use HTTP::Request;
use HTTP::Request::Common;
use HTTP::Status;
use LWP;
use LWP::UserAgent;
use HTML::TreeBuilder;

my $FamilyPageURL = "http://e-familytree.net/F248/F248347.htm";
my $ua = LWP::UserAgent->new;
if (defined $ua) {
    $ua->timeout(5);
    my $HTTP_Response = $ua->get($FamilyPageURL);
    my $HTTP_Status = $HTTP_Response->message;
    if ($HTTP_Response->is_success) {
        my $HTTP_FamilyPage = $HTTP_Response->content;

        # Set up parser to parse this HTML Page
        # See: http://search.cpan.org/dist/HTML-Parser/Parser.pm
        # See: http://search.cpan.org/~petek/HTML-Tree-3.23/lib/HTML/TreeBuilder.pm
        # See: http://search.cpan.org/~petek/HTML-Tree-3.23/lib/HTML/Element.pm
        # See: http://search.cpan.org/~petek/HTML-Tree-3.23/lib/HTML/Tree/Scanning.pod
        #
        # HTML::TreeBuilder is a subclass of HTML::Parser.
        # Set the Parser portions to control how the HTML is parsed into a tree
        my $PageAsTree = HTML::TreeBuilder->new();
        $PageAsTree->backquote(1);
        $PageAsTree->empty_element_tags(1);
        $PageAsTree->utf8_mode(1);
        $PageAsTree->xml_mode(1);
        $PageAsTree->warn(1);
        $PageAsTree->ignore_elements(qw(script style));

        # Parse the HTML into a tree, signal the end of processing and remove the ability to parse more
        $PageAsTree->parse_content($HTTP_FamilyPage);
        $PageAsTree->elementify();
        $PageAsTree->normalize_content();

        my $HtmlHead = $PageAsTree->look_down('_tag', 'head');
        my $HtmlBody = $PageAsTree->look_down('_tag', 'body');

        # Traverse and examine the tree for Sections of the page (Husband, Wife, Children, Notes)
        my @PageSections = $HtmlBody->look_down(
            sub {
                return (($_[0]->tag() eq 'div')
                    and ($_[0]->attr('class') =~ m/^secTitle$/i));
            }
        ); # searches for Husband, Wife and Children Sections.

        # transfer the HTML to the right of this node of HTML up until the next Section Title is found
        my %FamilyMember;
        foreach my $Node (@PageSections) {
            my $SectionTitle = $Node->content->[0];
            my @SectionNodes = ();

            my @NodesOnRight = $Node->right;
            my $NumNodesOnTheRight = @NodesOnRight;
            print "\$NumNodesOnTheRight = $NumNodesOnTheRight\n @NodesOnRight\n";
            my $RightNode1 = $Node->right;
            print "The node to the right of " . $Node->as_HTML . "is [" . $RightNode1 . "]";
            print "[" . $RightNode1->as_HTML . "]";
            print "\n";

            for( my $RightNode = $Node->right;
                 (defined $RightNode) && ($RightNode->attr('class') !~ m/^secTitle$/i);
                 $RightNode = $RightNode->right) {
                push @SectionNodes, $RightNode;
            }
            $FamilyMember{$SectionTitle} = \@SectionNodes;
        }
    }
}

I was hoping to use the for loop:
for( my $RightNode = $Node->right;
     (defined $RightNode) && ($RightNode->attr('class') !~ m/^secTitle$/i);
     $RightNode = $RightNode->right) {
}

to do the actual traversal of the HTML tree, but the loop failed on the expression $RightNode->attr('class') because $RightNode is not a reference to an HTML::Element.

The lines above the loop:
my @NodesOnRight = $Node->right;
my $NumNodesOnTheRight = @NodesOnRight;
print "\$NumNodesOnTheRight = $NumNodesOnTheRight\n @NodesOnRight\n";
my $RightNode1 = $Node->right;
print "The node to the right of " . $Node->as_HTML . "is [" . $RightNode1 . "]";
print "[" . $RightNode1->as_HTML . "]";
print "\n";

demonstrate that the function works when it is called in list context, but not when it is called in scalar context.

The result of executing the above code is:
$NumNodesOnTheRight = 51
HTML::Element=HASH(0x30f5344) Other Spouses: HTML::Element=HASH(0x30f86d4) HTML::Element=HASH(0x30f8724) Father: HTML::Element=HASH(0x30f87a4) Mother: HTML::Element=HASH(0x30f8834) HTML::Element=HASH(0x30f8894) HTML::Element=HASH(0x30f8904) HTML::Element=HASH(0x30f89f4) Born: Died: Bef 21 Nov 1717 Father: HTML::Element=HASH(0x30f8bb4) Mother: HTML::Element=HASH(0x30f8c44) HTML::Element=HASH(0x30f8ca4) HTML::Element=HASH(0x30f8d14) HTML::Element=HASH(0x30f8e04) Born: Bet 1695 and 1723 Died: Bet 1748 and 1808 Wife: HTML::Element=HASH(0x30f8f44) HTML::Element=HASH(0x30f8fa4) HTML::Element=HASH(0x30f9084) Born: Abt 1702 Died: 14 Oct 1783 at Bridgewater, Plymouth, MA Husband: HTML::Element=HASH(0x30f91c4) HTML::Element=HASH(0x30f9224) HTML::Element=HASH(0x30f9304) Born: Abt 1704 Died: HTML::Element=HASH(0x30f93a4) HTML::Element=HASH(0x30f9484) Born: Bef 1710 Died: 1793 at Marlborough, MA Husband: HTML::Element=HASH(0x30fcfc4) HTML::Element=HASH(0x30fd024) HTML::Element=HASH(0x30fd104) Born: Abt 1710 Died: Husband: HTML::Element=HASH(0x30fd234) HTML::Element=HASH(0x30fd294) HTML::Element=HASH(0x30fd304)
The node to the right of <div class="secTitle">HUSBAND</div> is [ ]
Can't call method "as_HTML" without a package or object reference at TraverseTree.pl line 75.

The documentation for the HTML::Element->right() function (http://search.cpan.org/~petek/HTML-Tree-3.23/lib/HTML/Element.pm) states:
In scalar context: returns the node that's the immediate right sibling of $h. If $h is the rightmost (or only) child of its parent (or has no parent), then this returns undef.

In list context: returns all the nodes that're the right siblings of $h, starting with the leftmost. If $h is the rightmost (or only) child of its parent (or has no parent), then this returns empty-list.
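
The two documented contexts can be sketched with a small hand-built tree. This is only an illustration: the element names and text below are made up rather than taken from the real page, and the variable names ($parent, $title, $link) are hypothetical.

```perl
use strict;
use warnings;
use HTML::Element;

# Hypothetical parent whose content is: element, text, element.
my $parent = HTML::Element->new('div');
my $title  = HTML::Element->new('div', class => 'secTitle');
$title->push_content('HUSBAND');
my $link   = HTML::Element->new('a', href => '#');
$link->push_content('John Leonard');
$parent->push_content($title, ' Father: ', $link);

my $next = $title->right;   # scalar context: only the immediate right sibling
my @all  = $title->right;   # list context: every right sibling

# The immediate sibling here is a plain string, so ref() is false for it.
print "scalar context: ", (ref $next ? $next->tag : "text '$next'"), "\n";
print "list context:   ", scalar(@all), " siblings\n";
```

In this sketch the immediate right sibling happens to be a text segment, so the scalar-context call returns a plain string rather than an object.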


Since there are 51 nodes of HTML to the right of the first section of the page ($Node), I expected
my $RightNode = $Node->right;
my $NextRightNode = $RightNode->right;

to work and was surprised when it did not.

What am I missing here?

Replies are listed 'Best First'.
Re: Traversing an HTMLTree with HTML:Element ->right
by wfsp (Abbot) on Jun 30, 2009 at 09:04 UTC
    Update: This is all wrong. :-( I misread the HTML.

    See my second attempt.

    This might help illustrate "siblings".

    #!/usr/bin/perl
    use warnings;
    use strict;
    use lib q{/www/lib};
    use SW::Debug;
    use HTML::TreeBuilder;

    my $content = do{local $/;<DATA>};
    my $t = HTML::TreeBuilder->new_from_content($content)
      or die qq{cant build tree};

    my $body = $t->look_down(_tag => q{body});
    my $p = $t->look_down(_tag => q{p});
    my @right = $p->right;
    print scalar @right, qq{ siblings\n};
    for my $ele (@right){
      $ele->dump;
      print q{-} x 20, qq{\n};
    }
    __DATA__
    <p id = "1">one</p>
    <p id = "2">two</p>
    <p id = "3">three</p>

    2 siblings
    <p id="2"> @0.1.1
      "two"
    ----------
    <p id="3"> @0.1.2
      "three"
    ----------
    Which I don't think is what you're after. In the
    foreach my $Node (@PageSections)
    loop you want to look down for divs with class = "vcard" and then look down into each of those to get the data you need.

    The "vcard" divs are children of the "secTitle" divs rather than siblings.

    HTH

      Here's another go. :-)
      #!/usr/bin/perl
      use warnings;
      use strict;
      use HTML::TreeBuilder;

      my $t = HTML::TreeBuilder->new_from_file(q{html/monk.html})
        or die qq{cant build tree};
      my $body = $t->look_down(_tag => q{body});
      my @PageSections = $body->look_down( _tag => q{div}, class => q{secTitle});
      my $i;
      for my $node (@PageSections){
        my $secTitle = $node->as_text;
        print qq{>>>>> $secTitle\n};
        my @right = $node->right;
        for my $ele (@right){
          if (ref $ele){
            last if(
              $ele->tag eq q{div} and
              $ele->attr(q{class}) and
              $ele->attr(q{class}) eq q{secTitle}
            );
            $ele->dump;
          }
          else{
            print $ele, qq{\n};
          }
        }
        print q{-} x 20, qq{\n};
      }
        Re: The "vcard" divs are children of the "secTitle" divs rather than siblings.

        Oh how I wish that were true.
        I think my problem is that my iteration with ->right() assumes the ->content array of the parent node contains only references to HTML::Element objects. This assumption is wrong: the ->content array of an HTML::Element is a mix of text scalars and references, as your code (with its use of if (ref $ele)) demonstrates.
        Here is a snippet of the Husband portion of the test page that shows the mix:
        <span class='BogusClasNameForThisExample'>
          Father: <a href="../F247/F247134.htm"> John Leonard </a>
          Mother: <a href="../F247/F247134.htm"> Sarah Leonard </a>
        </span>
        There are 4 elements in the ->content array of the span node:
        • Scalar text: Father
        • HTML::Element for <a href="../F247/F247134.htm">
        • Scalar text: Mother
        • HTML::Element for <a href="../F247/F247134.htm">
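
A minimal sketch of that mix, assuming a span built by hand with push_content() rather than parsed from the real page (so the whitespace handling of the parser is not reproduced here). The loop variable names are hypothetical.

```perl
use strict;
use warnings;
use HTML::Element;

# Build a span similar to the snippet above: text, <a>, text, <a>.
my $span = HTML::Element->new('span', class => 'BogusClasNameForThisExample');
for my $pair (['Father', 'John Leonard'], ['Mother', 'Sarah Leonard']) {
    my ($label, $name) = @$pair;
    my $link = HTML::Element->new('a', href => '../F247/F247134.htm');
    $link->push_content($name);
    $span->push_content(" $label: ", $link);
}

# ref() is true for HTML::Element children and false for text segments.
my @kinds = map { ref $_ ? 'element <' . $_->tag . '>' : "text '$_'" }
            $span->content_list;
print "$_\n" for @kinds;
```

Any code that walks content_list() or right() needs a ref() check like this before calling methods on a child.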
        I think I need to use ->objectify_text() and ->deobjectify_text(). objectify_text() should wrap each text segment in an HTML::Element, so iterating with the ->right() function should then work.

        I will let you know the results when I get home tonight.
Re: Traversing an HTMLTree with HTML:Element ->right
by wfsp (Abbot) on Jun 30, 2009 at 14:08 UTC
    This has been annoying me. :-)
    #!/usr/bin/perl
    use warnings;
    use strict;
    use Data::Dumper;
    $Data::Dumper::Indent = 1;
    use HTML::TreeBuilder;

    my $t = HTML::TreeBuilder->new_from_file(q{html/monk.html})
      or die qq{cant build tree};

    my $pre = $t->look_down( _tag => q{pre}, class => q{preElement} );
    my @divs = $pre->look_down( _tag => q{div} );

    my %table;
    my ($type, $record);
    for my $div (@divs){
      if ($div->attr(q{class}) eq q{secTitle}){
        $type = $div->as_text;
        next;
      }
      $record++;
      my @spans = $div->look_down( _tag => q{span}, class => qr/^x|f/ );
      for my $span (@spans){
        my $class = $span->attr(q{class});
        my $txt = $span->as_text;
        $table{$type}{$record}{$class} = $txt;
      }
      my $trailing_txt = trailing_text($div);
      $table{$type}{$record}{family_data} = $trailing_txt;
    }
    print Dumper \%table;

    sub trailing_text {
      my ($div) = @_;
      my @rights = $div->right;
      my @txt;
      for my $right (@rights){
        if (ref $right){
          last if $right->tag eq q{div};
          next if $right->tag eq q{br};
          my $t = $right->as_text;
          next unless $t =~ /\S/;
          push @txt, trim($t);
        }
        else{
          next unless $right =~ /\S/;
          push @txt, trim($right);
        }
      }
      return join(q{ }, @txt);
    }

    sub trim{
      for (@_){
        s/^\s+//;
        s/\s+$//;
        s/\s+/ /g;
      }
      return wantarray?@_:$_[0];
    }
    $VAR1 = {
      'WIFE' => {
        '2' => {
          'fn n' => 'Marjoram Washburn',
          'family_data' => 'Born: Died: Bef 21 Nov 1717 Father: Philip Washburn Mother: Elizabeth Irish'
        }
      },
      'CHILDREN' => {
        '6' => {
          'fn n' => 'Mary Leonard',
          'family_data' => 'Born: Bef 1710 Died: 1793 at Marlborough, MA Husband: Daniel Herrington'
        },
        '4' => {
          'fn n' => 'Elizabeth Leonard',
          'family_data' => 'Born: Abt 1702 Died: 14 Oct 1783 at Bridgewater, Plymouth, MA Husband: James Washburn'
        },
        '3' => {
          'fn n' => 'John Leonard',
          'family_data' => 'Born: Bet 1695 and 1723 Died: Bet 1748 and 1808 Wife: Anna Noble'
        },
        '7' => {
          'fn n' => 'Margene Leonard',
          'family_data' => 'Born: Abt 1710 Died: Husband: Nathaniel Pratt'
        },
        '5' => {
          'fn n' => 'Josiah Leonard',
          'family_data' => 'Born: Abt 1704 Died:'
        }
      },
      'HUSBAND' => {
        '1' => {
          'x-marriage-date' => '1699-11-2',
          'fn n' => 'Josiah Leonard',
          'family_data' => 'Other Spouses: Abigail Washburn Father: John Leonard Mother: Sarah Leonard',
          'x-gender' => 'Male',
          'x-death-date' => '1745-1-1',
          'x-death-location' => 'Bridgewater, Plymouth, MA',
          'x-marriage-location' => 'Bridgewater, Plymouth, MA'
        }
      }
    };
Re: Traversing an HTMLTree with HTML:Element ->right
by johnwashburn (Sexton) on Jul 17, 2009 at 11:19 UTC
    I hate problem discussions where the initial poster has obviously solved the initial problem, but failed to share the solution found. In an effort to NOT be that guy, here is the solution I eventually settled on.

    I would especially like to thank wfsp. Among other things, I could not resist replacing my scalar-only version of trim() with your more elegant version of trim().

    My main problem was the faulty assumption that HTML::Element->content_list is an array of references to HTML::Element. It is NOT. Moreover, the documentation for HTML::Element and HTML::TreeBuilder very clearly states that the array is a mix of references and scalars.

    If you want all the elements of the HTML::Element->content_list array to be references to HTML::Element, then you need to use the objectify_text() method.
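
A rough sketch of that idea, assuming a small hand-built span (not the real page, and not my final script): after objectify_text() every child is an object, so a scalar-context right() walk can call methods on each node. The text nodes become pseudo-elements tagged '~text'.

```perl
use strict;
use warnings;
use HTML::Element;

# Hypothetical span with alternating text and <a> children.
my $span = HTML::Element->new('span');
for my $pair (['Father', 'John Leonard'], ['Mother', 'Sarah Leonard']) {
    my ($label, $name) = @$pair;
    my $link = HTML::Element->new('a', href => '../F247/F247134.htm');
    $link->push_content($name);
    $span->push_content(" $label: ", $link);
}

$span->objectify_text;    # text segments become '~text' pseudo-elements

# Walk left to right; every node is now an object, so ->tag is safe.
my @tags;
for (my $node = ($span->content_list)[0]; defined $node; $node = $node->right) {
    push @tags, $node->tag;
}
print join(', ', @tags), "\n";

$span->deobjectify_text;  # turn the pseudo-elements back into plain strings
my $text_children = grep { !ref } $span->content_list;
print "$text_children plain text children restored\n";
```

The documentation notes that methods such as as_text() do not treat '~text' objects as text, which is why this sketch calls deobjectify_text() again once the walk is done.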

    I wanted the genealogical data to all be under the vcard so I could look_down to vcards during the main traversal/extraction. So I decided to re-structure the tree prior to the main traversal/extract. Re-structuring the tree before my main traversal made the main traversal simpler and more robust.

    But once you do some pre-traversal re-structuring, there is always more to do.

    For those interested, here is the solution I settled on: