agaffney has asked for the wisdom of the Perl Monks concerning the following question:

I've created a program that will parse an HTML file using HTML::Parser and generate a tree structure from it. Here is my current working code.

parsehtml.pl
============
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser ();

my $htmltree = [ { tag => 'document', content => [] } ];
my $node = $htmltree->[0]->{content};
my @prevnodes = ($htmltree);

sub start {
  my $tagname = shift;
  my $attr = shift;
  my $newnode = {};
  $newnode->{tag} = $tagname;
  foreach my $key(keys %{$attr}) {
    $newnode->{$key} = $attr->{$key};
  }
  $newnode->{content} = [];
  push @prevnodes, $node;
  push @{$node}, $newnode;
  $node = $newnode->{content};
}

sub end {
  my $tagname = shift;
  $node = pop @prevnodes;
}

sub text {
  my $text = shift;
  chomp $text;
  if($text ne '') {
    push @{$node}, $text;
  }
}

my $p = HTML::Parser->new( api_version => 3,
                           start_h => [\&start, "tagname, attr"],
                           end_h   => [\&end,   "tagname"],
                           text_h  => [\&text,  "dtext"] );
$p->parse_file("test.html");

test.html
=========
<table id="maintable" width="300">
  <tr>
    <td width="200">some content</td>
    <td width="100">more content</td>
  </tr>
</table>

Now for the next challenge. While walking the tree, I need to know where I am in the structure at any given node. I will pass along a value via CGI in the form of '0.0.2.1.2', which another script will translate as '$htmltree->[0]->{content}->[0]->{content}->[2]->{content}->[1]->{content}->[2]'. Using the above code, and the following code I wrote for walking the tree and generating HTML from it, how can I mark each output HTML tag with its position in the tree?

sub descend_htmltree {
  my $node = shift;
  my $withclickiness = shift || 0;
  foreach my $tmpnode (@{$node}) {
    if(ref($tmpnode) eq 'HASH') {
      my $nodeid = ""; # Magic code to generate node's position in tree
      $htmloutput .= "<div style='border: thin solid #bbbbbb' onDblClick=\"alert('you clicked $nodeid')\">" if($withclickiness);
      $htmloutput .= "<$tmpnode->{tag}";
      foreach(keys %{$tmpnode}) {
        $htmloutput .= " $_=\"$tmpnode->{$_}\"" if($_ ne 'tag' && $_ ne 'content');
      }
      $htmloutput .= ">";
      descend_htmltree($tmpnode->{content});
      $htmloutput .= "</$tmpnode->{tag}>";
      $htmloutput .= "</div>" if($withclickiness);
    } else {
      $htmloutput .= "$tmpnode";
    }
  }
}

sub htmltree_to_html {
  my $filename = shift || '';
  my $withclickiness = shift || 0;
  descend_htmltree($htmltree->[0]->{content}, $withclickiness);
  if($filename ne '') {
    open HTML, "> $filename" or die "Can't open $filename for HTML output";
    print HTML $htmloutput;
    close HTML;
  }
  return $htmloutput;
}
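
For reference, the translation of a position such as '0.0.2.1.2' into a node lookup might look roughly like the sketch below. The helper name node_at is hypothetical, and the tree is built by hand in the shape parsehtml.pl produces; a tree parsed from a real file may also contain whitespace-only text nodes, which would shift the indices.

```perl
use strict;
use warnings;

# Hypothetical helper: follow a dotted position such as '0.0.0.1' through the
# tree and return the node (hash ref or text string) found there, or undef.
sub node_at {
    my ($tree, $path) = @_;
    my @index = split /\./, $path;
    my $list  = $tree;                  # array ref currently being indexed
    my $found;
    while (defined(my $i = shift @index)) {
        $found = $list->[$i];
        return unless defined $found;   # no such position
        last unless @index;             # path used up: this is the node
        return unless ref($found) eq 'HASH';   # text nodes have no children
        $list = $found->{content};
    }
    return $found;
}

# Small hand-built tree (an assumption: no whitespace-only text nodes).
my $htmltree = [
    { tag => 'document', content => [
        { tag => 'table', id => 'maintable', content => [
            { tag => 'tr', content => [
                { tag => 'td', content => ['some content'] },
                { tag => 'td', content => ['more content'] },
            ] },
        ] },
    ] },
];

my $cell = node_at($htmltree, '0.0.0.1');
print "$cell->{tag}: $cell->{content}[0]\n";   # prints "td: more content"
```

Each number is simply an array index into the current content list, so the same dotted string can be produced while walking the tree and resolved again later.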

Replies are listed 'Best First'.
Re: tracking where I am in a tree structure
by Fletch (Bishop) on Jul 22, 2004 at 19:48 UTC

    You might want to look at HTML::TreeBuilder and HTML::Element which have a method $elt->address() which already does this.
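
A minimal sketch of what that could look like, assuming HTML::TreeBuilder is installed. The HTML is the poster's test.html inlined as a string; note that TreeBuilder adds implicit html, head and body elements, so the addresses it reports will not match the positions in the hand-rolled tree above.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $html = '<table id="maintable" width="300"><tr>'
         . '<td width="200">some content</td>'
         . '<td width="100">more content</td></tr></table>';

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;

# Print the address of every element below the root.
for my $elt ($tree->descendants) {
    printf "%-12s %s\n", $elt->address, $elt->tag;
}

# Called with an argument, address() goes the other way:
# from a position string back to the element at that position.
my $td    = $tree->look_down(_tag => 'td');
my $again = $tree->address($td->address);
my $width = $again->attr('width');
print "round trip found td with width=$width\n";

$tree->delete;
```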

      I've already got this code written and I don't have a lot of time to rewrite the thing. I'd like to just build in the functionality to my existing code.
Re: tracking where I am in a tree structure
by stvn (Monsignor) on Jul 22, 2004 at 21:49 UTC

    I think you will want to make the $withclickiness variable no longer optional in descend_htmltree. Since you are already defaulting it in htmltree_to_html, you really don't need to do so again; it will already be set, or 0.

    Also, I think you actually meant to pass it along in your recursive call to descend_htmltree; otherwise it would only take effect for the top-level nodes.

    Then, once this is all in place, add a third variable, which will be the node-id you are looking for. Each time you pass the $node_id to the next recursive call, it gets the $node_counter appended to it, which is just the position within the current child group. This should then produce the node_id you are looking for.

    Here is some code. I could not test it, as you supplied no test data to test against, but it should work. Let me know if you have any problems with it.

    sub descend_htmltree {
      my $node = shift;
      my $withclickiness = shift;
      my $node_id = shift;      # position of the parent; pass '0' from the top-level call
      my $node_counter = 0;     # index of the current child within @{$node}
      foreach my $tmpnode (@{$node}) {
        if(ref($tmpnode) eq 'HASH') {
          my $nodeid = "${node_id}.$node_counter"; # node's position in tree
          $htmloutput .= "<div style='border: thin solid #bbbbbb' onDblClick=\"alert('you clicked $nodeid')\">" if($withclickiness);
          $htmloutput .= "<$tmpnode->{tag}";
          foreach(keys %{$tmpnode}) {
            $htmloutput .= " $_=\"$tmpnode->{$_}\"" if($_ ne 'tag' && $_ ne 'content');
          }
          $htmloutput .= ">";
          # pass this node's own id down so the children can extend it
          descend_htmltree($tmpnode->{content}, $withclickiness, $nodeid);
          $htmloutput .= "</$tmpnode->{tag}>";
          $htmloutput .= "</div>" if($withclickiness);
        } else {
          $htmloutput .= "$tmpnode";
        }
        $node_counter++;        # count from 0 so the id matches the array index
      }
    }

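
A self-contained sketch of the same recursion, stripped of the HTML output so that the generated ids can be inspected on their own. The helper name list_positions and the hand-built tree are assumptions for illustration; the counter starts at 0 here so that each id maps directly onto array indices.

```perl
use strict;
use warnings;

# Hand-built tree in the shape parsehtml.pl produces
# (an assumption: whitespace-only text nodes are left out).
my $htmltree = [ { tag => 'document', content => [
    { tag => 'table', content => [
        { tag => 'tr', content => [
            { tag => 'td', content => ['some content'] },
            { tag => 'td', content => ['more content'] },
        ] },
    ] },
] } ];

# Hypothetical helper: return one "position tag" line per element below $node.
sub list_positions {
    my ($node, $node_id) = @_;
    my @lines;
    my $node_counter = 0;
    foreach my $tmpnode (@{$node}) {
        if (ref($tmpnode) eq 'HASH') {
            my $nodeid = "$node_id.$node_counter";
            push @lines, "$nodeid $tmpnode->{tag}";
            # children extend this node's id, not the parent's
            push @lines, list_positions($tmpnode->{content}, $nodeid);
        }
        $node_counter++;    # text nodes also occupy an array slot
    }
    return @lines;
}

my @positions = list_positions($htmltree->[0]->{content}, '0');
print "$_\n" for @positions;
# prints:
# 0.0 table
# 0.0.0 tr
# 0.0.0.0 td
# 0.0.0.1 td
```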
    On another note, I am not 100% sure your HTML, in particular the onClick part of the DIV tag, will work properly. The browser may see the entire contents of the DIV (it and all its subtrees) as part of the same HTML container.

    -stvn
      That was the type of thing I was trying to put together in my head, but for some reason it just wouldn't come together. That code seems to work perfectly, although I haven't tested it more than one level deep yet.

      As for the DIV problem, how would you suggest fixing that?
        That was the type of thing I was trying to put together in my head, but for some reason it just wouldn't come together.

        Recursive solutions are always harder to wrap your head around. They tend to require more "faith" than iterative solutions, and that is not something most programmers are usually accustomed to (all religious meanings aside, of course).

        As for the DIV problem, how would you suggest fixing that?

        Well, not having seen your output, I am not sure, but I suspect you could try putting the 'onClick' handler in the tag you are writing, rather than wrapping that tag. This may have the same effect when you are dealing with things like TABLE, TR and UL tags, though, since they too are container tags which sometimes "enclose" their descendants. The other option, and I am not sure whether this would be appropriate, is to wrap only your leaf nodes with the "onClick", since they are likely to be text nodes and the like (although things like BR and HR will fall into that category too). It is hard to say without either having some test data or knowing more about what you want the UI to do in the end.
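
        A rough sketch of the first option, with the handler written into the tag itself instead of a wrapping DIV. The helper name open_tag is hypothetical, and attribute values are not escaped, just as in the original code.

```perl
use strict;
use warnings;

# Hypothetical helper: build the opening tag for a node and, when
# $withclickiness is set, attach the double-click handler to that tag.
sub open_tag {
    my ($tmpnode, $nodeid, $withclickiness) = @_;
    my $html = "<$tmpnode->{tag}";
    foreach my $key (sort keys %{$tmpnode}) {
        next if $key eq 'tag' || $key eq 'content';
        $html .= qq{ $key="$tmpnode->{$key}"};
    }
    $html .= qq{ onDblClick="alert('you clicked $nodeid')"} if $withclickiness;
    return "$html>";
}

print open_tag({ tag => 'td', width => '200', content => [] }, '0.0.0.0', 1), "\n";
# prints: <td width="200" onDblClick="alert('you clicked 0.0.0.0')">
```

        With nested elements, a double-click on an inner tag would typically still reach the handlers on the enclosing tags as well, so this alone may not isolate a single node.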

        -stvn