I've created a program that will parse an HTML file using HTML::Parser and generate a tree structure from it. Here is my current working code.

parsehtml.pl
============
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser ();

my $htmltree = [ { tag => 'document', content => [] } ];
my $node = $htmltree->[0]->{content};
my @prevnodes = ($htmltree);

sub start {
    my $tagname = shift;
    my $attr = shift;
    my $newnode = {};
    $newnode->{tag} = $tagname;
    foreach my $key (keys %{$attr}) {
        $newnode->{$key} = $attr->{$key};
    }
    $newnode->{content} = [];
    push @prevnodes, $node;
    push @{$node}, $newnode;
    $node = $newnode->{content};
}

sub end {
    my $tagname = shift;
    $node = pop @prevnodes;
}

sub text {
    my $text = shift;
    chomp $text;
    if ($text ne '') {
        push @{$node}, $text;
    }
}

my $p = HTML::Parser->new(
    api_version => 3,
    start_h => [\&start, "tagname, attr"],
    end_h   => [\&end, "tagname"],
    text_h  => [\&text, "dtext"],
);
$p->parse_file("test.html");

test.html
=========
<table id="maintable" width="300">
  <tr>
    <td width="200">some content</td>
    <td width="100">more content</td>
  </tr>
</table>

Now for the next challenge. While walking the tree, I need to know the position of whatever node I am currently on. A value will be passed along via CGI in the form '0.0.2.1.2', which another script will translate to '$htmltree->[0]->{content}->[0]->{content}->[2]->{content}->[1]->{content}->[2]'. Using the above code, and the following code I wrote for walking the tree and generating HTML from it, how can I mark each outputted HTML tag with its position in the tree?
sub descend_htmltree {
    my $node = shift;
    my $withclickiness = shift || 0;
    foreach my $tmpnode (@{$node}) {
        if (ref($tmpnode) eq 'HASH') {
            my $nodeid = ""; # Magic code to generate node's position in tree
            $htmloutput .= "<div style='border: thin solid #bbbbbb' onDblClick=\"alert('you clicked $nodeid')\">" if ($withclickiness);
            $htmloutput .= "<$tmpnode->{tag}";
            foreach (keys %{$tmpnode}) {
                $htmloutput .= " $_=\"$tmpnode->{$_}\"" if ($_ ne 'tag' && $_ ne 'content');
            }
            $htmloutput .= ">";
            # Pass the flag down so nested tags are wrapped as well
            descend_htmltree($tmpnode->{content}, $withclickiness);
            $htmloutput .= "</$tmpnode->{tag}>";
            $htmloutput .= "</div>" if ($withclickiness);
        } else {
            $htmloutput .= "$tmpnode";
        }
    }
}

sub htmltree_to_html {
    my $filename = shift || '';
    my $withclickiness = shift || 0;
    descend_htmltree($htmltree->[0]->{content}, $withclickiness);
    if ($filename ne '') {
        open HTML, "> $filename" or die "Can't open $filename for HTML output";
        print HTML $htmloutput;
        close HTML;
    }
    return $htmloutput;
}
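
One possible way to get the position, offered as a sketch rather than a definitive answer: carry the parent's dotted position down through the recursion and append the index of each child. The names walk_with_path and node_at_path are hypothetical, and the hand-built $htmltree below only approximates what parsehtml.pl would produce for test.html (any whitespace-only text nodes are left out).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the tree built by parsehtml.pl from test.html.
my $htmltree = [
    {
        tag     => 'document',
        content => [
            {
                tag     => 'table',
                id      => 'maintable',
                content => [
                    {
                        tag     => 'tr',
                        content => [
                            { tag => 'td', content => ['some content'] },
                            { tag => 'td', content => ['more content'] },
                        ],
                    },
                ],
            },
        ],
    },
];

# Walk a content array and hand each element's dotted position to a callback.
# $prefix is the position of the node that owns this content array.
sub walk_with_path {
    my ($content, $prefix, $callback) = @_;
    for my $i (0 .. $#{$content}) {
        my $child = $content->[$i];
        my $path  = "$prefix.$i";
        $callback->($child, $path);
        walk_with_path($child->{content}, $path, $callback)
            if ref($child) eq 'HASH';
    }
}

# Translate a dotted path back into a node: the first index selects from the
# top-level array, every later index selects from the current node's content.
sub node_at_path {
    my ($tree, $path) = @_;
    my ($first, @rest) = split /\./, $path;
    my $node = $tree->[$first];
    for my $i (@rest) {
        return unless ref($node) eq 'HASH';   # text nodes have no children
        $node = $node->{content}->[$i];
    }
    return $node;
}

# Record every position, storing the tag name (or the text itself).
my %seen;
walk_with_path($htmltree->[0]->{content}, '0', sub {
    my ($node, $path) = @_;
    $seen{$path} = ref($node) eq 'HASH' ? $node->{tag} : $node;
});

print "$_ => $seen{$_}\n" for sort keys %seen;
```

In descend_htmltree the same idea would mean accepting an extra prefix argument, looping by index instead of by element, and setting $nodeid to "$prefix.$i" before recursing with that value. Note that text nodes occupy slots in the content arrays too, so any whitespace text that the parser keeps between tags will shift the indices in the real tree compared with this trimmed sample.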

In reply to tracking where I am in a tree structure by agaffney
