Hi.

I do a lot (and I mean A LOT) of HTML parsing, for various reasons. I use HTML::TreeBuilder and HTML::TokeParser for this.

What would be REALLY useful for me is a "flowchart", so to speak, of an HTML file that I am about to parse. I'll explain what I mean in a bit.

The thing is that my brain seems to process things visually much better than by thinking about HTML tags and arranging them in my head. So, for instance, take this HTML (straight out of HTML::TreeBuilder's documentation):

<ul>
  <li>Ice cream.</li>
  <li>Whipped cream.
  <li>Hot apple pie <br>(mmm pie)</li>
</ul>

The TreeBuilder will construct the following tree out of it:

<html> @0 (IMPLICIT)
  <head> @0.0 (IMPLICIT)
  <body> @0.1 (IMPLICIT)
    <ul> @0.1.0
      <li> @0.1.0.0
        "Ice cream."
      <li> @0.1.0.1
        "Whipped cream. "
      <li> @0.1.0.2
        "Hot apple pie "
        <br> @0.1.0.2.1
        "(mmm pie)"

Now, that's wonderful. Beautiful. But as I said, I am more of a visual person. So at the moment, what I do with this stuff is draw that damn flowchart by hand. For the example above, it'd look like this:

A rectangle at the top, with "html" written in it. Below that, I'd have two siblings: a rectangle with "head" in it and a rectangle with "body" in it. Under "body" would be a rectangle with "ul", and under "ul" would be a rectangle with "li" in it. And under "li" would be a string ("Ice cream.").

You get the idea, I hope. The flowchart helps me visualize the document's tree, which makes it very easy to come up with an algorithm to rip out whatever content I need from that tree.

So, my question is: is there a Perl script (or anything else, really; I don't care whether it's Perl or not, although it'd be awesome if it were a Perl solution) that I can feed an HTML file into, and that would produce the flowchart I described above?
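To make the request concrete, here is a rough sketch of the kind of tool meant, not a finished solution. It is written in Python with only the standard library rather than Perl, and it emits Graphviz DOT text that Graphviz can turn into a picture of boxes and arrows. Everything named here (`Node`, `TreeSketch`, `to_dot`, `html_to_dot`, the synthetic "document" root) is made up for illustration. Unlike HTML::TreeBuilder, it does not add the implicit html, head and body elements, and its only repair rule is that a new li closes a still-open li.

```python
from html.parser import HTMLParser

# Elements that never have content or a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """One item in the chart: an element, or a text string when tag is None."""

    def __init__(self, tag=None, text=None):
        self.tag = tag
        self.text = text
        self.children = []


class TreeSketch(HTMLParser):
    """Builds a simple tree from the parser's start/end tag events."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")  # synthetic root (an assumption)
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # Simplified repair: a new <li> closes a still-open <li>.
        if tag == "li" and self.stack[-1].tag == "li":
            self.stack.pop()
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text:
            self.stack[-1].children.append(Node(text=text))


def to_dot(root):
    """Return Graphviz DOT text: elements as boxes, strings as plain text."""
    lines = ["digraph html_tree {", "  node [shape=box];"]
    counter = 0

    def visit(node):
        nonlocal counter
        name = f"n{counter}"
        counter += 1
        if node.tag is None:
            label = node.text.replace("\\", "\\\\").replace('"', '\\"')
            lines.append(f'  {name} [shape=plaintext, label="{label}"];')
        else:
            lines.append(f'  {name} [label="{node.tag}"];')
        for child in node.children:
            lines.append(f"  {name} -> {visit(child)};")
        return name

    visit(root)
    lines.append("}")
    return "\n".join(lines)


def html_to_dot(html):
    """Parse an HTML string and return the DOT description of its tree."""
    parser = TreeSketch()
    parser.feed(html)
    parser.close()
    return to_dot(parser.root)


if __name__ == "__main__":
    sample = ("<ul> <li>Ice cream.</li> <li>Whipped cream. "
              "<li>Hot apple pie <br>(mmm pie)</li> </ul>")
    print(html_to_dot(sample))
```

If the output is saved as tree.dot, then with Graphviz installed, `dot -Tpng tree.dot -o tree.png` should render the chart. A Perl version could presumably do the same walk over the tree that HTML::TreeBuilder already builds.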

Thanks for any comments / suggestions.


In reply to HTML tree - making a flowchart of it. by vxp
