vxp has asked for the wisdom of the Perl Monks concerning the following question:

Hi.

I do a lot (and I mean A LOT) of HTML parsing, for various reasons. I use HTML::TreeBuilder and HTML::TokeParser for this.

What would be REALLY useful for me is a "flowchart", so to speak, of an HTML file that I am about to parse. I'll explain what I mean in a bit.

The thing is that my brain seems to process things visually much better than by thinking about HTML tags and arranging them in my head. So, for instance, take this HTML (straight out of HTML::TreeBuilder's documentation):

<ul>
  <li>Ice cream.</li>
  <li>Whipped cream.
  <li>Hot apple pie <br>(mmm pie)</li>
</ul>

The TreeBuilder will construct the following tree out of it:

<html> @0 (IMPLICIT)
  <head> @0.0 (IMPLICIT)
  <body> @0.1 (IMPLICIT)
    <ul> @0.1.0
      <li> @0.1.0.0
        "Ice cream."
      <li> @0.1.0.1
        "Whipped cream. "
      <li> @0.1.0.2
        "Hot apple pie "
        <br> @0.1.0.2.1
        "(mmm pie)"

Now, that's wonderful. Beautiful. But as I said, I am more of a visual person. So at the moment, what I do with this stuff is draw that damn flowchart by hand. For the example above, it'd look like this:

A rectangle at the top, with "html" written in it. Below that, I'd have two siblings: a rectangle with "head" in it and a rectangle with "body" in it. Under "body" would be a rectangle with "ul", and under "ul" would be a rectangle with "li" in it. And under "li" would be a string ("Ice cream.").

You get the idea, I hope. The flowchart helps me visualize the document's tree. That makes it very easy to come up with an algorithm to rip out whatever contents I need from that tree.

So, my question is: is there a Perl script (or anything else, really; I don't care whether it's Perl or not, although it'd be awesome if it were a Perl solution) that I can feed an HTML file into and that would produce the flowchart I described above?

Thanks for any comments / suggestions.

Replies are listed 'Best First'.
Re: HTML tree - making a flowchart of it.
by merlyn (Sage) on Oct 28, 2009 at 14:22 UTC
    I wouldn't be surprised if there were already something in the GraphViz family on CPAN for that. And if not, converting an HTML tree into node links that can be fed into command-line graphviz is probably a half-hour of programming at most.
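
A minimal sketch of that conversion, assuming HTML::TreeBuilder is installed and Graphviz's dot command is available. The names html_to_dot, add_node and dot_escape are made up for illustration; they are not part of any module. Elements become boxes, text becomes plain labels, and each parent gets an edge to each child:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Escape a string for use inside a double-quoted DOT label.
sub dot_escape {
    my ($s) = @_;
    $s =~ s/\\/\\\\/g;    # backslashes first
    $s =~ s/"/\\"/g;      # then double quotes
    $s =~ s/\s+/ /g;      # collapse whitespace runs, including newlines
    return $s;
}

# Recursively emit one DOT node per element or text segment,
# plus an edge from its parent (hypothetical helper).
sub add_node {
    my ($node, $parent_id, $state) = @_;
    my $id = 'n' . $state->{next_id}++;

    if (ref $node) {
        # An HTML::Element: label the box with the tag name.
        push @{ $state->{lines} },
            sprintf '  %s [label="%s"];', $id, dot_escape($node->tag);
    }
    else {
        # A plain string: text content, drawn without a box.
        push @{ $state->{lines} },
            sprintf '  %s [label="%s", shape=plaintext];', $id, dot_escape($node);
    }

    push @{ $state->{lines} }, "  $parent_id -> $id;" if defined $parent_id;

    if (ref $node) {
        add_node($_, $id, $state) for $node->content_list;
    }
    return;
}

# Turn an HTML string into Graphviz DOT text (hypothetical helper).
sub html_to_dot {
    my ($html) = @_;
    my $tree  = HTML::TreeBuilder->new_from_content($html);
    my $state = {
        next_id => 0,
        lines   => [ 'digraph html_tree {', '  node [shape=box];' ],
    };
    add_node($tree, undef, $state);
    $tree->delete;    # release the parse tree
    push @{ $state->{lines} }, '}';
    return join("\n", @{ $state->{lines} }) . "\n";
}

# Command-line use: read the file named in the first argument, print DOT.
if (@ARGV) {
    my $file = shift @ARGV;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my $html = do { local $/; <$fh> };
    close $fh;
    print html_to_dot($html);
}
```

Saved as, say, html2dot.pl (the filename is arbitrary), it could be run as `perl html2dot.pl page.html | dot -Tpng -o tree.png`. Character encoding is not handled here, and very long text nodes will produce wide labels, so truncating them may be worthwhile for real pages.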

    -- Randal L. Schwartz, Perl hacker

    The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Re: HTML tree - making a flowchart of it.
by Fletch (Bishop) on Oct 28, 2009 at 14:47 UTC

    Not a Perl solution, but something like Firebug or Safari's built-in inspector provides an interactive DOM tree which you can poke and prod. Also possibly of interest is the SelectorGadget bookmarklet, which lets you click on a rendered HTML page and get CSS and/or XPath selectors interactively.

    The cake is a lie.
    The cake is a lie.
    The cake is a lie.