I would be careful with this approach. It recently fell on my shoulders to accomplish a task very similar to the poster's, namely: cut out a text footer that was embedded in a <td>, which was embedded in a <tr>, which was embedded in a <table>... and so on, and replace it with an SSI include... on 20,000 pages that conform only to a very loose coding standard. Naturally, the first thing that came to mind was some sort of tree data structure, since I could just prune the limbs and replace them for the desired effect. So naturally the second thing that came to mind was HTML::TreeBuilder.

I quickly discovered that this module is geared much more towards extracting information from an HTML file than altering one in place. If you read the author's article in TPJ 19 you'll see as much. The module is really hampered by its lack of any semblance of an identity property. That is, in pseudocode, $document != HTML::TreeBuilder->new($document)->dump_html(); It doesn't preserve whitespace, and it is apt to change your markup by throwing in closing tags, etc. While this is all to spec for HTML, we all know that in the real world this sort of behavior tends to break things with the umpteen flaky, finicky versions of NS & IE out there today. This is especially true when your documents were a mishmash of crappy, incorrect HTML in the first place (I work at a major public university, so every professor, student, and club seems to have a different, and usually wrong, way of making webpages). So, eventually, I had to decide against using TreeBuilder, even though it would have been much easier and "cooler" from a CS/data structure point of view.
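
To make the identity complaint concrete, here is a minimal sketch of the round trip I mean. The sample markup is made up for illustration (it is not from my actual pages), and I'm using the module's real new_from_content and as_HTML methods rather than the pseudocode names above. Treat it as a way to check the behavior on your own documents, not as a definitive test.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sloppy input: unclosed <td>/<tr>, stray newline.
my $original = "<table><tr><td>footer text\n</table>";

# Parse the markup into a tree, then serialize it straight back out.
my $tree          = HTML::TreeBuilder->new_from_content($original);
my $round_tripped = $tree->as_HTML;
$tree->delete;    # free the tree explicitly when done with it

# Compare the serialized output with what we started from.
if ($round_tripped eq $original) {
    print "round trip preserved the document\n";
}
else {
    print "round trip changed the document:\n$round_tripped\n";
}
```

If the second branch fires on a sample of your pages, expect the same kind of rewriting on all of them, which is what pushed me away from the module for in-place edits.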


In reply to Re: Re: Truncating HTML early by drix
in thread Truncating HTML early by nop
