Re: Re: Truncating HTML early

I would be careful using this approach. It recently fell on my shoulders to accomplish a very similar task as the poster's, namely: cut out a text-footer which was embedded in a <td>, which was embedded in a <tr>, which was embedded in a <table>...and so on, and replace it with an SSI include...on 20,000 pages, which only conform to a very loose coding standard. Naturally the first thing that came to mind was some sort of tree data structure, since I could just prune the limbs and replace them for the desired effect. So naturally the second thing that came to mind was HTML::TreeBuilder.

I quickly discovered that this module is much more geared towards extracting information from an HTML file than altering one in-place. If you read the author's article in TPJ 19 you'll see as much. The module is really hampered by its lack of any semblance of an identity property. That is, in psuedocode, $document != HTML::TreeBuilder->new($document)->dump_html(); It doesn't preserve whitespace and is apt to change your code by throwing closing tags, etc. While this is all to spec for HTML, we all know that in the real world this sort of behavior tends to break things with the umpteen flaky, finicky versions of NS & IE out there today. This is especially true when your documents were a mishmash of crappy, incorrect HTML in the first place (I work at a major public university, so every professor, student, and club seems to have a different and usually wrong way of making webpages.) So, eventually, I had to decide against using TreeBuilder, even though it would have been much easier and "cooler" from a CS/data structure point of view.

Comment on Re: Re: Truncating HTML early Download Code