Re: Truncating HTML early

As an alternative to the other excellent suggestions you might consider using HTML::TreeBuilder - this will give you a data structure whose elements you can extract individually so that you needn't worry about splitting up nodes such as <table> . You could take this another step and use the aforementioned tidy to turn your HTML to well formed XHTML and then use XML::DOM on it ;-}

/J\

Comment on Re: Truncating HTML early Download Code

Replies are listed 'Best First'.
Re: Re: Truncating HTML early by drix (Initiate) on Mar 18, 2002 at 05:45 UTC
I would be careful using this approach. It recently fell on my shoulders to accomplish a very similar task as the poster's, namely: cut out a text-footer which was embedded in a <td>, which was embedded in a <tr>, which was embedded in a <table>...and so on, and replace it with an SSI include...on 20,000 pages, which only conform to a very loose coding standard. Naturally the first thing that came to mind was some sort of tree data structure, since I could just prune the limbs and replace them for the desired effect. So naturally the second thing that came to mind was HTML::TreeBuilder. I quickly discovered that this module is much more geared towards extracting information from an HTML file than altering one in-place. If you read the author's article in TPJ 19 you'll see as much. The module is really hampered by its lack of any semblance of an identity property. That is, in psuedocode, `$document != HTML::TreeBuilder->new($document)->dump_html();` It doesn't preserve whitespace and is apt to change your code by throwing closing tags, etc. While this is all to spec for HTML, we all know that in the real world this sort of behavior tends to break things with the umpteen flaky, finicky versions of NS & IE out there today. This is especially true when your documents were a mishmash of crappy, incorrect HTML in the first place (I work at a major public university, so every professor, student, and club seems to have a different and usually wrong way of making webpages.) So, eventually, I had to decide against using TreeBuilder, even though it would have been much easier and "cooler" from a CS/data structure point of view.	[reply] [d/l]

Replies are listed 'Best First'.

Re: Re: Truncating HTML early
by drix (Initiate) on Mar 18, 2002 at 05:45 UTC

I quickly discovered that this module is much more geared towards extracting information from an HTML file than altering one in-place. If you read the author's article in TPJ 19 you'll see as much. The module is really hampered by its lack of any semblance of an identity property. That is, in psuedocode, $document != HTML::TreeBuilder->new($document)->dump_html(); It doesn't preserve whitespace and is apt to change your code by throwing closing tags, etc. While this is all to spec for HTML, we all know that in the real world this sort of behavior tends to break things with the umpteen flaky, finicky versions of NS & IE out there today. This is especially true when your documents were a mishmash of crappy, incorrect HTML in the first place (I work at a major public university, so every professor, student, and club seems to have a different and usually wrong way of making webpages.) So, eventually, I had to decide against using TreeBuilder, even though it would have been much easier and "cooler" from a CS/data structure point of view.

[reply]
[d/l]