Truncating HTML early

by nop (Hermit)
on Mar 17, 2002 at 09:44 UTC ( [id://152298] )

nop has asked for the wisdom of the Perl Monks concerning the following question:

I have HTML stored in a database used by a publishing system. The HTML is authored by in-house employees, so it is reasonably correct (tags nested and closed properly, etc), known to be safe (no insidious javascript), reasonably consistent in style (no font tags), and reasonably well-written (good grammar, factually correct, has been professionally proofread, etc.).

I am working with a certain field that is sometimes quite long. It is stored as TEXT in SQL Server and may run to 20 pages in extreme cases.

I need to repurpose this information for a different use, and it has to be shorter. My goal is to chop the HTML at around the first 1000 words or so and, if the original text was longer (i.e. if my truncation removed content), append a "Click here for the rest of the article" sort of message.

My question is, how do I truncate HTML cleanly? By "cleanly," I mean so that, after my truncation, my chopped-and-patched HTML is well-formed.

Clearly I can use HTML::Parser to avoid chopping a tag in half, but how do I know I'm not in the middle of a table or inside the label of a link when I chop?

Since it is possible I'm always inside a tag (say the entire field is wrapped in open and close SPAN tags), probably my best bet is to close all open tags when I truncate.

I could keep track of my open tags using a stack, pushing on opens and popping off closes (hmmm... that would also let me check for badly-nested tags at the same time, which I know will reveal problems), but then how do I know when a tag doesn't need a closing tag? That is, if I blindly push tags when they open, my stack will be loaded with IMG tags, HR tags, P tags, etc.
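Here is roughly what I have in mind, as a minimal sketch (assuming HTML::Parser 3.x, reading the field from STDIN, and only a made-up shortlist of empty tags for now):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::Parser;

    # Placeholder list of tags that never take a closing tag, so they
    # must not go on the stack (see the HTML 4.01 spec for the full set).
    my %EMPTY = map { $_ => 1 } qw(br hr img input meta link area base col param);

    my $LIMIT = 1000;                       # word budget
    my $html  = do { local $/; <STDIN> };   # however you pull the field out of the database

    my ($words, $done, $out) = (0, 0, '');
    my @open;                               # stack of currently open tags

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag, $text) = @_;
            return if $done;
            push @open, $tag unless $EMPTY{$tag};
            $out .= $text;
        }, 'tagname, text' ],
        end_h => [ sub {
            my ($tag, $text) = @_;
            return if $done;
            pop @open if @open && $open[-1] eq $tag;
            $out .= $text;
        }, 'tagname, text' ],
        text_h => [ sub {
            my ($text) = @_;
            return if $done;
            $words += () = $text =~ /\S+/g;
            $out .= $text;                  # keeps the whole chunk that crosses the limit
            $done = 1 if $words >= $LIMIT;
        }, 'text' ],
    );
    $p->parse($html);
    $p->eof;

    # Close whatever is still open, innermost first.
    $out .= "</$_>" for reverse @open;
    $out .= '<p>Click here for the rest of the article</p>' if $done;

    print $out;

(The trailing message is just plain text here; in practice it would link back to the full article.)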

Can someone point me to a list of tags that don't need to be closed, or, better yet, suggest another way to approach this problem? I didn't find anything here or on CPAN.

Thanks!

nop

Replies are listed 'Best First'.
Re: Truncating HTML early
by blakem (Monsignor) on Mar 17, 2002 at 09:58 UTC
    You might try running the truncated text through HTML Tidy. Assuming the original HTML is as clean as you claim it to be, the only problems tidy will need to address are the dangling tags at the end of the file....
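    A minimal way to shell out to the tidy(1) command might look like this (assuming tidy is on the PATH; note that it wraps a fragment in a full html/head/body document unless told otherwise):

        use strict;
        use warnings;
        use IPC::Open2;

        # Pipe the chopped HTML through tidy and read back the cleaned version.
        # tidy exits non-zero when it merely issues warnings, so don't die on that.
        sub tidy_up {
            my ($dirty) = @_;
            my $pid = open2(my $from_tidy, my $to_tidy, 'tidy', '-q', '-wrap', '0');
            print {$to_tidy} $dirty;
            close $to_tidy;
            my $clean = do { local $/; <$from_tidy> };
            waitpid $pid, 0;
            return $clean;
        }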

    -Blake

      This sounds like a great solution, blakem -- just smash the code and let something else clean it up! Laziness as a Virtue. Will try this later today -- thanks.
Re: Truncating HTML early
by seattlejohn (Deacon) on Mar 17, 2002 at 10:04 UTC
    Seems to me like you've come up with a pretty rational approach. You can get the official list of empty tags (BR, IMG, etc.) and tags for which end tags are optional (these are trickier to handle, in my experience) from the HTML4 spec at the W3C. This page should tell you what you need to know. (Of course, there's no guarantee that the HTML is strictly 4.01-compliant, but since you're talking about in-house documents that may not be a huge problem.)
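    For reference, the element index there boils down to something like this (lower-cased to match what HTML::Parser reports):

        # End tag forbidden (empty elements) and end tag optional,
        # per the HTML 4.01 index of elements.
        my %NO_END_TAG = map { $_ => 1 } qw(
            area base basefont br col frame hr img input isindex link meta param
        );
        my %OPTIONAL_END_TAG = map { $_ => 1 } qw(
            body colgroup dd dt head html li option p
            tbody td tfoot th thead tr
        );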

    From a process perspective, you could try starting with a pretty simple implementation, run it against the data set, then put the output through an automatic HTML validator to see where your solution breaks down in practice. With a couple of iterations of that you might be able to get through 95% of the material and decide that the other 5% can be tweaked manually in less time than it would take you to write code to handle all the bizarro special cases. (This is assuming you're using the script to deal with backlogged text and don't need to worry about coping with those special cases in the future, of course.)

Re: Truncating HTML early
by gellyfish (Monsignor) on Mar 17, 2002 at 11:17 UTC

    As an alternative to the other excellent suggestions you might consider using HTML::TreeBuilder - this will give you a data structure whose elements you can extract individually, so that you needn't worry about splitting up nodes such as <table>. You could take this another step and use the aforementioned tidy to turn your HTML into well-formed XHTML and then use XML::DOM on it ;-}
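    A rough sketch of that with HTML::TreeBuilder (the word limit and the STDIN input are just placeholders) could be:

        use strict;
        use warnings;
        use HTML::TreeBuilder;

        my $LIMIT = 1000;
        my $words = 0;

        my $tree = HTML::TreeBuilder->new;
        $tree->parse(do { local $/; <STDIN> });
        $tree->eof;

        # Depth-first walk: keep children until the word budget is spent,
        # then throw the rest away.  Containers left empty afterwards can
        # be cleaned up with tidy.
        sub prune {
            my ($elem) = @_;
            my @children = $elem->detach_content;
            for my $node (@children) {
                if ($words >= $LIMIT) {
                    $node->delete if ref $node;    # text nodes are plain strings
                    next;
                }
                if (ref $node) {
                    prune($node);
                }
                else {
                    $words += () = $node =~ /\S+/g;
                }
                $elem->push_content($node);
            }
        }

        prune($tree);
        print $tree->as_HTML;    # note: includes the html/head/body wrapper TreeBuilder adds
        $tree->delete;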

    /J\

      I would be careful using this approach. It recently fell on my shoulders to accomplish a very similar task as the poster's, namely: cut out a text-footer which was embedded in a <td>, which was embedded in a <tr>, which was embedded in a <table>...and so on, and replace it with an SSI include...on 20,000 pages, which only conform to a very loose coding standard. Naturally the first thing that came to mind was some sort of tree data structure, since I could just prune the limbs and replace them for the desired effect. So naturally the second thing that came to mind was HTML::TreeBuilder.

      I quickly discovered that this module is much more geared towards extracting information from an HTML file than altering one in place. If you read the author's article in TPJ 19 you'll see as much. The module is really hampered by its lack of any semblance of an identity property. That is, in pseudocode, $document != HTML::TreeBuilder->new($document)->dump_html(); It doesn't preserve whitespace and is apt to change your code by throwing in closing tags, etc. While this is all to spec for HTML, we all know that in the real world this sort of behavior tends to break things with the umpteen flaky, finicky versions of NS & IE out there today. This is especially true when your documents were a mishmash of crappy, incorrect HTML in the first place (I work at a major public university, so every professor, student, and club seems to have a different and usually wrong way of making webpages). So, eventually, I had to decide against using TreeBuilder, even though it would have been much easier and "cooler" from a CS/data-structure point of view.

Re: Truncating HTML early
by erikharrison (Deacon) on Mar 17, 2002 at 15:35 UTC

    Have you considered using RSS? It's a totally different approach to your problem - instead of extracting data from the HTML and then trying to clean it up, RSS could allow you to build quick, clean overviews of the pages and place them in various "channels" on a front page. Perl.com has several good articles on RSS; there is a good one at http://www.perl.com/pub/a/2001/11/15/creatingrss.html.
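    If you go that route, a bare-bones channel with XML::RSS (the titles and links below are made up) looks roughly like:

        use strict;
        use warnings;
        use XML::RSS;

        my $rss = XML::RSS->new(version => '1.0');
        $rss->channel(
            title       => 'Article summaries',                 # placeholder values
            link        => 'http://www.example.com/',
            description => 'Short overviews of the long articles',
        );

        # One item per article: a title, a link back to the full text,
        # and a short plain-text description instead of truncated HTML.
        $rss->add_item(
            title       => 'Some long article',
            link        => 'http://www.example.com/articles/42',
            description => 'The first few sentences of the article...',
        );

        print $rss->as_string;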

    Cheers,
    Erik
Re: Truncating HTML early
by fokat (Deacon) on Mar 17, 2002 at 17:57 UTC
    In addition to the good suggestions offered by fellow monks, I would point out that your users might also appreciate a special tag to tell your code where to split the page.

    This would allow your content producers to insert this tag, let's say a particular comment, in a place where it makes sense syntactically and semantically (instead of just syntactically). The way I see it, it could work like a very strong hint about where to break up the HTML.
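    For instance, with a marker comment (the name and link target below are just examples), the split becomes trivial:

        use strict;
        use warnings;

        my $html     = do { local $/; <STDIN> };                # the stored field
        my $full_url = 'http://www.example.com/articles/42';    # made-up link target

        # Authors drop <!-- cut --> wherever the teaser should end.
        my ($teaser, $rest) = split /<!--\s*cut\s*-->/i, $html, 2;
        if (defined $rest && $rest =~ /\S/) {
            $teaser .= qq{<p><a href="$full_url">Click here for the rest of the article</a></p>};
        }
        print $teaser;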

    Since I get from your post that your producers are trustworthy, this might add a lot of value to what you're trying to achieve while keeping the risk of abuse to a minimum. Regards.

Re: Truncating HTML early
by Anonymous Monk on Mar 17, 2002 at 20:28 UTC
    Slash (the code that runs Slashdot) has to do this often. For display in the main comment list, we truncate users' comments that exceed a certain length, and then have to close up all the tags and such.

    This is done near the top of the dispComment() function, in Slash.pm. Basically, we call chopEntity() which truncates to a given size without interrupting an HTML entity; strip_html() which takes out any illegal HTML tags; then balanceTags() which rebalances everything. Those are all in Slash::Utility::Data.

    You can ignore the </A> fixing and addDomainTags() since you won't be using that of course.

    - Jamie

Re: Truncating HTML early
by gav^ (Curate) on Mar 17, 2002 at 19:46 UTC
    Not knowing what the HTML looks like, this may or may not be too simplistic a solution: break the text up into paragraphs (hopefully you will have opening and closing <p> tags) and then count the number of words in each paragraph. When you reach 1000, stop outputting. This should mean that all your tags are closed (if your HTML is good) and that you are not breaking in the middle of a paragraph, which looks messy.
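    Something along these lines, assuming every paragraph really is wrapped in <p>...</p>:

        use strict;
        use warnings;

        my $html  = do { local $/; <STDIN> };
        my $LIMIT = 1000;
        my ($words, $out) = (0, '');

        # Take one <p>...</p> block at a time and stop once the budget is spent.
        while ($html =~ m{(<p\b[^>]*>.*?</p>)}gis) {
            my $para = $1;
            (my $plain = $para) =~ s/<[^>]+>//g;          # crude de-tagging, only for counting
            $words += () = $plain =~ /\S+/g;
            $out .= $para;
            last if $words >= $LIMIT;
        }
        print $out;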

    gav^

Re: Truncating HTML early
by Anonymous Monk on Mar 17, 2002 at 21:47 UTC
    The best way to do this would be to use HTML::Parser to parse the entire document into a tree-like structure, using hashrefs to store information about each element (elements including tags and text). Then, using recursion, go through the tree, printing out each element and summing the number of words of plain text printed. At the top of the recursive sub, put an if statement checking whether the number of words is greater than a certain sum, or whether you're in a table or any other such tag you want to specify. - Silicon
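    A toy version of that recursion over a hand-rolled hashref tree (this tree shape is just one possible layout, not what HTML::Parser hands you directly):

        use strict;
        use warnings;

        # Each element is a hashref with a tag name and a list of children;
        # text nodes are plain strings.
        my $tree = {
            tag      => 'div',
            children => [
                { tag => 'p', children => [ 'First paragraph of text.' ] },
                { tag => 'p', children => [ 'Second paragraph of text.' ] },
            ],
        };

        my $LIMIT = 1000;
        my $words = 0;

        sub emit {
            my ($node) = @_;
            return '' if $words >= $LIMIT;        # budget spent: emit nothing further
            unless (ref $node) {                  # a text node
                $words += () = $node =~ /\S+/g;
                return $node;
            }
            my $inner = join '', map { emit($_) } @{ $node->{children} };
            return "<$node->{tag}>$inner</$node->{tag}>";   # empty tags like <br> would need special-casing
        }

        print emit($tree), "\n";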
Re: Truncating HTML early
by petdance (Parson) on Mar 18, 2002 at 04:43 UTC
    I don't have suggestions on doing the truncating, but whatever method you've chosen, you can send the resulting output into an instance of my HTML::Lint object and validate that it's still well-formed.
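    Checking the chopped output could be as simple as this (assuming the truncated HTML arrives on STDIN):

        use strict;
        use warnings;
        use HTML::Lint;

        my $truncated = do { local $/; <STDIN> };   # the chopped-and-patched HTML

        my $lint = HTML::Lint->new;
        $lint->parse($truncated);
        $lint->eof;

        # Each error knows how to describe itself.
        print $_->as_string, "\n" for $lint->errors;
        print "Looks clean\n" unless $lint->errors;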

    xoxo,
    Andy
    --
    <megaphone> Throw down the gun and tiara and come out of the float! </megaphone>

Re: Truncating HTML early
by nop (Hermit) on Mar 18, 2002 at 18:59 UTC
    Thanks, all, for your excellent suggestions.

    While working on this today, I had yet another idea, so I thought I'd share it here.

    Later edit by nop:
    The following suggestion was not a good one --
    leaving in the empty tags creates HTML littered with empty bulleted lists, etc.
    Yuck.
    My final solution was to stop taking text after some limit of words, then use tidy to clean out the empty tags.

    original post:
    Assuming there's much text relative to markup, and assuming the markup is reasonably well-formed, an easy solution to this problem using HTML::Parser is to count words in the text handler, and when that count exceeds a limit, have the handler stop appending new text to the result string. Allow the start and end handlers to continue adding their tags. The idea here is that when the text limit is reached, the rest of the markup (probably not much) will flow out empty.

    This wouldn't be a good solution for HTML with heavy markup, but I think it may work in my case...
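    In sketch form (HTML::Parser 3.x again, with the field arriving on STDIN as a stand-in), the idea is just:

        use strict;
        use warnings;
        use HTML::Parser;

        my $html  = do { local $/; <STDIN> };
        my $LIMIT = 1000;
        my ($words, $out) = (0, '');

        # Start and end tags keep flowing into the output no matter what;
        # only text stops once the word budget is spent.  The empty tail of
        # markup this leaves behind is then handed to tidy.
        my $p = HTML::Parser->new(
            api_version => 3,
            start_h => [ sub { $out .= shift }, 'text' ],
            end_h   => [ sub { $out .= shift }, 'text' ],
            text_h  => [ sub {
                my ($text) = @_;
                return if $words >= $LIMIT;
                $words += () = $text =~ /\S+/g;
                $out .= $text;
            }, 'text' ],
        );
        $p->parse($html);
        $p->eof;

        print $out;    # then run this through tidy to strip the now-empty tags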

    nop
