Re: Truncating HTML early

Seems to me like you've come up with a pretty rational approach. You can get the official list of empty elements (BR, IMG, etc.) and of elements whose end tags are optional (these are trickier to handle, in my experience) from the HTML 4.01 spec at the W3C. This page should tell you what you need to know. (Of course, there's no guarantee that the HTML is strictly 4.01-compliant, but since you're talking about in-house documents that may not be a huge problem.)
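
For what it's worth, here is a rough sketch of how those two lists might feed a truncator. It's in Python rather than whatever you're actually using, and the names (Truncator, truncate_html, the character limit) are made up for illustration. It assumes you want to cut after a given number of text characters and then close whatever is still open. The optional-end-tag handling only covers the simplest case (the same element repeated, like LI after LI), and comments, the doctype, and SCRIPT/STYLE content aren't treated specially, so take it as a starting point only.

```python
from html import escape
from html.parser import HTMLParser

# Elements declared EMPTY in HTML 4.01: they never get an end tag.
EMPTY_TAGS = {"area", "base", "basefont", "br", "col", "frame", "hr",
              "img", "input", "isindex", "link", "meta", "param"}

# Elements whose end tag is optional in HTML 4.01 (BODY, HEAD and HTML
# are left out because they should only appear once anyway).
OPTIONAL_END = {"colgroup", "dd", "dt", "li", "option", "p",
                "tbody", "td", "tfoot", "th", "thead", "tr"}


class Truncator(HTMLParser):
    """Copies markup until `limit` text characters have been emitted."""

    def __init__(self, limit):
        super().__init__(convert_charrefs=True)
        self.remaining = limit   # text characters still allowed
        self.out = []            # pieces of the output document
        self.open_tags = []      # stack of elements awaiting an end tag
        self.done = False        # set once the limit has been hit

    def _close_top(self):
        self.out.append(f"</{self.open_tags.pop()}>")

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        # Simplification: a repeated optional-end element (<li> after <li>)
        # implicitly closes the previous one.
        if tag in OPTIONAL_END and self.open_tags and self.open_tags[-1] == tag:
            self._close_top()
        self.out.append(self.get_starttag_text())
        if tag not in EMPTY_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.done or tag not in self.open_tags:
            return
        # Close any elements whose end tags were omitted, then the tag itself.
        while self.open_tags[-1] != tag:
            self._close_top()
        self._close_top()

    def handle_data(self, data):
        if self.done:
            return
        if len(data) > self.remaining:
            data = data[:self.remaining]
            self.done = True
        self.remaining -= len(data)
        # Text arrives unescaped, so escape it again on the way out.
        self.out.append(escape(data, quote=False))

    def result(self):
        closers = "".join(f"</{t}>" for t in reversed(self.open_tags))
        return "".join(self.out) + closers


def truncate_html(source, limit):
    parser = Truncator(limit)
    parser.feed(source)
    parser.close()
    return parser.result()


print(truncate_html("<p>Hello <b>big world</b></p>", 9))
# prints: <p>Hello <b>big</b></p>
```

Keeping the open elements on a stack is what makes the final "close everything" step cheap; the empty-element list just tells you what never to push.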

From a process perspective, you could try starting with a pretty simple implementation, running it against the data set, then putting the output through an automatic HTML validator to see where your solution breaks down in practice. With a couple of iterations of that you might be able to get through 95% of the material and decide that the other 5% can get tweaked manually in less time than it would take you to write code to handle all the bizarro special cases. (This is assuming you're using the script to deal with backlogged text and don't need to worry about coping with those special cases in the future, of course.)
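
A sketch of what that loop could look like, again in Python and again with invented names (BalanceChecker, check, triage). The checker here is only a crude stand-in for a real validator: it just looks for end tags that don't match the innermost open element and for elements left open, so it will also complain about legitimately omitted end tags such as LI or P. In practice you'd swap check() for a call to whatever validator you trust. The truncate argument is whatever truncation function you're currently iterating on.

```python
from html.parser import HTMLParser

# Elements declared EMPTY in HTML 4.01: they never get an end tag.
EMPTY_TAGS = {"area", "base", "basefont", "br", "col", "frame", "hr",
              "img", "input", "isindex", "link", "meta", "param"}


class BalanceChecker(HTMLParser):
    """Crude stand-in for a validator: tracks open elements on a stack."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.errors = []

    def handle_starttag(self, tag, attrs):
        if tag not in EMPTY_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in EMPTY_TAGS:
            return
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            self.errors.append(f"unexpected </{tag}>")


def check(html_text):
    """Return a list of problems found in html_text (empty if none)."""
    checker = BalanceChecker()
    checker.feed(html_text)
    checker.close()
    return checker.errors + [f"unclosed <{t}>" for t in checker.stack]


def triage(docs, truncate, limit):
    """Truncate every document and return {name: errors} for the failures."""
    failures = {}
    for name, source in docs.items():
        errors = check(truncate(source, limit))
        if errors:
            failures[name] = errors
    return failures


# Hypothetical data set and a deliberately naive first implementation.
docs = {"a": "<p>hi</p>", "b": "<p>hello <b>world</b></p>"}
failures = triage(docs, lambda s, n: s[:n], 12)
print(f"{len(failures)} of {len(docs)} documents need attention")
```

Each pass gives you the list of documents that still break, which is what tells you when you've reached the point where fixing the remainder by hand is cheaper than writing more code.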
