Seems to me like you've come up with a pretty rational approach. You can get the official list of empty tags (BR, IMG, etc.) and tags for which end tags are optional (these are trickier to handle, in my experience) from the HTML4 spec at the W3C. This page should tell you what you need to know. (Of course, there's no guarantee that the HTML is strictly 4.01-compliant, but since you're talking about in-house documents that may not be a huge problem.)
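To make the empty-tag bookkeeping concrete, here is a rough sketch of the idea, written in Python purely for illustration (the same logic carries over to Perl). It assumes the truncation limit counts visible text characters, and the names `Truncator` and `truncate_html` are made up for this example. The empty-element set is the one HTML 4.01 declares EMPTY. Optional end tags are only handled crudely: an unclosed element is closed when an enclosing end tag arrives or at the cut point.

```python
from html.parser import HTMLParser

# Elements declared EMPTY in HTML 4.01: they never take an end tag.
EMPTY_TAGS = {"area", "base", "basefont", "br", "col", "frame", "hr",
              "img", "input", "isindex", "link", "meta", "param"}


class _Done(Exception):
    """Raised internally to stop parsing once the text budget is spent."""


class Truncator(HTMLParser):
    """Hypothetical helper: copies markup through until `limit` text chars."""

    def __init__(self, limit):
        super().__init__(convert_charrefs=False)
        self.remaining = limit
        self.out = []        # pieces of output markup
        self.open_tags = []  # non-empty elements still awaiting an end tag

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())
        if tag not in EMPTY_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # XHTML-style <br/>: emit as-is, nothing to close later.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray end tag: drop it
        # Close everything back to the matching start tag.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        if len(data) >= self.remaining:
            self.out.append(data[:self.remaining])
            raise _Done
        self.out.append(data)
        self.remaining -= len(data)

    def _entity(self, text):
        # An entity counts as one visible character and is never split.
        self.out.append(text)
        self.remaining -= 1
        if self.remaining <= 0:
            raise _Done

    def handle_entityref(self, name):
        self._entity(f"&{name};")

    def handle_charref(self, name):
        self._entity(f"&#{name};")


def truncate_html(html, limit):
    """Cut `html` after `limit` text characters and close any open tags."""
    parser = Truncator(limit)
    try:
        parser.feed(html)
        parser.close()
    except _Done:
        pass
    closers = "".join(f"</{t}>" for t in reversed(parser.open_tags))
    return "".join(parser.out) + closers


print(truncate_html("<p>Hello <b>big <br>world</b> and more</p>", 12))
```

Note that the BR never lands on the stack, so no bogus end tag is emitted for it. Comments and declarations are silently dropped in this sketch, which may or may not be what you want.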

From a process perspective, you could try starting with a pretty simple implementation, running it against the data set, then putting the output through an automatic HTML validator to see where your solution breaks down in practice. With a couple of iterations of that you might be able to get through 95% of the material and decide that the other 5% can get tweaked manually in less time than it would take you to write code to handle all the bizarro special cases. (This is assuming you're using the script to deal with backlogged text and don't need to worry about coping with those special cases in the future, of course.)
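The triage step might look something like the sketch below, again in Python just for illustration. It is not a real validator: `find_problems` is a hypothetical stand-in that only checks tag balance, and it is stricter than HTML 4.01 because it flags omitted optional end tags too. In practice you would swap in whatever validator you have on hand and keep the same loop.

```python
from html.parser import HTMLParser

# Elements declared EMPTY in HTML 4.01: they never take an end tag.
EMPTY_TAGS = {"area", "base", "basefont", "br", "col", "frame", "hr",
              "img", "input", "isindex", "link", "meta", "param"}


class BalanceChecker(HTMLParser):
    """Crude stand-in for a validator: records mismatched tags only."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag not in EMPTY_TAGS:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        pass  # self-closing form such as <br/>: nothing to balance

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            self.problems.append(f"unexpected </{tag}>")


def find_problems(html):
    """Return a list of balance problems; an empty list means it looks fine."""
    checker = BalanceChecker()
    checker.feed(html)
    checker.close()
    return checker.problems + [f"unclosed <{t}>" for t in checker.stack]


def triage(outputs):
    """Split {name: html} into (names that pass, {name: problems})."""
    ok, broken = [], {}
    for name, html in outputs.items():
        problems = find_problems(html)
        if problems:
            broken[name] = problems
        else:
            ok.append(name)
    return ok, broken


# Made-up sample outputs, standing in for the truncated documents.
ok, broken = triage({
    "a.html": "<p>fine</p>",
    "b.html": "<p><b>cut short</p>",
    "c.html": "<ul><li>item<br></li></ul>",
})
print(f"{len(ok)} of {len(ok) + len(broken)} passed")
for name, problems in broken.items():
    print(name, problems)
```

The ratio of passing to failing documents after each run is what tells you whether another round of coding is worth it or whether the leftovers are quicker to fix by hand.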


In reply to Re: Truncating HTML early by seattlejohn
in thread Truncating HTML early by nop
