I have HTML stored in a database used by a publishing system. The HTML is authored by in-house employees, so it is reasonably correct (tags nested and closed properly, etc.), known to be safe (no insidious JavaScript), reasonably consistent in style (no font tags), and reasonably well-written (good grammar, factually correct, professionally proofread, etc.).
I am working with a certain field that is sometimes quite long. It is stored as TEXT in SQL Server, and may go on for 20 pages in extreme cases.
I need to repurpose this information for a different use, and it has to be shorter. My goal is to chop the HTML somewhere around the first 1000 words or so, and if the original text was longer (i.e. if my truncation removed content), append a "Click here for rest of article" sort of message.
My question is: how do I truncate HTML cleanly? By "cleanly," I mean so that, after my truncation, my chopped-and-patched HTML is well-formed.
Clearly I can use HTML::Parser to avoid chopping a tag in half, but how do I know I'm not in the middle of a table or inside the label of a link when I chop?
Since it is possible I'm always inside a tag (say the entire field is wrapped in open and close SPAN tags), probably my best bet is to close all open tags when I truncate.
I could keep track of my open tags using a stack, pushing on opens and popping off closes (hmmm... that would also let me check for badly-nested tags at the same time, which I know will reveal problems), but then how do I know when a tag doesn't need a closing tag? That is, if I blindly push tags when they open, my stack will be loaded with IMG tags, HR tags, P tags, etc.
Can someone point me to a list of tags that don't need to be closed, or, better, offer a better way to approach this problem? I didn't find anything here or on CPAN.
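To make the stack idea concrete, here is a rough sketch of what I have in mind, using HTML::Parser's version 3 handler API. It is only a sketch under some assumptions: every non-void element is explicitly closed in the source (an omitted `</p>` or `</li>` would need extra rules), comments and declarations can be dropped, and the `%VOID` list (HTML 4's empty elements plus a few HTML5 ones) is my own guess at what to skip. The name `truncate_html` and the link URL are placeholders, not anything from CPAN.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Elements that never take a closing tag: the HTML 4 "empty" elements
# plus a few HTML5 additions. Assumed list; adjust for your markup.
my %VOID = map { $_ => 1 } qw(
    area base basefont br col frame hr img input isindex link meta param
    embed source track wbr
);

# Hypothetical helper: returns (truncated HTML, flag set if content was cut).
# Assumes every non-void element is explicitly closed in the source.
sub truncate_html {
    my ($html, $max_words) = @_;
    my ($out, $words, $done) = ('', 0, 0);
    my @open;    # stack of open element names, innermost last

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag, $text) = @_;
            # Once the word budget is spent, the next start tag ends the copy.
            if ($done || $words >= $max_words) { $done = 1; return }
            $out .= $text;
            push @open, $tag unless $VOID{$tag};
        }, 'tagname, text' ],
        end_h => [ sub {
            my ($tag, $text) = @_;
            # Skip end tags after the cut, and stray ones with no open match.
            return if $done or not grep { $_ eq $tag } @open;
            1 while @open && pop(@open) ne $tag;
            $out .= $text;
        }, 'tagname, text' ],
        text_h => [ sub {
            my ($text) = @_;
            return if $done;
            # Split into words and whitespace runs, keeping the whitespace.
            for my $piece (split /(\s+)/, $text) {
                if ($piece =~ /\S/) {
                    if ($words >= $max_words) { $done = 1; last }
                    $words++;
                }
                $out .= $piece;
            }
        }, 'text' ],
    );
    $p->parse($html);
    $p->eof;

    $out =~ s/\s+\z// if $done;           # tidy whitespace at the cut
    $out .= "</$_>" for reverse @open;    # close whatever is still open
    return ($out, $done);
}

my ($short, $cut) = truncate_html('<div><p>one <b>two three</b> four</p></div>', 2);
# /article/123 is a placeholder URL.
$short .= '<p><a href="/article/123">Click here for rest of article</a></p>' if $cut;
print "$short\n";
```

The stack only ever holds elements that expect a closing tag, so closing everything left on it at the cut point should keep the result well-formed as long as the input was. What I am unsure about is whether that void list is complete and how best to handle elements whose end tag is optional.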
Thanks!
nop