I'm working on adding code to an (already working) mailing list converter. The converter turns emails into HTML suitable for use in archives. The features I would like to add all involve removing things from the end of the message: plain whitespace (easy), HTML whitespace (medium), and trailing blockquotes (quite hard).
The email is already stored locally in a file and is read into a variable in memory. By "end of the file" I simply mean the last byte (or character, for those of us who speak Unicode) of that variable. The </body> and similar closing tags will be added later.
The emails will be coming from a variety of sources and lists. The one I posted above is simply an example; I can't count on its specifics.
The HTML should be well formed, with the understanding that the <body> and </body> tags will be left out. There should be no tables or anything like that - it should really be more or less straight text with simple markup, as you would find in an email sent as text/html. Again, by "end" I only mean the end of the variable.
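To make the first two items concrete, here is a rough sketch of stripping trailing whitespace and trailing "HTML whitespace" from the variable. The helper name trim_trailing_html_ws is made up for illustration, and the list of what counts as HTML whitespace (<br> tags, non-breaking space entities, and paragraphs containing only whitespace) is an assumption rather than a complete definition.

```perl
use strict;
use warnings;

# Hypothetical helper: remove plain and "HTML" whitespace from the end of
# the string. ASSUMPTION: HTML whitespace means <br> tags, non-breaking
# space entities, and paragraphs that contain nothing but whitespace.
sub trim_trailing_html_ws {
    my ($html) = @_;
    my $junk = qr{
        \s+                              # ordinary whitespace
      | &nbsp; | &\#160; | &\#xA0;       # non-breaking space entities
      | <br\s*/?>                        # br tag, with or without slash
      | <p\s*> (?: \s | &nbsp; )* </p\s*>  # paragraph with only whitespace
    }xi;
    # Remove the whole run of junk that reaches the end of the string.
    $html =~ s/(?:$junk)+\z//;
    return $html;
}

print trim_trailing_html_ws("Hello<br>\n&nbsp;<p> </p>  \n"), "\n";
```

Anchoring on \z matches the definition of "end" used here (the end of the variable). Whitespace in the middle of the message is left alone.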
I hope I've provided everything that you've asked for - if not, please let me know.
And I hope someone here has some ideas, since I'm stumped, specifically on the last one (removing possibly nested blockquotes).
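One possible direction for the nested case, as a sketch only: walk the blockquote open and close tags with a depth counter, remember where the last top-level blockquote starts, and cut from there if its closing tag is the last thing in the variable. The helper name strip_trailing_blockquotes is made up, and the code assumes well-formed markup with no "<blockquote" text hiding in comments or attribute values. It still finds tags with a pattern match, so a real HTML parser would be the safer tool.

```perl
use strict;
use warnings;

# Hypothetical helper: drop blockquote elements (including anything nested
# inside them) that sit at the very end of the string.
# ASSUMPTION: the markup is well formed and simple, as in text/html mail.
sub strip_trailing_blockquotes {
    my ($html) = @_;
    # Keep going while the string ends with a closing blockquote tag.
    while ($html =~ m{</blockquote\s*>\s*\z}i) {
        my ($depth, $start) = (0, undef);
        # Visit every blockquote tag in order, tracking nesting depth.
        while ($html =~ m{<(/?)blockquote\b[^>]*>}gi) {
            if ($1) {
                last if --$depth < 0;    # stray closing tag: give up
            }
            elsif ($depth++ == 0) {
                $start = $-[0];          # offset of a top-level open tag
            }
        }
        # Unbalanced tags: leave the text alone rather than guess.
        last if $depth != 0 || !defined $start;
        substr($html, $start) = '';      # cut the last top-level blockquote
        $html =~ s/\s+\z//;              # and any whitespace before it
    }
    return $html;
}

my $mail = "<p>Hi</p>\n<blockquote>old <blockquote>older</blockquote> text</blockquote>\n";
print strip_trailing_blockquotes($mail), "\n";
```

Because only the start of the last top-level blockquote is kept, the nested ones go with it in a single cut. A blockquote followed by reply text is not touched, since the outer loop only runs when a closing blockquote tag ends the string.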
PS: Yes, I realize that regexes should never be used for real HTML; they can get confused by HTML in comments, by whitespace or attributes in the middle of a tag, etc. I was only trying to give a simple example of why they were totally inadequate here. The code, as is, can already convert
In reply to Re^2: Using Perl to snip the end off of HTML by eastcoastcoder, in thread Using Perl to snip the end off of HTML by eastcoastcoder