Simply closing tags is not enough to clean up HTML. Not all HTML tags come in pairs wrapped around text. In particular, <BR> is an empty element: HTML 4.01 writes it as <BR> with no closing tag, while XHTML writes it as <br />. It marks a line break, not a paragraph.
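Your program will have to do three things: (1) parse the HTML into tags, attributes, and text; (2) decide how each tag should be cleaned up; and (3) turn the cleaned-up data back into a string.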
There are already several modules on CPAN that can do much of this for you, among them HTML::Tidy and HTML::Lint.
If you want to do this on your own, please keep in mind that the first step, parsing HTML properly, is non-trivial, especially if the HTML is poorly formed. HTML looks like something you should be able to parse easily with a regular expression, but its nested tags make that much more difficult. Even Andy Lester didn't try to do it on his own when he wrote HTML::Lint. He used HTML::Parser, and you may want to do that as well.
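As a rough illustration of step 1, here is a minimal sketch that uses HTML::Parser's version 3 handler interface to collect tags and text into an array of hashes. The @tokens array and the shape of each hash are my own choices for this example, not something HTML::Parser or HTML::Lint prescribes.

```perl
use strict;
use warnings;
use HTML::Parser;    # CPAN module, not part of core Perl

# Collect every tag and text run as a small hash, in document order.
# (@tokens and the hash layout are hypothetical, chosen for this sketch.)
my @tokens;

my $parser = HTML::Parser->new(
    api_version => 3,
    # 'tagname' is already lower-cased by the parser; 'attr' is a hash ref.
    start_h => [
        sub {
            my ($tag, $attr) = @_;
            push @tokens, { type => 'start', tag => $tag, attr => $attr };
        },
        'tagname, attr',
    ],
    end_h => [
        sub { push @tokens, { type => 'end', tag => $_[0] } },
        'tagname',
    ],
    # 'dtext' is the text with entities such as &amp; decoded.
    text_h => [
        sub { push @tokens, { type => 'text', text => $_[0] } },
        'dtext',
    ],
);

$parser->parse('<P>First line<BR>second line');
$parser->eof;    # flush anything the parser is still holding

# Show what was collected, one token per line.
printf "%-5s %s\n", $_->{type}, $_->{tag} // $_->{text} for @tokens;
```

Note that the parser reports what it sees and nothing more: the missing </P> is not invented for you, which is why the clean-up step is still needed.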
For step 2, you will want to take a close look at the W3C specifications for HTML 4.01 (strict) and XHTML 1.0. They will help you decide how you should clean up each particular tag.
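To give a feel for what a step 2 rule might look like, here is a deliberately naive sketch that closes elements left open, using a stack of open tag names. It assumes tokens are hashes with type, tag, and text keys and lower-cased tag names (a layout invented for this example), it lists only a handful of empty elements, and it ignores the optional end tag rules that the specifications describe for elements such as <p> and <li>.

```perl
use strict;
use warnings;

# A partial list of elements that never take an end tag (illustration only).
my %is_empty = map { $_ => 1 } qw(br hr img input meta link);

# Return a new token list in which every non-empty element is explicitly closed.
sub close_open_tags {
    my @tokens = @_;
    my (@stack, @out);
    for my $tok (@tokens) {
        if ($tok->{type} eq 'start') {
            push @stack, $tok->{tag} unless $is_empty{ $tok->{tag} };
            push @out, $tok;
        }
        elsif ($tok->{type} eq 'end') {
            # Drop an end tag that matches nothing currently open.
            next unless grep { $_ eq $tok->{tag} } @stack;
            # Close anything opened after the matching start tag, then the tag itself.
            while (@stack) {
                my $open = pop @stack;
                push @out, { type => 'end', tag => $open };
                last if $open eq $tok->{tag};
            }
        }
        else {
            push @out, $tok;    # text passes through unchanged
        }
    }
    # Close whatever is still open at the end of the input.
    while (@stack) {
        my $open = pop @stack;
        push @out, { type => 'end', tag => $open };
    }
    return @out;
}

# Tokens for: <p><b>bold<br>text</p>   (the </b> is missing)
my @sample = (
    { type => 'start', tag  => 'p' },
    { type => 'start', tag  => 'b' },
    { type => 'text',  text => 'bold' },
    { type => 'start', tag  => 'br' },
    { type => 'text',  text => 'text' },
    { type => 'end',   tag  => 'p' },
);
my @fixed = close_open_tags(@sample);

# Prints: p b br /b /p
print join(' ', map { $_->{type} eq 'end' ? "/$_->{tag}" : $_->{tag} }
                grep { $_->{type} ne 'text' } @fixed), "\n";
```

A real cleaner would need per-element rules taken from the specifications; the stack is only the bookkeeping underneath them.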
The parsing process stores tags, attributes, and text in data structures, so step 3 simply involves navigating the data structures and turning them into strings. This requires a working knowledge of both data structures (see perldsc) and the various string operators. If you are new to Perl, you might find perlop helpful. It describes Perl's string concatenation operator (.), interpolating quotes (which let you insert variables into strings without using the concatenation operator), non-interpolating quotes (which save you from lots of ugly escape characters), and here-documents, which are useful for long blocks of generated text (search for the string 'here-doc'). For converting tags to a standardized case, you may want to look at lc, uc and ucfirst.
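Here is a small sketch of step 3 for a single start tag, using only core Perl: lc for case, join and map to walk the attribute hash, and an interpolating string to assemble the result. The function names, the attribute hash layout, and the short list of empty elements are assumptions made for this example.

```perl
use strict;
use warnings;

# A partial list of elements that XHTML 1.0 defines as empty (illustration only).
my %is_empty = map { $_ => 1 } qw(br hr img input meta link);

# Escape the characters that should not appear raw in a double-quoted attribute value.
sub escape_attr {
    my ($value) = @_;
    $value =~ s/&/&amp;/g;    # must come first, or it would re-escape the others
    $value =~ s/</&lt;/g;
    $value =~ s/"/&quot;/g;
    return $value;
}

# Turn one start tag (name plus attribute hash ref) into XHTML-style text.
sub start_tag_to_string {
    my ($tag, $attr) = @_;
    my $name = lc $tag;

    # Lower-case the attribute names, then emit them in sorted order
    # so that the output is predictable.
    my %lc_attr = map { lc($_) => $attr->{$_} } keys %$attr;
    my $attrs   = join '',
        map { ' ' . $_ . '="' . escape_attr($lc_attr{$_}) . '"' }
        sort keys %lc_attr;

    my $close = $is_empty{$name} ? ' />' : '>';
    return "<$name$attrs$close";    # interpolating quotes do the concatenation
}

print start_tag_to_string('BR', {}), "\n";
# prints: <br />
print start_tag_to_string('IMG', { SRC => 'a.png', Alt => 'Fish & chips' }), "\n";
# prints: <img alt="Fish &amp; chips" src="a.png" />
```

End tags and text need the same treatment: lower-case the name for an end tag, and escape &, < and > in text before writing it out.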
Best, beth
In reply to Re: close end tag
by ELISHEVA
in thread close end tag
by Anonymous Monk