Simply closing tags is not enough to clean up HTML. Not all HTML tags are paired and placed around text. In particular, <BR> is an empty element: HTML 4.01 forbids an end tag for it, and XHTML writes it as <br />. It is used to mark line breaks, not paragraphs.

Your program will have to do three things:

  1. decide what sort of tag you have (by extracting the tag name from <tag ....>)
  2. use the tag name to decide what the cleanup procedure should be
  3. implement the corrective action

There are already several modules on CPAN that can do much of this for you, among them HTML::Tidy, which can clean up markup, and HTML::Lint, which reports problems.

If you want to do this on your own, please keep in mind that the first step, parsing HTML properly, is non-trivial, especially if the HTML is poorly formed. HTML looks like something you should be able to parse easily with a regular expression, but nested tags, optional end tags and quoted attribute values make that much more difficult than it appears. Even Andy Lester didn't try to do it on his own when he wrote HTML::Lint. He used HTML::Parser and you may want to do that as well.
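To give a feel for step 1, here is a minimal sketch of tokenizing a fragment with HTML::Parser (version 3 API). The @tokens array and the shape of each hash are my own choices for illustration, not something the module prescribes, and the sample HTML is made up.

```perl
use strict;
use warnings;
use HTML::Parser;

# Collect every token in document order as a small hash.
# The layout of these hashes is an assumption made for this sketch.
my @tokens;

my $parser = HTML::Parser->new(
    api_version => 3,
    # 'tagname' arrives lower-cased; 'attr' is a hash ref of attributes.
    start_h => [
        sub { push @tokens, { type => 'start', name => $_[0], attr => $_[1] } },
        'tagname, attr',
    ],
    end_h => [
        sub { push @tokens, { type => 'end', name => $_[0] } },
        'tagname',
    ],
    # Text may arrive in more than one chunk, so do not rely on chunk counts.
    text_h => [
        sub { push @tokens, { type => 'text', text => $_[0] } },
        'text',
    ],
);

# Sloppy input: upper-case tags, no </P>, bare <BR>.
$parser->parse('<P CLASS="intro">Hello<BR>world');
$parser->eof;

for my $t (@tokens) {
    print $t->{type} eq 'text' ? "text: $t->{text}\n" : "$t->{type}: $t->{name}\n";
}
```

Note that the parser only reports what it sees. It does not tell you that </p> is missing; noticing that is your job in steps 2 and 3.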

For step 2, you will want to take a close look at the W3C specifications for HTML 4.01 (Strict) and XHTML 1.0. They will help you decide how you should clean up each particular tag.
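One way to encode what the specifications say is a simple lookup table. The sketch below lists the elements that HTML 4.01 Strict declares EMPTY and assumes your target is XHTML 1.0, where every other element needs an explicit end tag. The name cleanup_rule and the two action strings are hypothetical, chosen only for this example.

```perl
use strict;
use warnings;

# Elements that HTML 4.01 Strict declares EMPTY (no content, no end tag).
my %is_empty = map { $_ => 1 }
    qw(area base br col hr img input link meta param);

# cleanup_rule() is a hypothetical helper: it maps a tag name to the
# corrective action that step 3 should take when writing XHTML.
sub cleanup_rule {
    my ($tagname) = @_;
    return $is_empty{ lc($tagname) } ? 'self_close' : 'needs_end_tag';
}

print "$_ => ", cleanup_rule($_), "\n" for qw(BR p IMG div);
```

A real cleaner would probably need more than two actions (for example, elements whose end tags are optional in HTML 4.01), but a hash keyed on the lower-cased tag name scales to that easily.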

The parsing process stores tags, attributes, and text in data structures, so step 3 simply involves navigating the data structures and turning them into strings. This requires a working knowledge of both data structures (see perldsc) and the various string operators. If you are new to Perl, you might find perlop helpful. It describes Perl's string concatenation operator (.), interpolating quotes (which allow you to insert variables into strings without using the concatenation operator), non-interpolating quotes (which save you from lots of ugly escape characters) and here documents, which are useful for long blocks of generated text (search perlop for the string 'here-doc'). For converting tags to a standardized case, you may want to look at lc, uc and ucfirst.
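As a rough illustration of step 3, the sketch below turns one start-tag record into an XHTML-style string using concatenation, interpolating quotes and lc. The %tag hash and the tag_to_string function are hypothetical; your parser's data structures will likely look different, and this version assumes attribute values contain no double quotes or ampersands that need escaping.

```perl
use strict;
use warnings;

# A hypothetical start-tag record of the kind a parser might hand you.
my %tag = (
    name    => 'IMG',
    attr    => { SRC => 'logo.png', ALT => 'Logo' },
    attrseq => [ 'SRC', 'ALT' ],    # original attribute order
    empty   => 1,                   # decided in step 2
);

# tag_to_string() is an illustrative helper, not part of any module.
sub tag_to_string {
    my ($t) = @_;
    my $out = '<' . lc( $t->{name} );          # concatenation plus lc
    for my $key ( @{ $t->{attrseq} } ) {
        my $value = $t->{attr}{$key};
        # Interpolating quotes; \L...\E lower-cases the attribute name.
        # Assumes $value needs no escaping.
        $out .= qq{ \L$key\E="$value"};
    }
    $out .= $t->{empty} ? ' />' : '>';
    return $out;
}

print tag_to_string( \%tag ), "\n";    # prints <img src="logo.png" alt="Logo" />
```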

Best, beth


In reply to Re: close end tag by ELISHEVA
in thread close end tag by Anonymous Monk
