Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

#!/usr/bin/perl
while(<DATA>){
    s/<(\w*)>(?![^<\w*\>]*<\w*\\>)/<\/$1>/g;
    print $_;
}
__DATA__
Susan Kempf<BR>LONGWOOD<BR> <p>Joe</p> DJ ROB-E ORLANDO<QC> BREAKZ, VOL. 2 <Hi>How r y</HI><br> [HELOO]

How can I print a closing tag when one does not exist before the start of another tag or before a '[' character? The output should look like this:
Susan Kempf<BR>LONGWOOD</BR><BR> </BR><p>Joe</p> DJ ROB-E ORLANDO<QC> BREAKZ, VOL. 2</QC> <Hi>How r y</HI><br> </br>[HELLO]

Re: close end tag
by ELISHEVA (Prior) on Aug 27, 2009 at 06:14 UTC

    Simply closing tags is not enough to clean up HTML. Not all HTML tags are paired and placed around text. In particular, <BR> is an empty element: HTML 4.01 writes it as <BR> with no closing tag, and XHTML writes it as <br/>. It is used to mark line breaks, not paragraphs.

    Your program will have to do three things:

    1. decide what sort of tag you have (by extracting the tag name from <tag ....>)
    2. use the tag name to decide what the cleanup procedure should be
    3. implement the corrective action
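
    The three steps above can be sketched as follows. This is only an illustration under stated assumptions: fix_tag is a hypothetical helper, %VOID is a deliberately tiny table of elements that never take a closing tag, and attributes are simply discarded. The regular expression pulls the name out of one already isolated start tag; it is not a substitute for a real parser.

```perl
use strict;
use warnings;

# Assumption: a tiny illustrative table of empty ("void") elements.
my %VOID = map { $_ => 1 } qw(br hr img input meta link);

# Hypothetical helper: takes one start tag such as '<BR>' plus the text
# that follows it, and returns the corrected markup (attributes are dropped).
sub fix_tag {
    my ($tag, $text) = @_;

    # Step 1: extract the tag name from <tag ...>
    my ($name) = $tag =~ /^<\s*([A-Za-z][A-Za-z0-9]*)/
        or return $tag . $text;              # not a start tag; leave it alone
    $name = lc $name;

    # Step 2: use the name to pick a cleanup procedure.
    # Step 3: carry it out.
    return "<$name />$text" if $VOID{$name}; # empty element: self-close it
    return "<$name>$text</$name>";           # paired element: add the end tag
}

print fix_tag('<BR>', 'LONGWOOD'), "\n";         # <br />LONGWOOD
print fix_tag('<QC>', ' BREAKZ, VOL. 2'), "\n";  # <qc> BREAKZ, VOL. 2</qc>
```

    Keeping the decision in a lookup table means new element names can be handled without touching the substitution logic.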

    There are already several modules on CPAN that can do much of this for you, among them HTML::Tidy, which cleans up markup, and HTML::Lint, which reports problems in it.

    If you want to do this on your own, please keep in mind that the first step, parsing HTML properly, is non-trivial, especially if the HTML is poorly formed. HTML looks like something one should be able to parse easily with some sort of regular expression, but its habit of nesting tags makes that much more difficult. Even Andy Lester didn't try to do it on his own when he wrote HTML::Lint. He used HTML::Parser, and you may want to do that as well.

    For Step 2, you will want to take a close look at the W3C specifications for HTML 4.01 (strict) and XHTML 1.0. They will help you decide how you should clean up each particular tag.

    The parsing process stores tags, attributes, and text in data structures, so step 3 simply involves navigating the data structures and turning them into strings. This requires a mastery of both data structures (see perldsc) and various string operators. If you are new to Perl, you might find perlop helpful. It contains descriptions of Perl's string concatenation operator (.), interpolating quotes (which allow you to insert variables into strings without using the concatenation operator), non-interpolating quotes (which save you from lots of ugly escape characters), and here-documents, which are useful for long blocks of generated text (look for the string 'here-doc'). For converting tags to a standardized case, you may want to look at lc, uc and ucfirst.
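
    As a rough sketch of those string operators at work, assume one parsed element is held in a plain hash. The %el layout below is hypothetical and is not what HTML::Parser itself hands you; it only stands in for "tags, attributes, and text in data structures".

```perl
use strict;
use warnings;

# Assumption: one parsed element stored as a plain hash (hypothetical layout).
my %el = (tag => 'P', attr => { class => 'intro' }, text => 'Joe');

my $name  = lc $el{tag};    # standardize the tag name's case
my $attrs = join '', map { qq( $_="$el{attr}{$_}") } sort keys %{ $el{attr} };

# The same element built with concatenation, then with interpolating quotes.
my $by_concat = '<' . $name . $attrs . '>' . $el{text} . '</' . $name . '>';
my $by_interp = "<$name$attrs>$el{text}</$name>";

# A here-document suits longer blocks of generated text.
my $by_heredoc = <<"END_HTML";
<$name$attrs>
  $el{text}
</$name>
END_HTML

print "$by_concat\n";    # <p class="intro">Joe</p>
print $by_heredoc;
```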

    Best, beth

      #!/usr/bin/perl
      while($line = <DATA>){
          @valid_entities= ('<a>','<td>','<th>','<var>','<br>');
          my %htmlenties = map { $_ =>1 } @valid_entities;
          $line =~ s/(<(\w+?)(>))/exists $htmlenties{$1} ? $1 : defined ($2) ? +"&lt;$2$3" : '&lt;'/eg;
          print $line;
      }
      __DATA__
      <Hello>Hi...<BR>how r u<br>

      Can I replace '<' with '&lt;' and '>' with '&gt;' if it is not an HTML element? My code replaces '<', but I am not able to add the replacement for '>'.
        What does this mean? Do you want to convert '<' to &lt; and '>' to &gt;?
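
        If that is the goal, one possible sketch is to capture the whole tag and emit both entities in a single replacement. The helper name escape_unknown_tags and the %allowed whitelist are assumptions for illustration, and the lookup is made case-insensitive with lc, which the code above does not do. It only handles simple <name> and </name> tags without attributes.

```perl
use strict;
use warnings;

# Assumption: an illustrative whitelist of tag names to leave untouched.
my %allowed = map { $_ => 1 } qw(a td th var br);

# Hypothetical helper: escape both angle brackets of any simple <name> or
# </name> tag whose name is not on the whitelist.
sub escape_unknown_tags {
    my ($line) = @_;
    $line =~ s{<(/?)(\w+)>}{ $allowed{ lc($2) } ? "<$1$2>" : "&lt;$1$2&gt;" }ge;
    return $line;
}

print escape_unknown_tags('<Hello>Hi...<BR>how r u<br>'), "\n";
# &lt;Hello&gt;Hi...<BR>how r u<br>
```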
Re: close end tag
by ssandv (Hermit) on Aug 27, 2009 at 05:33 UTC

    <br /> tags don't have closing tags in html. They're either written as I've done here, or written without the / which will work fine if parsed as html but fail for xhtml, which requires the closing / (but it goes inside the same tag for br). You should not be looking for </br> tags, because they should not be there.

    (Update: except, as ikegami rightly points out below, you can have <br></br> tags in xhtml.)

      <br /> tags don't have closing tags in html.

      That statement is wrong by definition. <br/> is short for <br></br>, so you're saying that something with a closing tag can't have a closing tag.

      (Technically, <br/.../ is short for <br>...</br> in HTML, but the HTML parsers used by web browsers don't support that. If they did, it would also be invalid since the BR element cannot have content or a closing tag.)

      <br></br> is perfectly valid XHTML. It's the unabbreviated form of <br/>.

        I think the intent was clear. If you prefer, read it as "don't have separate closing tags in html."