close end tag

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Re: close end tag by ELISHEVA (Prior) on Aug 27, 2009 at 06:14 UTC
Simply closing tags is not enough to clean up HTML. Not all HTML tags are paired and placed around text. In particular, `<BR>` is an empty element: in XHTML it is written `<br/>`, and in HTML 4.01 it has no closing tag at all. It is used to mark line breaks, not paragraphs.

Your program will have to do three things:

1. decide what sort of tag you have (by extracting the tag name from `<tag ....>`)
2. use the tag name to decide what the cleanup procedure should be
3. implement the corrective action

There are already several modules on CPAN that can do all of this for you, among them HTML::Tidy and HTML::Lint.

If you want to do this on your own, please keep in mind that the first step, parsing HTML properly, is non-trivial, especially if the HTML is poorly formed. HTML looks like something one should be able to parse easily with some sort of regular expression, but its habit of nesting tags makes that much more difficult. Even Andy Lester didn't try to do it on his own when he wrote HTML::Lint. He used HTML::Parser, and you may want to do that as well.

For step 2, you will want to take a close look at the W3C specifications for HTML 4.01 (strict) and XHTML 1.0. They will help you decide how you should clean up each particular tag.

The parsing process stores tags, attributes, and text in data structures, so step 3 simply involves navigating the data structures and turning them into strings. This requires a mastery of both data structures (see perldsc) and various string operators. If you are new to Perl, you might find perlop helpful. It contains descriptions of Perl's string concatenation operator (`.`), interpolating quotes (which allow you to insert variables into strings without using the concatenation operator), non-interpolating quotes (which save you from lots of ugly escape characters) and here documents, which are useful for long blocks of generated text (look for the string 'here-doc'). For converting tags to a standardized case, you may want to look at lc, uc and ucfirst.

Best, beth
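Step 3 can be sketched with core Perl alone. The example below is a minimal sketch, assuming a hypothetical hash layout (`name`, `attr`, `text`) for a parsed element and a hand-picked `%VOID` list of elements that take no content; neither comes from HTML::Parser or any other module. It uses `lc`, interpolating quotes and the concatenation operator to turn the structure back into an XHTML-style string. Attribute values are not escaped here, which a real cleanup would need to do.

```perl
use strict;
use warnings;

# Hypothetical list of elements that take no content (not exhaustive).
my %VOID = map { $_ => 1 } qw(br hr img input meta link);

# Turn one element, stored as a hash reference, into a string.
# Assumed layout: { name => 'BR', attr => { CLASS => 'x' }, text => '...' }
sub tag_to_string {
    my ($el) = @_;
    my $name = lc $el->{name};                 # standardize case
    my $attr = '';
    for my $key ( sort keys %{ $el->{attr} || {} } ) {
        $attr .= ' ' . lc($key) . qq{="$el->{attr}{$key}"};
    }
    return "<$name$attr />" if $VOID{$name};   # empty element, e.g. <br />
    my $text = defined $el->{text} ? $el->{text} : '';
    return "<$name$attr>$text</$name>";        # paired tag around text
}

print tag_to_string( { name => 'BR' } ), "\n";
print tag_to_string( { name => 'P', attr => { CLASS => 'note' }, text => 'Hi' } ), "\n";
```

The lookup table keeps the decision from step 2 (what kind of tag is this?) separate from the string building in step 3.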
Re^2: close end tag by Anonymous Monk on Aug 27, 2009 at 07:04 UTC
    #!/usr/bin/perl
    use strict;
    use warnings;

    # Tags to leave alone; the lookup is case-sensitive, so <BR> is not matched.
    my @valid_entities = ('<a>', '<td>', '<th>', '<var>', '<br>');
    my %htmlenties     = map { $_ => 1 } @valid_entities;

    while (my $line = <DATA>) {
        # Escape only the opening '<' of any tag that is not in the list.
        $line =~ s/(<(\w+?)(>))/exists $htmlenties{$1} ? $1 : "&lt;$2$3"/eg;
        print $line;
    }

    __DATA__
    <Hello>Hi...<BR>how r u<br>

Can I replace '<' with '&lt;' and '>' with '&gt;' if it is not an HTML element? I have coded the replacement for '<' but am not able to add the one for '>'.
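One way to handle both brackets is to match the whole tag and rebuild it in the replacement. This is a minimal sketch, assuming that only bare tags of the form `<name>` need handling and that tag names should compare case-insensitively; the `escape_unknown_tags` name is made up for illustration.

```perl
use strict;
use warnings;

# Tag names that should pass through unchanged.
my %valid_tag = map { $_ => 1 } qw(a td th var br);

# Escape '<' and '>' around any bare <name> tag that is not in %valid_tag.
# Assumption: tags with attributes, closing tags and stray brackets are
# left untouched; a real cleanup would use a parser such as HTML::Parser.
sub escape_unknown_tags {
    my ($line) = @_;
    $line =~ s{<(\w+)>}{ $valid_tag{ lc $1 } ? "<$1>" : "&lt;$1&gt;" }eg;
    return $line;
}

print escape_unknown_tags("<Hello>Hi...<BR>how r u<br>\n");
```

Because the tag name is captured without its brackets, the replacement can decide in one place whether to emit the literal brackets or their entities.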
Re^3: close end tag by Anonymous Monk on Aug 27, 2009 at 07:12 UTC
What does this mean? You want to convert '<' to `&lt;` and '>' to `&gt;`?
Re^4: close end tag by Anonymous Monk on Aug 27, 2009 at 08:51 UTC
Re^5: close end tag by Anonymous Monk on Aug 27, 2009 at 09:07 UTC
Re: close end tag by ssandv (Hermit) on Aug 27, 2009 at 05:33 UTC
`<br />` tags don't have closing tags in html. They're either written as I've done here, or written without the `/`, which will work fine if parsed as html but fail for xhtml, which requires the closing `/` (for br it goes inside the same tag). You should not be looking for `</br>` tags, because they should not be there. (Update: except, as ikegami rightly points out below, you can have `<br></br>` tags in xhtml.)
Re^2: close end tag by ikegami (Patriarch) on Aug 27, 2009 at 07:00 UTC
"`<br />` tags don't have closing tags in html."

That statement is wrong by definition. `<br/>` is short for `<br></br>`, so you're saying that something with a closing tag can't have a closing tag.

(Technically, `<br/.../` is short for `<br>...</br>` in HTML, but the HTML parsers used by web browsers don't support that. If they did, it would also be invalid since the BR element cannot have content or a closing tag.)

`<br></br>` is perfectly valid XHTML. It's the unabbreviated form of `<br/>`.
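The equivalence of these spellings can be illustrated by normalizing them to one form. This is a rough sketch, assuming a regular expression is acceptable for this single, content-free element with no attributes (a parser is the safer choice for anything broader); `normalize_br` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Rewrite <br>, <BR/>, <br /> and <br></br> as the XHTML-style <br />.
# A <br> carrying attributes, or a stray </br> on its own, is left as-is.
sub normalize_br {
    my ($html) = @_;
    $html =~ s{<br\s*/?>(?:\s*</br\s*>)?}{<br />}gi;
    return $html;
}

print normalize_br("a<BR>b<br/>c<br></br>d<br />e"), "\n";
```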
Re^3: close end tag by ssandv (Hermit) on Aug 27, 2009 at 15:47 UTC
I think the intent was clear. If you prefer, read it as "don't have separate closing tags in html."
Re^4: close end tag by ikegami (Patriarch) on Aug 27, 2009 at 15:51 UTC
Re^5: close end tag by ssandv (Hermit) on Aug 27, 2009 at 15:54 UTC