You're better off using one of the HTML parsing modules.

No you're not.

<rant>

This mantra is getting very old. In this particular instance the OP is asking about removing <u> and </u>, not about HTML parsing in the general case. Let's look at the issues in turn:

  1. <u> may be present in attribute values. That's like <img src="foo.jpg" alt="<u>"> for those following along at home. Yes, this is legal HTML, but every web monkey I have spoken to about this, without exception, has said "I didn't even know you could do that". So in Real Life this hardly ever arises.
  2. <u> may be present in comments. The comments may be there because part of the page has been temporarily commented out. But guess what? They probably want the replacements made in the comments, anyway!
  3. And as for CDATA sections, most people without an SGML background don't even know what they do, and don't have them in their HTML as a rule.

Wanting to strip out underline tags is therefore about as trivial as it gets. Plus, in the absence of evidence to the contrary, the person is in control of the HTML and has a pretty good idea of what's in there. In this case, there are no attribute values to worry about and the element name is only one character long. You probably don't even have to worry about the tag wrapping from one line to the next. It really doesn't get any easier than this.

An entire directory of files can be done with the following one-liner:

perl -i.bak -pe 's/<\/?u>//gi' *.html
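
If you want to see what that substitution does before letting it loose on real files, here is a minimal sketch that runs the same regex over a few strings. The strip_u helper and the sample inputs are made up for illustration; they are not part of the original question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: applies the same substitution as the
# one-liner to a single string and returns the result.
sub strip_u {
    my ($html) = @_;
    $html =~ s/<\/?u>//gi;    # remove <u> and </u>, any case
    return $html;
}

# Made-up sample inputs covering the cases discussed above.
my @samples = (
    'Some <u>underlined</u> text',      # the ordinary case
    'Mixed <U>case</U> tags',           # /i handles upper case
    '<img src="foo.jpg" alt="<u>">',    # attribute value: also stripped
    '<!-- <u>commented out</u> -->',    # comment: also stripped
    '<ul><li>untouched</li></ul>',      # <ul> is not <u>, left alone
);

print strip_u($_), "\n" for @samples;
```

Note that the pattern only matches a bare <u> or </u>. A tag carrying attributes, such as <u class="x">, would be left in place, which is fine under the assumption above that there are no attributes to worry about.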

Saying that one needs a parser to do this is just spreading FUD and making things seem more complicated than they need to be. Do simple jobs simply, and keep it simple, stupid. Save the parser approach for something hard, like converting a font-marked-up page into CSS.

</rant>


In reply to Re: Removing underline tags with regexp (is a good idea) by grinder
in thread Removing underline tags with regexp by Tricky
