in reply to Re: Removing underline tags with regexp
in thread Removing underline tags with regexp

You're better off using one of the HTML parsing modules.

No you're not.

<rant>

This mantra is getting very old. In this particular instance the OP is asking about removing <u> and </u>, not about HTML parsing in the general case. Let's look at the issues in turn:

  1. <u> may be present in attribute values. That's like <img src="foo.jpg" alt="<u>"> for those following on at home. Yes, this is legal HTML, but every web monkey I have spoken to about this, without exception, has always said "I didn't even know you could do that". So in Real Life this hardly ever arises.
  2. <u> may be present in comments. The comments may be there because part of the page has been temporarily commented out. But guess what? They probably want the replacements made in the comments, anyway!
  3. And as for CDATA sections, most people without an SGML background don't even know what they do, and don't have them in their HTML as a rule.

Wanting to strip out underline tags is therefore about as trivial as it gets. Plus in absence of evidence to the contrary, the person is in control of the HTML and has a pretty good idea of what's in there. In this case, there are no attribute values to worry about and the element name is only one character long. You probably don't even have to worry about the tag wrapping from one line to the next. It really doesn't get any easier than this.

An entire directory of files can be done with the following one-liner:

perl -i.bak -pe 's/<\/?u>//gi' *.html

Saying that one needs a parser to do this is just spreading FUD and making things seem more complicated than they need to be. Do simple jobs simply, and keep it simple, stupid. Save the parser approach for something hard, like converting a font-marked-up page into CSS.

</rant>

Re: Removing underline tags with regexp (is a good idea)
by jeffa (Bishop) on Sep 02, 2003 at 13:05 UTC
    That's a fair cop. But here is another ... instead of just showing how to strip <u> tags from an HTML document, wouldn't it be better (if CSS is a valid solution) to show how to override <u> tags instead?
    u { text-decoration: none; }
    I understand your rant grinder, but if you have been following Tricky's problems ... you see that he just keeps adding more to the list of items to be removed from his HTML. Pretty soon his list of regexes is going to be hairy. But then again, maybe this is better for newcomers ... you don't really appreciate a good thing until you really need it. ;)

    UPDATE:
    From Tricky's home node:

    I aim to complete my MSc project by Jan 2004, which involves parsing HTML pages in a proxy server, stripping tags which impede Web access for visually impaired people (including dyslexics and colour blind), and returning to the client.
    Yep, sounds like Tricky sure could use HTML::Sanitizer or HTML::Scrubber ... but i would be tempted to slay this dragon with XML::LibXML and XML::LibXSLT myself ...

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    
      Using an XML module to parse HTML is an exercise in long, excruciating, ultimately futile pain. There is a phenomenal amount of HTML that just isn't well-formed, so you aren't going to get anywhere useful this way with anything other than HTML you generate yourself.
Re: Re: Removing underline tags with regexp (is a good idea)
by antirice (Priest) on Sep 02, 2003 at 17:04 UTC

    No offense, but you are aware that you can have attributes for the u tag, right?

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
    </head>
    <body>
    <u style="text-decoration: underline overline;" title="> bob">bob</u>
    </body>
    </html>

    Hope this helps.

    antirice    
    The first rule of Perl club is - use Perl
    The ith rule of Perl club is - follow rule i - 1 for i > 1

Re: Removing underline tags with regexp (is a good idea)
by Abigail-II (Bishop) on Sep 02, 2003 at 13:02 UTC
    And as for CDATA sections, most people without an SGML background don't even know what they do, and don't have them in their HTML as a rule.

    You mean people who put SCRIPT or STYLE elements in their HTML have an SGML background? Wow, SGML is far more popular than I thought.

    Plus in absence of evidence to the contrary, the person is in control of the HTML and has a pretty good idea of what's in there.

    He may, but we don't. He didn't tell us what's in the HTML file. It's easy to just assume things, but I can play that game as well. Just assume there's no <u> present and do nothing! Assuming things without stating what you assume is pointless. Furthermore, the OP asks whether the trivial regex is the best way, or if there's another way. Hence my answer starts with "For general HTML files". Besides, if the OP is really in control of what's in the HTML files, the best answer is not to put stuff in the files that you don't want to have there.

    Abigail