Tricky has asked for the wisdom of the Perl Monks concerning the following question:

Another joyous rumble with regexps!

I'm stuck with removing underline tags from my HTML file. I want to remove the tags, leaving the text string untouched.

Is the simplest solution the best, i.e. two VERY simple regexps, /<u>/ig and /<\/u>/ig, or is there another way around this?

Many thanks,

P.S. Cheers for the help during the weekend, in particular to jeffa. If fewer lines can do the same job, so much the better...

Re: Removing underline tags with regexp
by jeffa (Bishop) on Sep 02, 2003 at 12:53 UTC
    See? We keep telling you to use a parser for this ... you keep refusing and you keep coming back with questions. Next you'll be wanting to do something to <b> and <i> tags. :)

    In Regexp conundrum you specify that you want to "remove letter-spacing and word-spacing attributes from my in-document style sheet." Sounds like you should specify a second style sheet that looks the way you want it to. If this is your site ... problem solved, if not, then you could use regular expressions to change to a different CSS file. I would use HTML::TokeParser::Simple. If this is not your site and the CSS is all inline ... then maybe you could simply "turn off" CSS. There are so many potential solutions ... this is why you should be very descriptive in what you are doing and why you are doing it.

    In Re: Problems Trying to Send E-Mail and Insert into Database, you say that "I've managed to extract image tags and re-write the changes, so I'm heading in the right direction..." No. You've been going the wrong direction ever since. :(

    In CPAN module download probs..., you do good! You tried to install a CPAN module, but it didn't work. More than likely, you tried to install a tar ball (which is for Unix) on a Windows box. Since you are using Activestate, you should be using PPM. barbie gave you a clue about this at Re: CPAN module download probs..., did you see this after you posted Re: CPAN module download probs...?

    Ok, enough tracking. ;) Where am i going with all of this? It is time for you to learn how to install PPM's. Me? I have never done it. I use Linux. But, if you invest as much energy into getting PPM's to work on your computer as you have trying to take down windmills with a lance (parsing HTML with regular expressions), you will succeed. And then you will have a nice set of modules to help you with this task.

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    
Re: Removing underline tags with regexp
by broquaint (Abbot) on Sep 02, 2003 at 10:32 UTC
    Or one simple regex replace
    s{</?u>}()ig;
    See perlre for more info.
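A minimal sketch comparing the two options raised so far, assuming a made-up sample string (the sub names and the sample are illustrative, not from the original post). On plain `<u>` and `</u>` tags, the two separate substitutions from the question and the single combined one should give the same result:

```perl
use strict;
use warnings;

# Hypothetical helper: the two simple substitutions from the question.
sub strip_u_two_pass {
    my ($html) = @_;
    $html =~ s/<u>//ig;
    $html =~ s/<\/u>//ig;
    return $html;
}

# Hypothetical helper: one substitution; "/?" makes the slash optional,
# so the same pattern covers both the opening and the closing tag.
sub strip_u_one_pass {
    my ($html) = @_;
    $html =~ s{</?u>}{}ig;
    return $html;
}

# Made-up sample input, not taken from the original poster's file.
my $sample = '<p>Some <u>underlined</u> and <U>SHOUTED</U> text.</p>';
print strip_u_one_pass($sample), "\n";
```

Either form leaves the text between the tags untouched; the single pattern just saves a pass over the string.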
    HTH

    _________
    broquaint

      To be as robust as possible with a RE

      s!<\s*/?\s*u\s*>!!ig;
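A short sketch of what the extra `\s*` pieces buy, using made-up inputs (the sub name is hypothetical). The pattern tolerates stray whitespace inside the tag, while still leaving a tag such as `<ul>` alone because only whitespace may sit between the `u` and the closing `>`:

```perl
use strict;
use warnings;

# Hypothetical wrapper around the whitespace-tolerant substitution above.
sub strip_u_loose {
    my ($html) = @_;
    $html =~ s!<\s*/?\s*u\s*>!!ig;
    return $html;
}

# Made-up inputs: spaces before ">", after "/", and mixed case.
print strip_u_loose('<u >a</ u>b<U>c</u >'), "\n";    # prints "abc"
print strip_u_loose('<ul><li>x</li></ul>'), "\n";     # list markup is untouched
```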

      cheers

      tachyon

      s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: Removing underline tags with regexp
by Elian (Parson) on Sep 02, 2003 at 12:51 UTC
    The opinions so far are about right--while in the general case you shouldn't, in the normal case you can. However, I'd consider the regex:
    s/<(\/?)u>/<!-- $1 u -->/ig;
    just so you can reasonably easily find where you nuked things just in case you ended up munching tags you didn't want to. (A good backup of the files wouldn't be out of order either...)
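A sketch of the comment-marker idea on a made-up string. Both sub names are hypothetical, and the undo step is an added assumption rather than part of the suggestion above; it restores lowercase tags, so the original case of `<U>` is not recovered:

```perl
use strict;
use warnings;

# Replace each <u> or </u> with an HTML comment that records what was there.
sub mark_u_tags {
    my ($html) = @_;
    $html =~ s/<(\/?)u>/<!-- $1 u -->/ig;
    return $html;
}

# Assumed companion: turn the markers back into (lowercase) tags.
sub unmark_u_tags {
    my ($html) = @_;
    $html =~ s/<!-- (\/?) u -->/<$1u>/g;
    return $html;
}

my $marked = mark_u_tags('<u>hi</u>');
print "$marked\n";                    # prints "<!--  u -->hi<!-- / u -->"
print unmark_u_tags($marked), "\n";   # prints "<u>hi</u>"
```

The markers are invisible in a browser but easy to find with a text search, which is the point of leaving them behind.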

    Update: Missing a slash. Fixed.

Re: Removing underline tags with regexp
by Abigail-II (Bishop) on Sep 02, 2003 at 10:43 UTC
    For general HTML files, removing tags cannot be approached with simple regexes. <u> may be present in attribute values, comments, or CDATA declared sections. You're better off using one of the HTML parsing modules.

    Abigail

      You're better off using one of the HTML parsing modules.

      No you're not.

      <rant>

      This mantra is getting very old. In this particular instance the OP is asking about removing <u> and </u>, not about HTML parsing in the general case. Let's look at the issues in turn:

      1. <u> may be present in attribute values. That's like <image src="foo.jpg" alt="<u>"> for those following on at home. Yes, this is legal HTML, but every web monkey I have spoken to about this, without exception, has always said "I didn't even know you could do that". So in Real Life this hardly ever arises.
      2. <u> may be present in comments. The comments may be there because part of the page has been temporarily commented out. But guess what? They probably want the replacements made in the comments, anyway!
      3. And as for CDATA sections, most people without an SGML background don't even know what they do, and don't have them in their HTML as a rule.
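To make the first two cases concrete, here is a small sketch of what the plain substitution does to a made-up fragment containing `<u>` in an attribute value and inside a comment (the sub name and the sample are illustrative assumptions). Whether these side effects matter depends on what is actually in the files:

```perl
use strict;
use warnings;

# Hypothetical wrapper around the plain substitution under discussion.
sub strip_u_naive {
    my ($html) = @_;
    $html =~ s/<\/?u>//gi;
    return $html;
}

# Made-up fragment: <u> in an alt attribute, in a comment, and in real markup.
my $html = '<img src="foo.jpg" alt="<u>"><!-- <u>old</u> --><u>new</u>';

print strip_u_naive($html), "\n";
# prints: <img src="foo.jpg" alt=""><!-- old -->new
```

The regex cannot tell the three contexts apart, so the attribute value is emptied and the commented-out markup is edited along with the live tags.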

      Wanting to strip out underline tags is therefore about as trivial as it gets. Plus in absence of evidence to the contrary, the person is in control of the HTML and has a pretty good idea of what's in there. In this case, there are no attribute values to worry about and the element name is only one character long. You probably don't even have to worry about the tag wrapping from one line to the next. It really doesn't get any easier than this.

      An entire directory of files can be done with the following one-liner:

      perl -i.bak -pe 's/<\/?u>//gi' *.html
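The one-liner relies on the shell for quoting and for expanding `*.html`. As far as I know, the Windows `cmd.exe` shell does neither (single quotes are not quote characters there, and wildcards are passed through unexpanded), so on an ActiveState setup a small script may be easier. A rough sketch, assuming a hypothetical helper name and a slightly different backup rule than `-i.bak` (only files that actually change get a `.bak` copy):

```perl
use strict;
use warnings;

# Hypothetical helper: strip <u> and </u> from every *.html file directly
# inside $dir. Each file that changes has its original kept as "$file.bak".
# Returns the number of files rewritten.
sub strip_u_in_dir {
    my ($dir) = @_;

    opendir my $dh, $dir or die "Cannot open directory $dir: $!";
    my @files = map { "$dir/$_" } grep { /\.html\z/ } readdir $dh;
    closedir $dh;

    my $rewritten = 0;
    for my $file (sort @files) {
        open my $in, '<', $file or die "Cannot read $file: $!";
        my $html = do { local $/; <$in> };    # slurp the whole file
        close $in;

        (my $stripped = $html) =~ s/<\/?u>//gi;
        next if $stripped eq $html;           # no underline tags found

        rename $file, "$file.bak" or die "Cannot back up $file: $!";
        open my $out, '>', $file or die "Cannot write $file: $!";
        print {$out} $stripped;
        close $out or die "Cannot close $file: $!";
        $rewritten++;
    }
    return $rewritten;
}
```

Calling `strip_u_in_dir('.')` would then process the current directory. Because each file is slurped whole, a tag that wraps across a line break would not be a special case here, although the pattern as written still would not match one.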

      Saying that one needs a parser to do this is just spreading FUD and making things seem more complicated than they need to be. Do simple jobs simply, and keep it simple, stupid. Save the parser approach for something hard, like converting a font-marked-up page into CSS.

      </rant>

        That's a fair cop. But here is another ... instead of just showing how to strip <u> tags from an HTML document, wouldn't (if CSS is a valid solution) it be better to show how to override <u> tags instead?
        u { text-decoration: none; }
        I understand your rant, grinder, but if you have been following Tricky's problems ... you see that he just keeps adding more to the list of items to be removed from his HTML. Pretty soon his list of regexes is going to be hairy. But then again, maybe this is better for newcomers ... you don't really appreciate a good thing until you really need it. ;)

        UPDATE:
        From Tricky's home node:

        I aim to complete my MSc project by Jan 2004, which involves parsing HTML pages in a proxy server, stripping tags which impede Web access for visually impaired people (including dyslexics and colour blind), and returning to the client.
        Yep, sounds like Tricky sure could use HTML::Sanitizer or HTML::Scrubber ... but i would be tempted to slay this dragon with XML::LibXML and XML::LibXSLT myself ...

        jeffa

        L-LL-L--L-LL-L--L-LL-L--
        -R--R-RR-R--R-RR-R--R-RR
        B--B--B--B--B--B--B--B--
        H---H---H---H---H---H---
        (the triplet paradiddle with high-hat)
        
        And as for CDATA sections, most people without an SGML background don't even know what they do, and don't have them in their HTML as a rule.

        You mean people who put SCRIPT or STYLE elements in their HTML have an SGML background? Wow, SGML is far more popular than I thought.

        Plus in absence of evidence to the contrary, the person is in control of the HTML and has a pretty good idea of what's in there.

        He may, but we don't. He didn't tell us what's in the HTML file. It's easy to just assume things, but I can play that game as well. Just assume there's no <u> present and do nothing! Assuming things without stating what you assume is pointless. Furthermore, the OP asks whether the trivial regex is the best way, or if there's another way. Hence my answer starts with "For general HTML files". Besides, if the OP is really in control of what's in the HTML files, the best answer is to not put stuff in the files that you don't want to have there.

        Abigail

        No offense, but you are aware that you can have attributes for the u tag, right?

        <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
        <html>
        <head>
        <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
        </head>
        <body>
        <u style="text-decoration: underline overline;" title="> bob">bob</u>
        </body>
        </html>
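One possible way to keep a regex and still cope with attributes is to let the pattern skip over quoted attribute values, so that a `>` inside quotes does not end the tag early. This is only a sketch under that assumption (the sub name is hypothetical); it still knows nothing about comments, SCRIPT or STYLE content, or unbalanced quotes, which is where a real parser earns its keep:

```perl
use strict;
use warnings;

# Hypothetical helper: strip <u ...> and </u> even when attributes are present.
sub strip_u_with_attrs {
    my ($html) = @_;
    $html =~ s{
        <\s*/?\s*u        # opening bracket, optional slash, the tag name
        (?![\w:-])        # not followed by a name character, so <ul> is left alone
        (?:
            "[^"]*"       # a double-quoted attribute value, which may contain ">"
          | '[^']*'       # a single-quoted attribute value
          | [^'">]        # anything else inside the tag
        )*
        >
    }{}gix;
    return $html;
}

print strip_u_with_attrs(
    '<u style="text-decoration: underline overline;" title="> bob">bob</u>'
), "\n";    # prints "bob"
```

The three alternatives inside the group cannot match the same first character, which keeps backtracking modest on well-formed input.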

        Hope this helps.

        antirice    
        The first rule of Perl club is - use Perl
        The ith rule of Perl club is - follow rule i - 1 for i > 1

      But oh, for the sport of it.