Wassercrats has asked for the wisdom of the Perl Monks concerning the following question:

The following should cut an HTML-style link out of $remainder, right? It doesn't, but maybe I'm missing something. $remainder =~s/\< *?a *?href.*?\>//i;

Re: Regular Expression Problem
by elusion (Curate) on Aug 29, 2002 at 02:46 UTC
    You should probably use a module for this, but I'll answer anyway (I'm a re-invent the wheel person myself). You needn't escape the <>'s. Also, there can be attributes between the "a" and the "href", so you need to use a ".". I've updated the regex and hopefully made it a little cleaner too.
    $remainder =~ s/< \s*? a \s .*? href .*?>//xi;
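A minimal sketch of what the /x modifier does in the regex above, run against a made-up single-line input (the sample string is an assumption, not taken from the original post). Under /x, literal whitespace inside the pattern is ignored, so the spaced-out form behaves the same as the compact one.

```perl
use strict;
use warnings;

# Hypothetical sample input: an anchor tag with an attribute before href.
my $remainder = '<a class="nav" href="government.php">Government</a>';

# With /x the spaces in the pattern are only for readability; this is
# equivalent to s/<\s*?a\s.*?href.*?>//i.
$remainder =~ s/< \s*? a \s .*? href .*?>//xi;

print "$remainder\n";   # prints "Government</a>"
```

Only the opening tag is removed; the link text and the closing </a> stay in place.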

    elusion : http://matt.diephouse.com

Re: Regular Expression Problem
by the pusher robot (Monk) on Aug 29, 2002 at 02:47 UTC
    <code> tags are your friend.

    That aside, you don't need to escape < and >, and \s is typically used to match whitespace, rather than having a literal space.

    It would be helpful if you would post what it actually does, as other people would then have a better idea of what problems to look for.
      I know, but I was going crazy and tried things that I didn't think would matter (and they didn't!). I didn't try the solution in the third reply, but I think I missed something elsewhere in my script, which I'm going to look for now. I'm preparing myself to feel stupid.
      It looks like there is no update or comment option, so I guess I'll reply to you again. The following assignment:
      $remainder =~s/<.*?a.*?href.*?>//i;
      cuts nothing out, leaving the following in $remainder:
      <a href="government.php" ONMOUSEOUT="imgInact('img2')" ONMOUSEOVER="imgAct('img2')"><img border="0" src="gova.jpg" width="60" height="46" name="img2" alt="Arkansas Government"></a>

      (unless I'm still missing something). I want only the </a> to remain. Am I doing something wrong?
        I tested it and it removed the <a> tag, leaving <img border="0" src="gova.jpg" width="60" height="46" name="img2" alt="Arkansas Government"></a>. Do you want the <img> tag removed as well?
        Oh! There's a line-break in there! I need the alternative to . that will match a line break!
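One way to let . match a line break is the /s modifier. The sketch below assumes the line break sits inside the opening <a> tag; the exact position in the original data is not shown in the thread, so the sample string (a shortened version of the tag) is a guess.

```perl
use strict;
use warnings;

# Assumed input: a shortened version of the tag from the post, with a
# newline inside the opening <a> tag.
my $remainder = qq{<a href="government.php"\nONMOUSEOUT="imgInact('img2')"><img src="gova.jpg"></a>};

my $without_s = $remainder;
my $with_s    = $remainder;

# Without /s, . stops at the newline, so nothing matches and nothing is cut.
$without_s =~ s/<.*?a.*?href.*?>//i;

# With /s, . also matches "\n", so the opening tag is removed.
$with_s =~ s/<.*?a.*?href.*?>//is;

print "without /s: ", ($without_s eq $remainder ? "unchanged" : "changed"), "\n";
print "with /s:    $with_s\n";   # prints: with /s:    <img src="gova.jpg"></a>
```

A negated character class such as [^>]* also matches newlines without needing /s, which is another reason to prefer it over .*? here.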
Re: Regular Expression Problem
by sauoq (Abbot) on Aug 29, 2002 at 03:01 UTC

    I think what you were trying to accomplish is:

    $remainder =~ s/<a[^>]*>//i;

    That won't remove the end tag or the stuff in between though.

    You might want to look at HTML::LinkExtor, HTML::Parser, and HTML::TokeParser to do this kind of thing reliably.

    Update: due to the discussion below, it dawned on me that I need a \b in order to avoid removing abbr tags (and one or two others that start with an "a".)

    $remainder =~ s/<a\b[^>]*>//i;

    the_pusher_robot++ for the clue.

    -sauoq
    "My two cents aren't worth a dime.";
    
      Actually, you want:
      $remainder =~ s/<a\s[^>]*>//i;
      (makes sure you only get a, not abbr, acronym, etc.)

      Update: now that I think about it, why not just: $remainder =~ s/<a\s.*?>//i; ?

      Update the second: d'oh... good catch, sauoq. How about $remainder =~ s/<a(>|\s.*?>)//i; ? Or is there a better way to do it?
        Actually, you want: . . .

        Actually, that's not what I want. Yours misses an anchor tag without attributes. As far as I know <A></A> is legal even if it isn't particularly useful. If you can show me that it isn't, I'll add the \s next time.

        I will concede that I need a \b though. :-)
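The difference between the variants discussed above can be checked with a short script. This is only a sketch: the test tags and the %removed hash are made up for illustration and are not from the original posts.

```perl
use strict;
use warnings;

# Hypothetical test tags covering the cases argued about above.
my @tags = ('<a>', '<a href="x.html">', '<abbr title="x">', '<A NAME="top">');

# The three variants from the thread, in the order they were proposed.
my @patterns = (
    [ 'no boundary' => qr/<a[^>]*>/i   ],
    [ '\b'          => qr/<a\b[^>]*>/i ],
    [ '\s'          => qr/<a\s[^>]*>/i ],
);

my %removed;   # pattern label => list of tags that pattern strips
for my $p (@patterns) {
    my ($name, $re) = @$p;
    # Apply the substitution to a copy so @tags itself is left intact.
    $removed{$name} = [ grep { (my $copy = $_) =~ s/$re// } @tags ];
    print "$name removes: @{ $removed{$name} }\n";
}
```

With these inputs, the version without a boundary also strips the <abbr> tag, the \s version leaves the bare <a> alone, and the \b version strips the three anchor tags while keeping <abbr>.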

        -sauoq
        "My two cents aren't worth a dime.";
        
Re: (nrd) Regular Expression Problem
by newrisedesigns (Curate) on Aug 29, 2002 at 05:12 UTC

    On the assumption that you will be doing this for many links in (possibly) many files, I strongly suggest abandoning the regexp in favor of a module (something in the HTML::Parser family). Those modules have been tested against circumstances that might be, or have been, overlooked (for example, the a\s or a\b problem tackled by sauoq and the pusher robot). I'm sure you'd rather strip those tags than revise a continuously growing regexp every time you hit a bump in the road.

    John J Reiser
    newrisedesigns.com
