fourmi has asked for the wisdom of the Perl Monks concerning the following question:

I have the following in a variable, $Var:
<!--ReallyGreedy--> <!--LessGreedy--> <b>Lots and lots of html markup that isn't really releveant </b> <img src=WantToKeepThis.gif> <b>More html markup that isn't really releveant </b> <img id=Changeable src=WantToGetRidOfThis.gif width=AlsoChangeable> <!--LessGreedy--> <b>Even more html markup that isn't really releveant </b> <img src=WantToKeepThisToo.gif> <!--ReallyGreedy-->
I have a series of gifs (let's say three) across a couple of hundred web pages, and I want to get rid of the central one. Its img tag always contains 'blah.gif', but the rest of the markup in the tag (attributes etc.) varies.

I have avoided the ReallyGreedy regexp /<img.*WantToGetRidOfThis.gif.*>/ by using .+? instead:

/<img.+?WantToGetRidOfThis.gif.+?>/ but it still matches from the very first instance of img (LessGreedy), so it's like 'backwardly greedy'!
How do I set up a regexp so that it does the equivalent of finding WantToGetRidOfThis.gif and then growing outward ungreedily until "<img" is found on the left, and ">" on the right?

Cheers
ant

Re: Greedy RegExp
by Roy Johnson (Monsignor) on Mar 15, 2004 at 16:00 UTC
    Can you exclude angle brackets from your wildcards? That should prevent crossing tag boundaries.
    /<img[^<]*WantToGetRidOfThis.gif[^>]*>/

    The PerlMonk tr/// Advocate
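A minimal sketch of the negated-character-class idea applied as a substitution. The helper name strip_unwanted_img and the shortened sample string are assumptions made for illustration. The sketch also uses [^<>]* on the left (a slightly stricter variant of the pattern above) and escapes the dot so it only matches a literal ".".

```perl
use strict;
use warnings;

# Hypothetical helper: removes every img tag that mentions the unwanted file.
sub strip_unwanted_img {
    my ($html) = @_;
    # [^<>]* and [^>]* cannot cross a tag boundary, so the match
    # stays inside the one tag that contains the file name.
    $html =~ s/<img[^<>]*WantToGetRidOfThis\.gif[^>]*>//g;
    return $html;
}

# Shortened stand-in for the markup in the question.
my $html = '<img src=WantToKeepThis.gif> <b>markup</b> '
         . '<img id=Changeable src=WantToGetRidOfThis.gif width=AlsoChangeable> '
         . '<img src=WantToKeepThisToo.gif>';

print strip_unwanted_img($html), "\n";
```

This assumes no attribute value contains a literal ">" inside quotes. If that can happen, a real HTML parser is the safer route.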
      Nail on the head. Thanks. I had wondered if there was an antigreedifier like .+? but this does the job brilliantly. Sometimes I wish I didn't think so hard!
      cheers
      ant
        Had wondered if there was an antigreedifier like .+?
        Oh, there is. And it's spelled .+?. But a quantifier, greedy or lazy (aka 'antigreedy'), still has to yield to more important rules, like finding the left-most match.

        Abigail
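To make the left-most rule concrete, here is a small sketch. The helper name lazy_match and the shortened sample string are assumptions for illustration. It reports where the lazy pattern's match begins, using the @- array (offsets of match starts) that Perl sets after a successful match.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the matched text and the offset at which it starts.
sub lazy_match {
    my ($html) = @_;
    return unless $html =~ /(<img.+?WantToGetRidOfThis\.gif.+?>)/;
    return ($1, $-[0]);
}

# Shortened stand-in for the markup in the question.
my $html = '<img src=Keep.gif> <b>text</b> '
         . '<img id=x src=WantToGetRidOfThis.gif width=10>';

my ($text, $start) = lazy_match($html);
print "match starts at offset $start\n";
print "matched text: $text\n";
```

The engine tries the earliest start position first, and a lazy .+? is still allowed to keep expanding until the rest of the pattern succeeds. So the match begins at the first "<img" rather than at the one nearest the file name.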

Re: Greedy RegExp
by Abigail-II (Bishop) on Mar 15, 2004 at 16:01 UTC
    Untested:
    /<img \s+ id \s* = \s* (?:'[^']*'|"[^"]*"|[-\w]+) \s+ src \s* = \s* WantToGetRidOfThis\.gif \s+ width \s* = \s* (?:'[^']*'|"[^"]*"|[-\w]+) \s* >/x;

    Of course, you're probably much better off using an HTML parser.

    Abigail
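A quick way to try the pattern above, as a sketch only: the regex is copied into a qr//x object (the layout, comments, and the name $img_re are assumptions for illustration) and run against the img tag from the question and against a reordered variant.

```perl
use strict;
use warnings;

# Same pattern as above, laid out one attribute per line.
my $img_re = qr/
    <img  \s+
    id    \s* = \s* (?:'[^']*'|"[^"]*"|[-\w]+) \s+   # quoted or bare value
    src   \s* = \s* WantToGetRidOfThis\.gif    \s+
    width \s* = \s* (?:'[^']*'|"[^"]*"|[-\w]+) \s*   # quoted or bare value
    >
/x;

my $tag       = '<img id=Changeable src=WantToGetRidOfThis.gif width=AlsoChangeable>';
my $reordered = '<img src=WantToGetRidOfThis.gif id=Changeable width=AlsoChangeable>';

print "original tag:  ", ($tag       =~ $img_re ? "match" : "no match"), "\n";
print "reordered tag: ", ($reordered =~ $img_re ? "match" : "no match"), "\n";
```

The pattern expects id, src and width in exactly that order, so it is strict by design. Pages where the attributes vary more than that would need a looser pattern or a parser.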

      Wow, I have a quick fix now, but will investigate the parser idea as soon as I can. Cheers!
Re: Greedy RegExp
by Roy Johnson (Monsignor) on Mar 15, 2004 at 16:14 UTC
    You can also use negative lookahead to ensure that you don't have (for example) multiple src attributes between the img and WantToGetRidOfThis.gif:
    /<img(?!.*src=.*src=WantToGetRidOfThis.gif).*?WantToGetRidOfThis.gif.+?>/

    The PerlMonk tr/// Advocate
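A sketch of how the negative-lookahead version behaves on markup like the question's. The helper name find_unwanted_img and the shortened sample are assumptions for illustration, and the dots in the file name are escaped here. At each "<img", the lookahead rejects the position if a src= is followed later by src=WantToGetRidOfThis.gif, which is the case for any img that sits before the target.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the img tag found by the lookahead pattern, or undef.
sub find_unwanted_img {
    my ($html) = @_;
    return $html =~ /(<img(?!.*src=.*src=WantToGetRidOfThis\.gif).*?WantToGetRidOfThis\.gif.+?>)/
        ? $1
        : undef;
}

# Shortened stand-in for the markup in the question (single line, so "." is enough).
my $html = '<img src=WantToKeepThis.gif> <b>markup</b> '
         . '<img id=Changeable src=WantToGetRidOfThis.gif width=AlsoChangeable> '
         . '<img src=WantToKeepThisToo.gif>';

my $found = find_unwanted_img($html);
print defined $found ? "found: $found\n" : "no match\n";
```

Two caveats with this sketch: "." does not match newlines without /s, and the trailing .+? needs at least one character between the file name and ">", so a tag that ends right after the file name would run on to the next ">".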
      Negative lookahead. Will need to read up on that one!
      cheers again!