scogreen has asked for the wisdom of the Perl Monks concerning the following question:

I'm having a problem with a regex that seems to be greedy even though I explicitly tell it not to be. Could a monk help me out?

$str = 'Some TextVenture</B Brothers</a>';
$str =~ s/\<.*?\>//;

I'd expect the result to be

$str = 'Some TextVenture</B Brothers'

on account of the non-greedy match, but it seems to be making a greedy match and leaving

$str = 'Some TextVenture'

Does anyone know why I'm not getting the behavior I expect?

Re: Regular expression seems to be greedy
by Eimi Metamorphoumai (Deacon) on Nov 09, 2004 at 21:40 UTC
    Basically, greediness is only one consideration in how regexps find their matches, and in this case it's not affecting your results. What perl is doing is looking for the first < character, then, when it finds it, looking forward (non-greedily) for the first > after it. Since the first < is the one in </B and the only > is at the end of </a>, everything from </B to the end of the string gets matched. There are a few ways around this, but I think what you might find best would be
    $str = 'Some TextVenture</B Brothers</a>';
    $str =~ s/<[^<]*>//;
    Update: See How will my regular expression match? for more details on why greediness isn't the only factor.
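
    To see the leftmost-first behavior described above, here is a minimal sketch that matches without substituting and reports what was matched and where. The variable names ($matched, $start, $fixed) are only for illustration; the string and both patterns come from this thread.

```perl
use strict;
use warnings;

my $str = 'Some TextVenture</B Brothers</a>';

# Match only (no substitution) so we can inspect what the engine picked.
my ($matched, $start);
if ($str =~ /<.*?>/) {
    $matched = $&;      # text of the whole match
    $start   = $-[0];   # offset at which the match begins
    print "matched '$matched' starting at offset $start\n";
}

# The negated character class cannot cross a second '<', so the attempt
# starting at '</B' fails and the engine moves on to '</a>'.
(my $fixed = $str) =~ s/<[^<]*>//;
print "$fixed\n";
```

    This should report that the non-greedy pattern matched '</B Brothers</a>' starting at offset 16, and that the character-class version leaves 'Some TextVenture</B Brothers'.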
Re: Regular expression seems to be greedy
by Joost (Canon) on Nov 09, 2004 at 21:47 UTC
    Though the match between < and > is non-greedy, Perl's regular expressions start matching from the left. This means that the left-most < char is found first, then the .*? part matches until the first (and only) >.

    It is not 100% clear what you want to match in general, but for this string

    s/<[^<>]*>//;
    gives the desired result.

    The perlre manpage has more info about greediness vs. leftmost matching. See the section "Version 8 Regular Expressions", paragraph 7, or the section on backtracking (search for "got <d is under the bar in the >").
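
    For reference, the perlre backtracking example mentioned above can be tried roughly like this. It is a sketch rather than a quote of the manpage: the variable names are made up here, and the last lines just apply the character-class pattern from this reply to the original string.

```perl
use strict;
use warnings;

# Example sentence from the perlre discussion of backtracking.
my $text = 'The food is under the bar in the barn.';

# Both matches start at the same (leftmost) "foo"; greediness only
# decides how far the capture runs before "bar" has to match.
my ($greedy)     = $text =~ /foo(.*)bar/;
my ($non_greedy) = $text =~ /foo(.*?)bar/;

print "greedy:     got <$greedy>\n";
print "non-greedy: got <$non_greedy>\n";

# The same idea applied to the string from the question.
my $str = 'Some TextVenture</B Brothers</a>';
(my $stripped = $str) =~ s/<[^<>]*>//;
print "$stripped\n";
```

    The greedy capture is 'd is under the bar in the ', the non-greedy one is 'd is under the ', and the substitution leaves 'Some TextVenture</B Brothers'.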

Re: Regular expression seems to be greedy
by Cody Pendant (Prior) on Nov 09, 2004 at 21:52 UTC
    OK, first of all, what are those backslashes doing before the < and > signs? They're not necessary. You might have inherited that from doing regexes on HTML? It looks like it. So get into the habit of not using // as your delimiters, so that the slashes in closing tags don't need escaping. Use || or ## or {}{} instead.

    Second, this regex does exactly what you told it to: find the first < it comes across and match, non-greedily, up to the first > it comes across. That's why it's matching the way it is.

    $str =~ s|(.*)<.*?>|$1|; does what you want. Match, greedily, and keep, anything up to a < (greediness is your friend in this kind of situation).

    But then so does $str =~ s|</a>$||; if all you need is to take the closing anchor tag off the end. It's not possible to know exactly what you want from this example.
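
    A small sketch of the greedy-prefix approach, wrapped in a helper called strip_last_tag (a name made up for this example), shows that it removes only the last <...> in a string. The second input is an invented test string, not something from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: remove the last <...> run in a string.
# The greedy (.*) grabs as much as it can, then backs off only far
# enough for <.*?> to match, so the tag nearest the end is the one removed.
sub strip_last_tag {
    my ($s) = @_;
    $s =~ s|(.*)<.*?>|$1|;
    return $s;
}

my $from_question = strip_last_tag('Some TextVenture</B Brothers</a>');
my $multi_tag     = strip_last_tag('<b>one</b> <i>two</i>');

print "$from_question\n";
print "$multi_tag\n";
```

    The first call should give 'Some TextVenture</B Brothers' and the second '<b>one</b> <i>two'. Note that . does not match a newline by default, so this sketch assumes single-line input.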



    ($_='kkvvttuubbooppuuiiffssqqffssmmiibbddllffss')
    =~y~b-v~a-z~s; print
Re: Regular expression seems to be greedy
by CountZero (Bishop) on Nov 09, 2004 at 22:07 UTC
    This is a good example of why one should not attack HTML or XML code with regexes but rather use some of the more sophisticated parsing modules on CPAN!

    And to boot, it is not even valid HTML: ex nihilo fit nihil as the Elder said.

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

      And to boot, it is not even valid HTML

      I think that's just another aspect of the same problem. Having got into trouble by attacking HTML with a regex...


