Regular expression seems to be greedy

scogreen has asked for the wisdom of the Perl Monks concerning the following question:

Re: Regular expression seems to be greedy by Eimi Metamorphoumai (Deacon) on Nov 09, 2004 at 21:40 UTC
Basically, greediness is only one consideration in how regexps find their matches, and in this case it's not affecting your results. What perl is doing is first looking for the first < character; when it finds it, it looks forward (non-greedily) for the first matching >. There are a few ways around this, but I think what you might find best would be `$str = 'Some TextVenture</B Brothers</a>'; $str =~ s/<[^<]*>//;` Update: See How will my regular expression match? for more details on why greediness isn't the only factor.
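To make the difference concrete, here is a minimal sketch using the sample string from this reply. The `.*?` pattern is an assumption about what the original attempt looked like (the question's own code is not shown in this thread); it is simply what "non-greedily" implies.

```perl
use strict;
use warnings;

my $str = 'Some TextVenture</B Brothers</a>';

# Assumed original attempt: non-greedy, but the match still starts at the
# leftmost '<', so it has to stretch all the way to the only '>'.
(my $lazy = $str) =~ s/<.*?>//;

# Negated character class: the match may not cross another '<',
# so only the final tag can satisfy the pattern.
(my $negated = $str) =~ s/<[^<]*>//;

print "lazy:    $lazy\n";     # lazy:    Some TextVenture
print "negated: $negated\n";  # negated: Some TextVenture</B Brothers
```

The non-greedy quantifier only decides how far a match extends once its starting point is fixed; it never makes the engine skip an earlier starting point that can be made to work.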
Re: Regular expression seems to be greedy by Joost (Canon) on Nov 09, 2004 at 21:47 UTC
Though the match between < and > is non-greedy, perl's regular expressions start matching from the left. This means that the left-most < char is found first, then the `.*?` part matches until the first (and only) >. It is not 100% clear what you want to match in general, but for this string `s/<[^<>]*>//;` gives the desired result. The perlre manpage has more info about greediness vs left-first matching. See section "Version 8 Regular Expressions", paragraph 7, or the section on backtracking (search for `got <d is under the bar in the >`).
"What should it profit a man, if he should win a flame war, yet lose his cool?"
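For readers without the manpage to hand, the search string above comes from a small example in the backtracking section. The sketch below is modelled on it from memory, so treat the exact sentence as an assumption rather than a verbatim quote; the variable names are mine.

```perl
use strict;
use warnings;

my $text = "The food is under the bar in the barn.";

# Both patterns begin matching at the same leftmost "foo";
# the quantifier only changes where the match is allowed to end.
my ($greedy) = $text =~ /foo(.*)bar/;    # runs on to the last "bar" (in "barn")
my ($lazy)   = $text =~ /foo(.*?)bar/;   # stops at the first "bar"

print "got <$greedy>\n";   # got <d is under the bar in the >
print "got <$lazy>\n";     # got <d is under the >
```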
Re: Regular expression seems to be greedy by Cody Pendant (Prior) on Nov 09, 2004 at 21:52 UTC
OK, first of all, what are those slashes doing before the < and > signs? They're not necessary. You might have inherited that from doing regexes on HTML? It looks like it. So get into the habit of not using // as your delimiters. Use || or ## or {}{} instead. Second, this regex does exactly what you told it to: find the first < it comes across and match, non-greedily, to the first > it comes across. That's why it's matching the way it is. `$str =~ s|(.*)<.*?>|$1|;` does what you want. Match, greedily, and keep, anything up to a < (greediness is your friend in this kind of situation). But then so does `$str =~ s|</a>$||;` if all you need is to take the closing anchor tag off the end. It's not possible to know exactly what you want from this example.
($_='kkvvttuubbooppuuiiffssqqffssmmiibbddllffss') =~y~b-v~a-z~s; print
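Both substitutions can be checked side by side. This is only a sketch: the sample string is borrowed from the first reply in this thread, and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Sample string taken from the first reply (an assumption about the real input).
my $str = 'Some TextVenture</B Brothers</a>';

# Greedy prefix: (.*) grabs as much as it can, then gives back only enough
# for '<.*?>' to match, which leaves the last tag in the string to be dropped.
(my $last_tag = $str) =~ s|(.*)<.*?>|$1|;

# Literal, anchored version: removes a closing anchor tag only at the very end.
(my $anchored = $str) =~ s|</a>$||;

print "$last_tag\n";   # Some TextVenture</B Brothers
print "$anchored\n";   # Some TextVenture</B Brothers
```

With | as the delimiter, the / in `</a>` needs no escaping, which is the point of avoiding // here.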
Re: Regular expression seems to be greedy by CountZero (Bishop) on Nov 09, 2004 at 22:07 UTC
This is a good example of why one should not attack HTML or XML code with regexes but rather use some of the more sophisticated parsing modules on CPAN! And to boot, it is not even valid HTML: ex nihilo fit nihil as the Elder said.
CountZero
"If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law
Re^2: Regular expression seems to be greedy by Cody Pendant (Prior) on Nov 10, 2004 at 00:19 UTC
"And to boot, it is not even valid HTML"
I think that's just another aspect of the same problem. Having got into trouble by attacking HTML with a regex...
($_='kkvvttuubbooppuuiiffssqqffssmmiibbddllffss') =~y~b-v~a-z~s; print