svenXY has asked for the wisdom of the Perl Monks concerning the following question:

Dear fellow monks,

I have a regex question. It is not a real problem, I'm just curious if it is possible at all:

It's about (non)greediness, left to right and regex replacements.

Before I post the code, I need to let you know that

  1. yes, I know that HTML code should not be parsed with regexes
  2. I only came across this problem when trying to solve this for someone else, but the regex question also applies to other examples

#!/usr/bin/perl
use strict;
use warnings;

my $string =<<'EOF';
<tr>
<td>aaa</td>
<td>aaa</td>
</tr>
<tr>
<td>NOTWANTED</td>
</tr>
<tr>
<td>bbb</td>
<td>bbb</td>
</tr>
EOF

# try to remove all table rows with NOTWANTED
$string =~ s/<tr.+?NOTWANTED.+?<\/tr>//gsm;
print $string;

This prints only the third table row. As far as I understand, the problem is that the regex engine starts at the leftmost "<tr" (the first one), finds the smallest match from there that contains NOTWANTED, and removes it, so the first row is deleted along with the unwanted one.
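To see what the substitution actually matched, the pattern can be run with a capture instead of a deletion. This is only a small sketch: the helper name first_match is made up for illustration, and the sample data is rebuilt with join rather than a heredoc.

```perl
use strict;
use warnings;

# Same three rows as in the post, one tag per line.
my $string = join "\n",
    '<tr>', '<td>aaa</td>', '<td>aaa</td>', '</tr>',
    '<tr>', '<td>NOTWANTED</td>', '</tr>',
    '<tr>', '<td>bbb</td>', '<td>bbb</td>', '</tr>', '';

# first_match is a hypothetical helper: it returns the text the
# original pattern matches, or undef if nothing matches.
sub first_match {
    my ($text) = @_;
    return $text =~ /(<tr.+?NOTWANTED.+?<\/tr>)/s ? $1 : undef;
}

my $matched = first_match($string);

# Count how many rows were swallowed by the single match.
my $rows = () = $matched =~ /<tr>/g;
print "match starts at offset ", index($string, $matched),
      " and spans $rows rows\n";
```

The match begins at the very first "<tr" and covers two rows, which is why only the third row survives the replacement.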

While this problem can easily be solved with
my @tr = split(/(?=<tr)/, $string);   # split at <tr, but do not remove <tr
@tr = grep { ! /NOTWANTED/ } @tr;     # remove the elements with NOTWANTED
print join('', @tr);
I'm still curious if it can be done with a regex replacement.
Waiting for your comments,
svenXY

Re: greedy/nongreedy regex replacement
by ikegami (Patriarch) on Dec 05, 2005 at 19:18 UTC

    The greed-removing modifier ? affects where the match ends, not where it starts. The regex engine always starts matching as early as possible. You want:

    s{<tr (?:(?!<tr).)* NOTWANTED .*? </tr>}{}xgs;

    (?:(?!<tr).)* reads as "0 or more characters which do not match the regex <tr". It is to regex what [^abc]* is to characters.

    By the way, I switched from s/// to s{}{} since / is a common character in HTML. I also removed the m switch since you use neither ^ nor $.
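For anyone who wants to try it, here is a minimal sketch that applies the pattern above to the data from the question. The wrapper name remove_unwanted_rows is an assumption made for the example, not something from the thread.

```perl
use strict;
use warnings;

# Same three rows as in the question, one tag per line.
my $string = join "\n",
    '<tr>', '<td>aaa</td>', '<td>aaa</td>', '</tr>',
    '<tr>', '<td>NOTWANTED</td>', '</tr>',
    '<tr>', '<td>bbb</td>', '<td>bbb</td>', '</tr>', '';

# remove_unwanted_rows is a hypothetical wrapper around the substitution.
# (?:(?!<tr).)* consumes characters one at a time, but refuses to step
# onto the start of another "<tr", so a match cannot span two rows.
sub remove_unwanted_rows {
    my ($html) = @_;
    $html =~ s{<tr (?:(?!<tr).)* NOTWANTED .*? </tr>}{}xgs;
    return $html;
}

print remove_unwanted_rows($string);
```

Only the row containing NOTWANTED is removed. The newline that followed the removed row is not part of the match, so an empty line stays behind in the output.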

Re: greedy/nongreedy regex replacement
by Roy Johnson (Monsignor) on Dec 05, 2005 at 19:22 UTC
    With a negative lookahead, yes:
    $string =~ s/<tr(?:(?!<tr>).)+?NOTWANTED.+?<\/tr>//gsm;
    For each char after the opening <tr, it makes sure that it's not the beginning of a nested <tr> tag.
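One detail worth checking on your own data: the lookahead here tests for the literal "<tr>", while the other reply tests for "<tr" alone. The sketch below compares the two on rows that carry attributes. Both the sample rows and the sub names are assumptions made for illustration; they do not appear in the thread.

```perl
use strict;
use warnings;

# Hypothetical sample: two of the rows carry attributes.
my $rows = '<tr class="a"><td>aaa</td></tr>'
         . '<tr class="b"><td>NOTWANTED</td></tr>'
         . '<tr><td>bbb</td></tr>';

# Lookahead on the literal "<tr>": a row opening with
# '<tr class="b">' does not stop the scan.
sub strip_with_closed_tag_guard {
    my ($html) = @_;
    $html =~ s/<tr(?:(?!<tr>).)+?NOTWANTED.+?<\/tr>//gs;
    return $html;
}

# Lookahead on "<tr" alone: any following row stops the scan,
# with or without attributes.
sub strip_with_open_tag_guard {
    my ($html) = @_;
    $html =~ s/<tr(?:(?!<tr).)+?NOTWANTED.+?<\/tr>//gs;
    return $html;
}

print strip_with_closed_tag_guard($rows), "\n";
print strip_with_open_tag_guard($rows), "\n";
```

On this sample the first version removes rows "a" and "b" together, while the second removes only row "b". On the data from the original question, where every row opens with a bare <tr>, both behave the same.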

    Caution: Contents may have been coded under pressure.
Re: greedy/nongreedy regex replacement
by svenXY (Deacon) on Dec 05, 2005 at 20:07 UTC
    cool! Thanks! I learned something new.
    Regards,
    svenXY