wine has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

Perhaps I'm not seeing the obvious mistake in my regexp code, but I think I found some strange behavior in the regexp engine:

wine@localhost]$ perl -pe 's/(")[^\1]*\1/check/g'
"strings 1" "string 2"   # input
check                    # output

I hope you'll agree this is wrong. It should yield: check check. If you transcribe the expression by removing the backreferences, you get the normal behavior:

wine@localhost]$ perl -pe 's/"[^"]*"/check/g'
"string 1" "string 2"
check check

Using $1 yields:

wine@localhost]$ perl -pe 's/(")[^$1]*$1/check/g'
"string 1" "string 2"
Unmatched [ before HERE mark in regex m/(")[ << HERE ^]*/ at -e line 1, <> line 1.

This isn't related to the ". It doesn't work with normal alpha-characters either:

wine@localhost]$ perl -pe 's/(a)[^\1]*\1/check/g'
abbbba abbbbbbbba
check

Using a | helps if the two sides differ:

# ("|'"'"') is a complex bash escape for ("|')
wine@localhost]$ perl -pe 's/("|'"'"')[^\1]*\1/check/g'
"aaa" 'aaa'
check check

But if you use different input it goes wrong again, and this really should not happen:

perl -pe 's/("|'"'"')[^\1]*\1/check/g'
"aaa" "aaa"
check

Can someone explain this behavior? (").*?\1 seems to work, but is not compliant with, let's say, sed, which seems to have the same problem:

wine@localhost]$ sed 's/\(a\)[^\1]*\1/check/g'
abbbba abbbbbbbba
check

Thank you in advance

wine

Replies are listed 'Best First'.
Re: Strange regexp behavior
by Chmrr (Vicar) on Mar 26, 2002 at 12:50 UTC

    You're assuming that \1 is magical inside of character classes -- I don't think it is. For example, the following works, and does what you want it to do:

    perl -pe 's/(")((?!\1).)*\1/check/g'

    I don't remember how to tell Perl to spit out how it parsed the regex, but that might be useful in proving my point, so I think I'll do that now. ;>

    Update: Ayup. Here's what perl says about your regex:

    Compiling REx `(")[^\1]*\1'
    size 19 first at 3
       1: OPEN1(3)
       3: EXACT <">(5)
       5: CLOSE1(7)
       7: STAR(17)
       8: ANYOF[\0\2-\377](0)
      17: REF1(19)
      19: END(0)

    Notice that it's taking the \1 to be chr(1), not a backref to the 1st value captured. By contrast, using negative look-ahead:

    Compiling REx `(")((?!\1).)*\1'
    size 24 first at 3
       1: OPEN1(3)
       3: EXACT <">(5)
       5: CLOSE1(7)
       7: CURLYM[2] {0,32767}(22)
      11: UNLESSM[-0](17)
      13: REF1(15)
      15: SUCCEED(0)
      16: TAIL(17)
      17: REG_ANY(20)
      20: SUCCEED(0)
      21: NOTHING(22)
      22: REF1(24)
      24: END(0)

    Well, it's not quite so clear what it's doing anymore (at least to me), but it works, eh?

    perl -pe '"I lo*`+$^X$\"$]!$/"=~m%(.*)%s;$_=$1;y^`+*^e v^#$&V"+@( NO CARRIER'

      chmrr++!

      I was trying to figure out the same thing, but couldn't remember how to decompile the regex.
      --
      Mike

        The easiest way is to use re 'debug'; You'll find that and much more under "Debugging regular expressions" in perldebguts.


Re: Strange regexp behavior
by jepri (Parson) on Mar 26, 2002 at 12:52 UTC
    It is written in perlre that all characters inside a character class are interpreted literally. Backslashes have no effect, so you can't backreference with them. Try putting a # char in there...

    Update: Aim at foot... special escapes are respected. Hmmm. Backreferences within a character class would be interesting.

    ____________________
    Jeremy
    I didn't believe in evil until I dated it.

      Several people pointed out that [^\1] is not interpreted as I expected, i.e. "not the content of reference 1". But a \t is interpreted as a "tab", which you would expect. I guess I was fooled by that behavior.

      wine

Re: Strange regexp behavior
by RMGir (Prior) on Mar 26, 2002 at 12:55 UTC
    The problem is that [^\1] doesn't mean what you think it does.

    perl -ne's/(.)([^\1]*)/print "$1-$2\n"/eg;'
    A234
    A-234
    AA
    A-A
    A-\
    A--\
    A-1
    A--1
    A-A
    A--A
    A1
    A-1
    See, the 2nd paren will match whatever it wants to. I'm not sure WHAT [^\1] means; it might be "anything but chr(1)", but I'm not sure.
    --
    Mike
Re: Strange regexp behavior
by pizza_milkshake (Monk) on Mar 26, 2002 at 21:24 UTC
    regular expressions are greedy. type
    perldoc perlre /greed<enter>
    
    
    perl -MLWP::Simple -e'getprint "http://parseerror.com/p"' |less

      Yes, but this regexp is not bitten by that. It would be, if the author had used (").*\1 instead. If you write s/"[^"]+"/--/g then:

      "word1" "word2"
      "word1" "word2
      "word1 "word2"

      will become

      -- --
      -- "word2
      --word2"

      --
      Alper Ersoy