BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

Why doesn't this do what I expected it to? I expected that it would match at the first transition between two non-alike characters -- in this case, the 'ab' pair.

($_ = 'aaab') =~ m[(.)([^\1])] and print "$1|$2";

a|a

Seems to be my day for missing the obvious?


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail

Replies are listed 'Best First'.
Re: Regex backreference problem.
by bart (Canon) on Oct 10, 2003 at 03:03 UTC
    Well uh... yes. Backreferences don't work, not as such, in a character class. The "\1" is interpreted as "\001", a chr(1). Try it:
    $_ = "a\001b"; m[(.)([^\1])] and print "$1|$2";
    
    |b
    
    Oops.

    There could be a way, but I'm not sure how, using the (?{...}) or (??{...}) constructs. I don't think I'd do it like that. Instead, I'd go for:

    $_ = 'aaab'; m[(.)(?!\1)(.)] and print "$1|$2";
    
    a|b
    

      Yes. I'd kinda worked that out, and the lookahead assertion works ok, but I was really asking why they don't work in character classes. I mean there are plenty of other ways of denoting a character, even the non-printable ones:

      chr(1) => \cA, \01, \x01

      It seems a shame to have to use a lookahead assertion and (.) instead of ([^\1])... and I was wondering if there was any reason other than "that's the way it is"?



        Because character classes are determined when the regex is compiled (which is different than when the Perl statement that contains the regex is compiled). There is no regex 'node' for "character class that consists of these hard-coded characters plus the characters in this backreference". The only character class regex node type is "hard-coded list of characters" that was built when the regex was compiled (not after it ran part way and figured out what $1 might end up being).

                        - tye
        I was wondering if there was any reason other than "that's the way it is"?
        I can give you a few, hopefully informed guesses.

        In your example, a backreference in a character class would seem to make sense, because you just matched one character. But what about longer strings? If your first group matched the string "a-z", would a character class with a backreference [\1] then have to match all lower case letters? Or would you prefer that it match "a", "-" and "z" only? Normal backreferences don't match as a regex; they're substrings, and they try to literally match what they're overlaid against.

        And what if your group matched just a single backslash? Surely you'd end up with an invalid regex.

        In any case, you'd clearly need to recompile the regex on the fly, once per match attempt. That isn't very fast. But it gets worse.

        A character class can typically be implemented using a bitmap (or bit array). With single byte characters, that's 256 bits. To compile a character class, you just mark all the characters that are acceptable. To match using such a character class, you just check whether the current character's bit is set in the bit array.

        This also would seem to indicate that compiling a character class likely won't be the fastest part of a regex compiler. It's pretty obvious that a test using such a character class would be a lot faster than the compilation. Just a tip to compare apples and oranges.
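The bitmap idea described above can be sketched in a few lines of Perl. This is only an illustration of the technique, assuming single byte characters; compile_class and class_matches are made-up names, and this is not a claim about how perl's regex engine is actually written.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration (names are not from perl's source).

# "Compile" a class: build a 256-bit bitmap with one bit set
# for every accepted byte value.
sub compile_class {
    my ($chars) = @_;
    my $bitmap = "\0" x 32;    # 32 bytes = 256 bits, all clear
    vec($bitmap, ord($_), 1) = 1 for split //, $chars;
    return $bitmap;
}

# "Match" against a class: a single bit lookup, no scan of the member list.
sub class_matches {
    my ($bitmap, $char) = @_;
    return vec($bitmap, ord($char), 1);
}

my $vowels = compile_class('aeiou');
print class_matches($vowels, 'e') ? "match\n" : "no match\n";
print class_matches($vowels, 'z') ? "match\n" : "no match\n";
```

Building the bitmap touches every member of the class, while each later test is one bit lookup, which is why a class whose members depend on a capture would have to be rebuilt at match time.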

Re: Regex backreference problem.
by Roger (Parson) on Oct 10, 2003 at 03:25 UTC
    ($_ = 'aaab') =~ m/(.)((?!\1).)/ and print "$1|$2";
    Prints
    a|b
    Another cool way to do the same match is with the match-time pattern interpolation technique:
    $_ = 'aaaabbbccddee'; while (m/(.)(.)(??{$1 eq $2})/g) { print "$1|$2\n"; }
    Prints...
    a|b
    b|c
    c|d
    d|e
      Not so cool, but my way :)
      perl -e '$_ = "aaabbbbccccdddd"; print "$1|$2\n" while m[(.)(?>\1*)(.)]g;'
Re: Regex backreference problem.
by Abigail-II (Bishop) on Oct 10, 2003 at 08:33 UTC
    It's already explained why this happens: the character class is created at compile time. But this gives us the right hint to "fix" the regex: use a delayed regex.
    ($_ = 'aaab') =~ m[(.)((??{"[^$1]"}))] and print "$1|$2";

    a|b

    Abigail
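One caveat with building the class from $1 at match time, echoing the backslash question raised earlier in the thread: a captured metacharacter ends up inside the brackets. A minimal sketch of one way to guard against that, using quotemeta, follows. The helper name first_transition is hypothetical and not part of any post above.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first pair of adjacent, differing
# characters, or an empty list if the string has no such pair.
sub first_transition {
    my ($str) = @_;
    # quotemeta() escapes the captured character, so a captured backslash
    # or caret is taken literally inside the delayed character class.
    return $str =~ m/(.)((??{ '[^' . quotemeta($1) . ']' }))/
        ? ($1, $2)
        : ();
}

print join('|', first_transition('aaab')), "\n";
print join('|', first_transition('\\\\b')), "\n";   # input: two backslashes, then b
```

Without the quotemeta call, a captured backslash would produce the pattern [^\], which is not a valid character class.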

      That works too. The only problem with that was this came up whilst I was trying to tackle your "N-Queens with pure regex" problem. I thought I was onto a more efficient solution, but......it didn't work ;)



Re: Regex backreference problem.
by Anonymous Monk on Oct 10, 2003 at 07:25 UTC
    Hi, I'm a novice in Perl, but I'd do something like this:
    $_ = 'aaabbbbbbbbbbbbbccccccccccccccccccccc'; print "$1|$2\n" while (m[(.)(?>\1+)(.)]g);
    Output: a|b b|c
Re: Regex backreference problem.
by ambrus (Abbot) on Oct 10, 2003 at 10:29 UTC
    I'd think of this: ($f) = /(.)/; /(.)([^$f])/ and print $1,$2;