Backreferences in negated character classes

bobf has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to construct a regex that will match the pattern 'XXY', where X and Y can be any word character, but X and Y must be different characters. In English, I want to find all occurrences where a given character is duplicated and is followed by a different character.

Attempt 1: At first I thought this would be relatively straightforward by using a backreference in a negated character class, but I seem to be missing something. Could someone please explain why the code below doesn't DWIM?

use strict;
use warnings;

my $string = 'ABCDEEFGHIJJJKLMNOOOOPQRSTUVWXXYZ';
print "matching: $string\n";

while( $string =~ m/((\w)\2[^\2])/g )
{
    print $1, "\n";
}

=pod
matching: ABCDEEFGHIJJJKLMNOOOOPQRSTUVWXXYZ
EEF
JJJ
OOO
XXY
=cut
[download]

The regex is matching 'EEF' and 'XXY', which are correct, but it is also matching 'JJJ' and 'OOO'. The negated character class isn't acting how I expected.

Attempt 2: I also tried using a negative lookahead assertion, but also without success:

while( $string =~ m/((\w)\2($!\2)\w)/g )
{
    print $1, "\n";
}
=pod
matching: ABCDEEFGHIJJJKLMNOOOOPQRSTUVWXXYZ
JJJK
OOOO
=cut
[download]

This regex matches four characters rather than the 3 I expected (since the lookahead is zero-width), and it also lacks specificity at the last position (matching 'OOOO').

The whole story: Understanding this problem is only part of my goal. I'm actually trying to match 'AABCCCCAD'. My first attempt was this:

$string = 'WXYYZAABCCCCADWWXYYYZ';

while( $string =~ m/((\w)\2([^\2])([^\2\3]){4}\1[^\2\3\4])/g )
{
    print $1, "\n";
}
[download]

but, given my first question, this obviously doesn't work.

Educate me in the ways of thine regexen, that I might faithfully wield their power.

Many thanks in advance.

Comment on Backreferences in negated character classes Select or Download Code

Replies are listed 'Best First'.
Re: Backreferences in negated character classes (two) by tye (Sage) on Dec 21, 2005 at 08:03 UTC
`[^\2]` matches any character that isn't CTRL-B (assuming ASCII). And negative look-aheads use (?!...) not ($!...). - tye	[reply] [d/l]
Re: Backreferences in negated character classes by GrandFather (Saint) on Dec 21, 2005 at 07:53 UTC
Is this what you want? `use strict; use warnings; my $string = 'ABCDEEFGHIJJJKLMNOOOOPQRSTUVWXXYZ'; print $1, "\n" while $string =~ m/(([\w])(?=\2).(?!\2).)/g;` [download] Prints: `EEF JJK OOP XXY` [download] DWIM is Perl's answer to Gödel	[reply] [d/l] [select]
Re: Backreferences in negated character classes by GrandFather (Saint) on Dec 21, 2005 at 08:18 UTC
Didn't notice the interesting bit :). Extended to handle AABCCCCAD: `use strict; use warnings; my $string = 'WXYYZAABCCCCADWWXYYYZAAbCCCCAd'; print $1, "\n" while $string =~ m/( ([\w])(?=\2).(?!\2)(.) # Match AAB (?!\2\|\3)(.) # C \4{3} # CCC \2 # A (?!\2\|\3\|\4). # D )/xg;` [download] Prints: `AABCCCCAD AAbCCCCAd` [download] DWIM is Perl's answer to Gödel	[reply] [d/l] [select]
Re: Backreferences in negated character classes by hv (Prior) on Dec 21, 2005 at 10:55 UTC
The first problem is that character classes are constructed when the regexp is compiled, and do not change during the matching process. Because of that the special syntax for backreferences in regexps does not extend inside the character class, so as tye mentioned the '`\2`' is actually treated as ASCII character 2. You could circumvent that by getting clever with deferred evals (which lets you create new regexps to be compiled while matching), but you don't want to do that - the negative lookahead is definitely the way to go. All of the extended features in the regexp engine are of the form `(?...)`, to avoid clashing with any previously valid syntax; the `($!\2)` in your example actually interpolated the `$!` error variable into your regexp - presumably the empty string. So a correct solution would look something like: `m{ (\w) \1 (?! \1) (\w) }x;` [download] .. and to the extended example: `m{ (\w) \1 (?! \1) (\w) (?! \1 \| \2) (\w) \3 \3 \3 \1 (?! \1 \| \2 \| \3) (\w) }x;` [download] This gives you something nice and regular - it would be quite easy to write code to generate the above from the example string. Here's how it might work: `my $s = 'AABCCCCAD'; our $DEBUG = 1; print +($s =~ mkre($s)) ? "ok\n" : "fail\n"; sub mkre { my $s = shift; my $index = 0; my(%seen, @elems); for (split //, $s) { if ($seen{$_}) { push @elems, "\\$seen{$_}"; } else { push @elems, sprintf '(?! %s)', join ' \| ', map "\\$_", 1 .. $in +dex if $index; $seen{$_} = ++$index; push @elems, '(\\w)'; } } my $re = join ' ', @elems; warn "$s: $re\n" if $DEBUG; qr/$re/x; }` [download] Hugo	[reply] [d/l] [select]
Re^2: Backreferences in negated character classes by bobf (Monsignor) on Dec 21, 2005 at 15:22 UTC
Many thanks for all of the responses. GrandFather, you were right on target as always; tye, thanks for cutting to the heart of the problem and pointing out a duh moment for me. :-) The first problem is that character classes are constructed when the regexp is compiled, and do not change during the matching process. Because of that the special syntax for backreferences in regexps does not extend inside the character class, so as tye mentioned the '\2' is actually treated as ASCII character 2. the ($!\2) in your example actually interpolated the $! error variable into your regexp Thank you for stating that so explicitly - that was a core piece of knowledge that I was missing. Now I understand why my incorrect negative lookahead was matching four characters. I can't believe I missed the obvious typo in the lookahead (`$!` instead of `?!`. I guess that's what I get for playing with regexen so late at night. :-) This gives you something nice and regular - it would be quite easy to write code to generate the above from the example string. Here's how it might work: Thanks for the great example for building this type of regex on the fly. I wanted to capture the whole match, so I changed it as follows: my $regex = mkre($s); while( $string =~ m/$regex/g ) { print $1, "\n"; # do other stuff } sub mkre { my $s = shift; my $index = 1; # using \1 to capture the whole match my(%seen, @elems); for (split //, $s) { if ($seen{$_}) { push @elems, "\\$seen{$_}"; } else { push @elems, sprintf '(?! %s)', join ' \| ', map "\\$_", 2 .. $in +dex if $index > 1; # changed to start with \2 $seen{$_} = ++$index; push @elems, '(\\w)'; } } my $re = join( ' ', '(', @elems, ')' ); # create \1 warn "$s: $re\n" if $DEBUG; qr/$re/x; } [download] Then I realized I could have left the sub as-is and just printed `$&` instead. :-) Thanks again for the help, and for such a elegant solution. Update: japhy++ Very nice solution - taking that approach would enable me to create much more flexible (and more powerful) regexps. Thanks for posting it.	[reply] [d/l] [select]
Re: Backreferences in negated character classes by japhy (Canon) on Dec 21, 2005 at 14:47 UTC
This code is similar to hugo's, but with a couple key differences. First, it matches anything that isn't one of the backreferences (instead of specifically `\w`), although you could change that behavior very easily. Second, it only transforms CAPITAL letters into the complex pattern, leaving lowercase letters alone. This allows you to turn "aTTeMPT" into a regex that matches words with the pattern of the word "attempt", such as "asserts" and "assents". It doesn't limit the capital letters to match letters that are shown in lowercase, but again, that's something you can add fairly easily. `sub make_pattern { my $w = shift; my %seen; $w =~ s{([A-Z])}{ if ($seen{$1}) { "\\$seen{$1}" } else { $seen{$1} = 1 + keys %seen; "(" . join("", map "(?!\\$_)", 1 .. ($seen{$1} - 1)) . ".)" } }ge; return qr/^$w$/si; }` [download] Jeff `japhy` Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and `perl` hacker How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart	[reply] [d/l] [select]