I'm trying to construct a regex that will match the pattern 'XXY', where X and Y can be any word character, but X and Y must be different characters. In English, I want to find all occurrences where a given character is duplicated and is followed by a different character.
Attempt 1: At first I thought this would be relatively straightforward by using a backreference in a negated character class, but I seem to be missing something. Could someone please explain why the code below doesn't DWIM?
use strict;
use warnings;

my $string = 'ABCDEEFGHIJJJKLMNOOOOPQRSTUVWXXYZ';
print "matching: $string\n";

while( $string =~ m/((\w)\2[^\2])/g ) {
    print $1, "\n";
}

=pod

matching: ABCDEEFGHIJJJKLMNOOOOPQRSTUVWXXYZ
EEF
JJJ
OOO
XXY

=cut
The regex is matching 'EEF' and 'XXY', which are correct, but it is also matching 'JJJ' and 'OOO'. The negated character class isn't acting how I expected.
Attempt 2: I also tried using a negative lookahead assertion, but also without success:
while( $string =~ m/((\w)\2($!\2)\w)/g ) {
    print $1, "\n";
}

=pod

matching: ABCDEEFGHIJJJKLMNOOOOPQRSTUVWXXYZ
JJJK
OOOO

=cut
This regex matches four characters rather than the three I expected (since the lookahead is zero-width), and it also lacks specificity at the last position (matching 'OOOO').
The whole story: Understanding this problem is only part of my goal. I'm actually trying to match 'AABCCCCAD'. My first attempt was this:
$string = 'WXYYZAABCCCCADWWXYYYZ';
while( $string =~ m/((\w)\2([^\2])([^\2\3]){4}\1[^\2\3\4])/g ) {
    print $1, "\n";
}

But, given my first question, this obviously doesn't work.
Educate me in the ways of thine regexen, that I might faithfully wield their power.
Many thanks in advance.
In reply to Backreferences in negated character classes by bobf