Re: Regex backreference problem.
by bart (Canon) on Oct 10, 2003 at 03:03 UTC
Well uh... yes. Backreferences don't work, not as such, in a character class. The "\1" is interpreted as "\001", a chr(1). Try it:
$_ = "a\001b";
m[(.)([^\1])] and print "$1|$2";
|b
Oops.
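Since chr(1) is invisible in the output above, a small sketch that prints character codes may make the effect easier to see. The helper name first_pair_ords is made up for this illustration; the pattern is the one from the post.

```perl
use strict;
use warnings;

# Hypothetical helper: return the character codes of $1 and $2 for the
# pattern from the post, or an empty list if nothing matches.
sub first_pair_ords {
    my ($str) = @_;
    # Inside a character class, \1 is an octal escape for chr(1),
    # not a backreference to the first group.
    return $str =~ m/(.)([^\1])/ ? (ord($1), ord($2)) : ();
}

# Here $1 turns out to be chr(1) itself and $2 is "b".
print join('|', first_pair_ords("a\001b")), "\n";

# The class does not exclude "a" at all, so "aa" is accepted.
print join('|', first_pair_ords("aaab")), "\n";
```

The first call should report the codes 1 and 98, the second 97 and 97, which shows that the class only ever rejects chr(1).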
There could be a way, but I'm not sure how, using the (?{...}) or (??{...}) constructs. I don't think I'd do it like that. Instead, I'd go for:
$_ = 'aaab';
m[(.)(?!\1)(.)] and print "$1|$2";
a|b
Yes. I'd kinda worked that out, and the lookahead assertion works OK, but I was really asking why they don't work in character classes. I mean, there are plenty of other ways of denoting a character, even the non-printable ones:
chr(1) => \cA, \01, \x01
It seems a shame to have to use a lookahead assertion and (.) instead of ([^\1])... and I was wondering if there was any reason other than "that's the way it is"?
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
I was wondering if there was any reason other than "that's the way it is"?
I can give you a few, hopefully informed guesses.
In your example, a backreference in a character class would seem to make sense, because you just matched one character. But what about longer strings? If your first group matched the string "a-z", would a character class with a backreference, [\1], then have to match all lowercase letters? Normal backreferences aren't treated as a regex; they're substrings that are matched literally against the text they're overlaid on. And what if your group matched just a single backslash? Surely you'd end up with an invalid regex. Or would you prefer that this match only "a", "-" and "z"?
In any case, you'd clearly need to recompile the regex on the fly, once per match attempt. That isn't very fast. But it gets worse.
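The two readings of a captured "a-z" can be tried out with ordinary variable interpolation standing in for the backreference. This is only a sketch; the sub names in_class_as_syntax and in_class_as_literal are invented for the illustration.

```perl
use strict;
use warnings;

# Reading 1: the captured text is class syntax, so "a-z" is a range.
# Returns 1 or 0, or undef if the resulting class does not compile.
sub in_class_as_syntax {
    my ($text, $char) = @_;
    my $ok = eval { $char =~ /^[$text]$/ };
    return defined $ok ? ($ok ? 1 : 0) : undef;
}

# Reading 2: every captured character is taken literally (\Q...\E quotes it).
sub in_class_as_literal {
    my ($text, $char) = @_;
    return $char =~ /^[\Q$text\E]$/ ? 1 : 0;
}

my $captured = 'a-z';    # stand-in for what \1 might hold
printf "syntax:  q => %d\n", in_class_as_syntax($captured, 'q');
printf "literal: q => %d, - => %d\n",
    in_class_as_literal($captured, 'q'), in_class_as_literal($captured, '-');

# A lone backslash leaves the class unterminated under reading 1.
print "backslash as syntax: ",
    (defined in_class_as_syntax('\\', 'a') ? "compiled" : "invalid regex"), "\n";
```

Under the first reading "q" is accepted because the class is a range. Under the second only "a", "-" and "z" are accepted, and a lone backslash is harmless.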
A character class can typically be implemented using a bitmap (or bit array); with single-byte characters, that's 256 bits. To compile a character class, you just mark all the characters that are acceptable. To match against it, you just check whether the current character's bit is set in the bit array.
This would also seem to indicate that compiling a character class is unlikely to be the fastest part of a regex compiler. It's pretty obvious that a test against such a character class is a lot faster than compiling it. Comparing the two is like comparing apples and oranges.
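The bitmap idea can be sketched in a few lines with Perl's vec. This is not how the regex engine is actually written, just an illustration of the principle for single-byte characters; compile_class and class_matches are hypothetical names.

```perl
use strict;
use warnings;

# "Compile" a class: set one bit per accepted byte (256 bits in total).
sub compile_class {
    my (@chars) = @_;
    my $bitmap = "\0" x 32;                  # 32 bytes = 256 bits, all clear
    vec($bitmap, ord($_), 1) = 1 for @chars; # mark each acceptable character
    return $bitmap;
}

# "Match" one character: a single bit lookup.
sub class_matches {
    my ($bitmap, $char) = @_;
    my $code = ord($char);
    return 0 if $code > 255;                 # sketch covers single-byte characters only
    return vec($bitmap, $code, 1);
}

my $vowels = compile_class(qw(a e i o u));
print class_matches($vowels, 'e') ? "match\n" : "no match\n";
print class_matches($vowels, 'x') ? "match\n" : "no match\n";
```

Building the bitmap touches every listed character, while a match is one lookup, so rebuilding the class on every match attempt would throw that advantage away.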
Re: Regex backreference problem.
by Roger (Parson) on Oct 10, 2003 at 03:25 UTC
($_ = 'aaab') =~ m/(.)((?!\1).)/ and print "$1|$2";
Prints
a|b
Another cool way to do the same match is with the match-time pattern interpolation technique:
$_ = 'aaaabbbccddee';
while (m/(.)(.)(??{$1 eq $2})/g)
{
print "$1|$2\n";
}
Prints...
a|b
b|c
c|d
d|e
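A note on why the loop above works: "$1 eq $2" yields 1 or the empty string, and (??{...}) uses that result as a sub-pattern. An equal pair therefore only matches if a literal "1" happens to follow, while an unequal pair always matches. A sketch that makes the rejection explicit, assuming a hypothetical helper named differing_pairs:

```perl
use strict;
use warnings;

# Collect adjacent pairs of differing characters.
sub differing_pairs {
    my ($str) = @_;
    my @pairs;
    # '(?!)' can never match, so equal pairs are rejected outright;
    # the empty pattern always matches, so differing pairs are kept.
    while ($str =~ m/(.)(.)(??{ $1 eq $2 ? '(?!)' : '' })/g) {
        push @pairs, "$1|$2";
    }
    return @pairs;
}

print "$_\n" for differing_pairs('aaaabbbccddee');
```

On the string from the post this should give the same four pairs. The difference shows up on input such as 'aa1b', where the original pattern would accept the equal pair "aa" because a "1" follows it.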
Not so cool, but my way :)
perl -e '$_ = q(aaabbbbccccdddd); print "$1|$2\n" while m[(.)(?>\1*)(.)]g;'
Re: Regex backreference problem.
by Abigail-II (Bishop) on Oct 10, 2003 at 08:33 UTC
It's already been explained why this happens: the character
class is created at compile time. But this gives us the right
hint to "fix" the regex: use a delayed regex.
($_ = 'aaab') =~ m[(.)((??{"[^$1]"}))] and print "$1|$2";
a|b
Abigail
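One caveat with building the class at match time is the one raised earlier in the thread: if the captured character is a backslash, "[^$1]" is not a valid class. A hedged variation that quotes the capture with quotemeta first; the sub name first_change is made up for this sketch.

```perl
use strict;
use warnings;

# Return the first adjacent pair whose second character differs from the
# first, or an empty list if there is none.
sub first_change {
    my ($str) = @_;
    # quotemeta keeps characters such as "\", "]" or "^" literal inside
    # the class that is built at match time.
    return $str =~ m/(.)((??{ '[^' . quotemeta($1) . ']' }))/ ? ($1, $2) : ();
}

print join('|', first_change('aaab')), "\n";
print join('|', first_change('\\\\x')), "\n";   # two backslashes, then "x"
```

The delayed sub-pattern is still rebuilt on every attempt, so this trades speed for convenience, as discussed above.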
Re: Regex backreference problem.
by Anonymous Monk on Oct 10, 2003 at 07:25 UTC
Hi, I'm a novice in Perl, but I think I'd do something like this:
$_ = 'aaabbbbbbbbbbbbbccccccccccccccccccccc';
print "$1|$2\n" while (m[(.)(?>\1+)(.)]g);
Output:
a|b
b|c
Re: Regex backreference problem.
by ambrus (Abbot) on Oct 10, 2003 at 10:29 UTC
I'd think of this:
($f) = /(.)/; /(.)([^\Q$f\E])/ and print $1, $2;