in reply to Re: Regex backreference problem.
in thread Regex backreference problem.

Yes. I'd kinda worked that out, and the lookahead assertion works ok, but I was really asking why backreferences don't work in character classes. I mean, there are plenty of other ways of denoting a character, even the non-printable ones:

chr(1) => \cA, \01, \x01

It seems a shame to have to use a lookahead assertion plus (.) rather than just ([^\1]), and I was wondering if there was any reason other than "that's the way it is"?


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail

Re^3: Regex backreference problem. (compile-time)
by tye (Sage) on Oct 10, 2003 at 06:57 UTC

    Because character classes are determined when the regex is compiled (which is different than when the Perl statement that contains the regex is compiled). There is no regex 'node' for "character class that consists of these hard-coded characters plus the characters in this backreference". The only character class regex node type is "hard-coded list of characters" that was built when the regex was compiled (not after it ran part way and figured out what $1 might end up being).
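    A small sketch of what this means in practice, assuming a reasonably modern perl 5. Inside a bracketed class, \1 is read at regex compile time as an octal escape for chr(1), so [^\1] quietly means "anything except chr(1)", while the lookahead form really does compare against whatever group 1 captured. The variable names are only for illustration.

```perl
use strict;
use warnings;

# Inside [...] the sequence \1 is an octal escape for chr(1), not a
# backreference, so [^\1] means "any character except chr(1)".
my $class_aa = "aa" =~ /(.)[^\1]/ ? 1 : 0;

# The lookahead form is evaluated at match time against the real capture.
my $look_aa = "aa" =~ /(.)(?!\1)./ ? 1 : 0;
my $look_ab = "ab" =~ /(.)(?!\1)./ ? 1 : 0;

print "class form on 'aa':     $class_aa\n";   # 1 (not what was intended)
print "lookahead form on 'aa': $look_aa\n";    # 0
print "lookahead form on 'ab': $look_ab\n";    # 1
```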

                    - tye

      That makes sense. Thanks.

      I still wish there was an easier way to say "Don't match this character (or this constant) here, but consume the appropriate number of characters".

      ((?!something).{length_of_something})

      Works okay when you know the length of something, but if something comes from a backreference, then you don't (always).
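      For the constant case, the idiom above might be spelled out like this. It is only a sketch: the string "foo" and the variable names are made up for the example, and the pattern is anchored at the start just to keep the results easy to read.

```perl
use strict;
use warnings;

# Hypothetical constant we do not want to see at this position.
my $something = "foo";
my $len       = length $something;

# Refuse $something here, but still consume exactly that many characters.
my $re = qr/^((?!\Q$something\E).{$len})/;

my ($from_barfoo) = "barfoo" =~ $re;   # captures "bar"
my ($from_foobar) = "foobar" =~ $re;   # no match, so undef

print "barfoo: ", defined $from_barfoo ? $from_barfoo : "(no match)", "\n";
print "foobar: ", defined $from_foobar ? $from_foobar : "(no match)", "\n";
```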

      While I'm wishing, I'd also like it if lookbehinds didn't prejudge the issue of whether the pattern was variable length. I tried to do

      ([abc]) .* (?<!\1)(something)

      but I guess that this is the same issue. It doesn't know that \1 is fixed length, even though it has already seen the capture parens and could determine that it is, because this regex could be incorporated into another one containing captures that precede the one seen, shifting the goalposts as it were.

      {Sigh} Maybe in P6, capture parens to $1, $2 etc. will be done away with in favour of capturing to named variables. You can do this now with (?{ $var = $^N }), which is useful, but it has a bad effect on performance.
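      A minimal sketch of that (?{ $var = $^N }) trick, assuming a literal pattern (so no "use re 'eval'" is needed) and a package variable, which sidesteps the closure quirks that lexicals had inside code blocks on older perls. The name $letter is just for illustration.

```perl
use strict;
use warnings;

# Package variable that the embedded code block assigns to.
our $letter;

# $^N holds the text of the most recently closed capture group,
# so the code block copies it into a named variable as the match runs.
my $matched = "xbcy" =~ /([abc])(?{ $letter = $^N })c/ ? 1 : 0;

print "matched: $matched, letter: $letter\n";   # matched: 1, letter: b
```

      Note that the code block runs whenever the engine passes it, including on attempts that later fail, so the variable is only trustworthy after a successful match.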



Re: Re: Re: Regex backreference problem.
by bart (Canon) on Oct 10, 2003 at 21:54 UTC
    I was wondering if there was any reason other than "that's the way it is"?
    I can give you a few, hopefully informed guesses.

    In your example, a backreference in a character class would seem to make sense, because you just matched one character. But what about longer strings? If your first group matched the string "a-z", would a character class with a backreference [\1] then have to match all lower case letters? Normal backreferences aren't matched as a regex. Instead, they're substrings, compared literally against the text they're overlaid on.

    And what if your group matched just a single backslash? Surely you'd end up with an invalid regex. Or would you instead prefer that, in the "a-z" case, the class match only "a", "-" and "z"?
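    To make the ambiguity concrete, here is a rough sketch using ordinary interpolation of a known string, since a real backreference can't be placed in a class. The variable $s stands in for "what group 1 captured". It shows the literal behaviour of \1, then the two possible readings of the string inside brackets.

```perl
use strict;
use warnings;

my $s = "a-z";   # stand-in for the text a capture group matched

# A normal backreference compares literally against the captured text.
my $literal_same = "a-z a-z" =~ /^(\S+) \1$/ ? 1 : 0;   # 1
my $literal_diff = "a-z m"   =~ /^(\S+) \1$/ ? 1 : 0;   # 0

# Reading 1: the text is interpreted as class syntax, so "a-z" is a range.
my $range_m = "m" =~ /^[$s]$/ ? 1 : 0;                  # 1

# Reading 2: the text is quoted first, so only "a", "-" and "z" match.
my $quoted_m    = "m" =~ /^[\Q$s\E]$/ ? 1 : 0;          # 0
my $quoted_dash = "-" =~ /^[\Q$s\E]$/ ? 1 : 0;          # 1

print "literal: $literal_same $literal_diff\n";
print "range:   $range_m\n";
print "quoted:  $quoted_m $quoted_dash\n";
```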

    In any case, you'd clearly need to recompile the regex on the fly, for every match attempt. That isn't very fast. But it gets worse.

    A character class can typically be implemented using a bitmap (or bit array), with single byte characters, that's 256 bits. To compile a character class, you just mark all the characters that are acceptable. To match using such a character class, just check to see if this character's bit is set in its bit array.
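    As an illustration only (this is not how perl's regex engine is actually written), the bitmap idea can be sketched in Perl with vec(). The function names compile_class and class_matches are made up for the example, and it assumes single-byte characters.

```perl
use strict;
use warnings;

# "Compile" a class: a 256-bit (32-byte) bitmap, one bit per character,
# with a bit set for every acceptable character.
sub compile_class {
    my ($chars) = @_;
    my $bits = "\0" x 32;
    vec($bits, ord($_), 1) = 1 for split //, $chars;
    return $bits;
}

# Matching is a single bit lookup in the precomputed bitmap.
sub class_matches {
    my ($bits, $char) = @_;
    return vec($bits, ord($char), 1);
}

my $vowels = compile_class("aeiou");
print "e: ", class_matches($vowels, "e"), "\n";   # e: 1
print "z: ", class_matches($vowels, "z"), "\n";   # z: 0
```

    The compile step loops over every character in the class, while the match step is one lookup, which is why rebuilding the bitmap on every match attempt would be the expensive part.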

    This also suggests that compiling a character class is unlikely to be the fastest part of a regex compiler. A test against such a character class is obviously a lot faster than compiling it, so comparing the two is rather like comparing apples and oranges.