chimni has asked for the wisdom of the Perl Monks concerning the following question:


Can anyone explain how this regex works.
What i understand is
-It's a conditional regex that searches for words of the form "$x$x" or "$x$y$y$x":
% simple_grep '^(\w+)(\w+)?(?(2)\2\1|\1)$' /usr/dict/words output : beriberi coco couscous deed ... toot toto tutu

1 -Is'nt the second \w+ useless as the first one would match all words
2-This is the conditional regex (?(2)\2\1|\1).
What is the the condition (2)? and secondly what is the ? before the condition .Is that to specify minimal match?
thanks
chimni

Replies are listed 'Best First'.
Re: Conditional regex
by PodMaster (Abbot) on Jan 12, 2004 at 10:17 UTC
    use YAPE::Regex::Explain; die YAPE::Regex::Explain->new( '^(\w+)(\w+)?(?(2)\2\1|\1)$' )->explain; __END__ The regular expression: (?-imsx:^(\w+)(\w+)?(?(2)\2\1|\1)$) matches as follows: NODE EXPLANATION ---------------------------------------------------------------------- (?-imsx: group, but do not capture (case-sensitive) (with ^ and $ matching normally) (with . not matching \n) (matching whitespace and # normally): ---------------------------------------------------------------------- ^ the beginning of the string ---------------------------------------------------------------------- ( group and capture to \1: ---------------------------------------------------------------------- \w+ word characters (a-z, A-Z, 0-9, _) (1 or more times (matching the most amount possible)) ---------------------------------------------------------------------- ) end of \1 ---------------------------------------------------------------------- ( group and capture to \2 (optional (matching the most amount possible)): ---------------------------------------------------------------------- \w+ word characters (a-z, A-Z, 0-9, _) (1 or more times (matching the most amount possible)) ---------------------------------------------------------------------- )? end of \2 (NOTE: because you're using a quantifier on this capture, only the LAST repetition of the captured pattern will be stored in \2) ---------------------------------------------------------------------- (?(2) if back-reference \2 matched, then: ---------------------------------------------------------------------- \2 what was matched by capture \2 ---------------------------------------------------------------------- \1 what was matched by capture \1 ---------------------------------------------------------------------- | else: ---------------------------------------------------------------------- \1 what was matched by capture \1 ---------------------------------------------------------------------- ) end of conditional on \2 ---------------------------------------------------------------------- $ before an optional \n, and the end of the string ---------------------------------------------------------------------- ) end of grouping ---------------------------------------------------------------------- E:\>perl -Mre=debug -le"print 66666 if q[aa] =~ /^(\w+)(\w+)?(?(2)\2\1 +|\1)$/" Freeing REx: `,' Compiling REx `^(\w+)(\w+)?(?(2)\2\1|\1)$' size 34 first at 2 synthetic stclass `ANYOF[0-9A-Z_a-z]'. 1: BOL(2) 2: OPEN1(4) 4: PLUS(6) 5: ALNUM(0) 6: CLOSE1(8) 8: CURLYX[1] {0,1}(17) 10: OPEN2(12) 12: PLUS(14) 13: ALNUM(0) 14: CLOSE2(16) 16: WHILEM(0) 17: NOTHING(18) 18: GROUPP2(20) 20: IFTHEN(28) 22: REF2(24) 24: REF1(33) 26: LONGJMP(32) 28: IFTHEN(32) 30: REF1(33) 32: TAIL(33) 33: EOL(34) 34: END(0) floating `'$ at 1..2147483647 (checking floating) stclass `ANYOF[0-9A- +Z_a-z]' anchored(BOL) minlen 1 Guessing start of match, REx `^(\w+)(\w+)?(?(2)\2\1|\1)$' against `aa' +... Found floating substr `'$ at offset 2... Does not contradict STCLASS... Guessed: match at offset 0 Matching REx `^(\w+)(\w+)?(?(2)\2\1|\1)$' against `aa' Setting an EVAL scope, savestack=3 0 <> <aa> | 1: BOL 0 <> <aa> | 2: OPEN1 0 <> <aa> | 4: PLUS ALNUM can match 2 times out of 32767... Setting an EVAL scope, savestack=3 2 <aa> <> | 6: CLOSE1 2 <aa> <> | 8: CURLYX[1] {0,1} 2 <aa> <> | 16: WHILEM 0 out of 0..1 cc=140fb88 Setting an EVAL scope, savestack=8 2 <aa> <> | 10: OPEN2 2 <aa> <> | 12: PLUS ALNUM can match 0 times out of 32767... Setting an EVAL scope, savestack=8 failed... restoring \2..\2 to undef failed, try continuation... 2 <aa> <> | 17: NOTHING 2 <aa> <> | 18: GROUPP2 2 <aa> <> | 20: IFTHEN 2 <aa> <> | 30: REF1 failed... failed... failed... 1 <a> <a> | 6: CLOSE1 1 <a> <a> | 8: CURLYX[1] {0,1} 1 <a> <a> | 16: WHILEM 0 out of 0..1 cc=140fb88 Setting an EVAL scope, savestack=8 1 <a> <a> | 10: OPEN2 1 <a> <a> | 12: PLUS ALNUM can match 1 times out of 32767... Setting an EVAL scope, savestack=8 2 <aa> <> | 14: CLOSE2 2 <aa> <> | 16: WHILEM 1 out of 0..1 cc=140fb88 2 <aa> <> | 17: NOTHING 2 <aa> <> | 18: GROUPP2 2 <aa> <> | 20: IFTHEN 2 <aa> <> | 22: REF2 failed... failed... failed... restoring \2..\2 to undef failed, try continuation... 1 <a> <a> | 17: NOTHING 1 <a> <a> | 18: GROUPP2 1 <a> <a> | 20: IFTHEN 1 <a> <a> | 30: REF1 2 <aa> <> | 33: EOL 2 <aa> <> | 34: END Match successful! 66666 Freeing REx: `^(\w+)(\w+)?(?(2)\2\1|\1)$'
    If you look at the output of -Mre=debug, you'll see that the REx engine first matches $1='aa', then backtracks until $1='a', so that it can then match \1. You can read more about backtracking in perlre.

    I wouldn't say the 2nd \w+ is useless because the intent seems to be to try to match stuff like "OneTwoTwoOne".

    MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
    I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
    ** The third rule of perl club is a statement of fact: pod is sexy.

Re: Conditional regex
by Abigail-II (Bishop) on Jan 12, 2004 at 10:05 UTC
    Is'nt the second \w+ useless as the first one would match all words
    No. Perl will try to match the entire regex, backtracking and reducing the match for \w+ if necessary.
    This is the conditional regex (?(2)\2\1|\1). What is the the condition (2)? and secondly what is the ? before the condition .Is that to specify minimal match?
    The condition (2) is true if the second set of parens matched (that is, if the second (\w+) matched), else, it's false. The ? isn't a token by itself - it's part of entire (?(condition)yes-pattern|no-pattern) syntactical construct. It's just like while, where the h doesn't carry a meaning by itself either.

    Abigail

Re: Conditional regex
by runrig (Abbot) on Jan 12, 2004 at 10:30 UTC
    The (?(2)...) conditional turns out to be useless, because if the second set of parens don't match, then \2\1 is the same as \1, so it may as well just be \2\1 (without the conditional logic). And the (\w+)? may as well be (\w*), I think.

    Update: In an earlier version of perl, the parens in (\w+)? may actually capture something even when the whole (\w+)? matches the empty string, which may be the reason for the conditional, but changing it to (\w*) would make the conditional totally unnecessary.

Re: Conditional regex
by Hena (Friar) on Jan 12, 2004 at 10:05 UTC
    Well, i can't figure it out myself, but add use re "debug"; in the script and it will cause regex to print out debug, which might give more clues.

    Ps. The (2) checks whether the second backreference exists. Why it isn't (\2) i don't know (it just reads so in camel book:)).
Re: Conditional regex
by Roger (Parson) on Jan 12, 2004 at 12:21 UTC
    You probably want this instead ...
    #!/usr/local/bin/perl -w use strict; my @words = qw/ beriberi coco couscous deed toot toto tutu assa saaa /; for (@words) { print "$_\n" if /^(\w+)\1|(\w+)(\w+)\3\2$/ # $x$x | $x$y$y$x }
      /^(\w+)\1|(\w+)(\w+)\3\2$/ matches ssaa.

      Abigail

        But I thought 'ssaa' matches the '$x$y' pattern, where $x is 'ss' and $y is 'aa', that still fits the requirement?

        Thanks Abigail-II, I have realized my mistake. Here's an updated version that works -
        #!/usr/local/bin/perl -w use strict; my @words = qw/ beriberi coco couscous deed toot toto tutu assa ssaa /; for (@words) { print "$_\n" if /^(\w+)\1$/ || /^(\w+)(\w+)\2\1$/; }

        And output -
        beriberi coco couscous deed toot toto tutu assa

Re: Conditional regex
by Roy Johnson (Monsignor) on Jan 12, 2004 at 15:45 UTC
    Seems like it could be a little more straightforward:
    /^(\w+)(\w*)\2\1$/

    The PerlMonk tr/// Advocate