Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I am trying to write a script to take input in the form of "ABBCDE", and output all words which match that pattern. For example, "LOOKED" matches (note the repeated 'O', corresponding to the repeated 'B', and that all the other characters are mapped uniquely). Presently, my script takes the input an generates a regex. This is what it comes up with:

/^(\w)([^\1])(\2)([^\1\2\3])([^\1\2\3\4])([^\1\2\3\4\5])\s/

What I was trying to express with this is:

  1. . Any character...
  2. . ...followed by anything that isn't the first character...
  3. . ...followed by what was found in (2)
  4. . ...followed by anything that isn't the first, second, or third match...

And so forth. The problem is that this regex also matches words like "SEEMED". This is bad, since that third E doesn't correspond to the input pattern. I am unsure why my regex is failing. Can anyone shed some light?

- Anonymous Monk

Replies are listed 'Best First'.
Re: Matching Character Patterns With Backreferences
by Eimi Metamorphoumai (Deacon) on Dec 05, 2004 at 22:09 UTC
    I believe the backreferences don't work inside character classes. But you can use them inside negative lookahead assertions, thus:
    /^(\w)(?!.{0,4}\1)(\w)\2(?!.{0,2}\2)(\w)(?!.{0,1}\3)(\w)(?!\4)(\w)\s/;
    That is, find one letter and make sure it doesn't recur in the rest of the pattern. Then find the next, and make sure it's repeated immediately after, but not later on. And so on.
Re: Matching Character Patterns With Backreferences
by Anonymous Monk on Dec 05, 2004 at 22:12 UTC

    The problem is that [^\1] is interpreted as [^\001] and \001 means ascii charachter 1 in octal format.

    A correct regex would be: (using look-aheads):

    m/^ # Match the start of the line (\w) # Match the first letter, capture it in $1 (or \1 if you +re-use it in the regex) ( # Open capturing group for $2 (?!\1) # Look ahead and verify that the next charachter isn't +the same as the first one \w # Capture a word-charachter (because of the look ahead +this can't be the same one as the first) ) # Close capture $2 \2 # Match the second charachter another time (no need to ca +pture this) ( # Open capture $3 (?! # Open look-ahead \1 # Charachter 1 | # Or (don't use a charachter class, else you run in +the same trouble as before) \2 # Charachter 2 ) # Close look-ahed \w # Match a word-charachter ) # Close capture $3 ... /x;

    I'm sure you see the pattern in this so I'm not going to go on explenaning.

    The complete regex would be (tested against LOOKED and SEEMED):
    m/^ (\w) ((?!\1)\w)\2 ((?!\1|\2)\w) ((?!\1|\2|\3)\w) ((?!\1|\2|\3|\4)\w) \s$/x
    x-modifier is used to make the groups a little bit more visible

      Beautiful, thank you fellow Anonymous Monk. I wish the behaviour of \1 inside a character class was better documented. Does anyone know of a good listing of all these kinds of gotchas?

        You might try perldoc perltrap which is a pretty good centralized source of 'gotchas' to avoid.

Re: Matching Character Patterns With Backreferences
by davido (Cardinal) on Dec 06, 2004 at 03:41 UTC

    You can construct character classes (and negated character classes) based on backreferences on the fly with the (??{ .... }) regexp subexpression. Here's an example:

    use strict; use warnings; my $string = "that bab"; print "Match: $1 => $2" if $string =~ m/(that).+((??{ '[' . $1 . ']' }))/; __OUTPUT__ Match: that => a

    The above example matches 'that', and then uses the characters in 'that' as characters in a character class constructed within the (??{...}) subexpression. Since 'a' was one of the characters within 'that', it is matched, and captured into $2.

    This gets a little tripped up if the first character held in the backreference is also a character class metacharacter, such as ^ (which would construct a negated character class), or if there is an embedded hyphen, which would attempt to construct a range, as in 'a-z'. And be careful with embedded backslashes because within a character class they might look like '\d' (the combination used for matching any digit character. So this technique is somewhat sensitive to the type of data you're looking at, but with reasonably sanitary strings it can be effective.


    Dave

Re: Matching Character Patterns With Backreferences
by gaal (Parson) on Dec 05, 2004 at 21:02 UTC
    I'm not a regexp wizard, but playing around with this I've found that if you use $1 and friends, they are populated until a submatch fails (this does not happen with \1 form). No idea why this doesn't fail the whole expression, but at least it suggests a workaround until someone explains what's really going on:

    my @matches = $word =~ /^(\w)([^$1])($2)([^$1$2$3])([^$1$2$3$4])([^$1$ +2$3$4$5])/; print $word if @matches == 6; # or length $word, whatever.

    BTW, what's that \s doing in your regexp?

      I'm having trouble getting this to work with the $1 form - it's simply not matching anything the moment I make that change.

      The \s is simply because each line of the dictionary file contains a word followed by whitespace. Checking for the space stops me from printing out "TESTING" when I only want "TEST" :-)

        You should be using \1 and friends in the left hand side of the s/// operator.
Re: Matching Character Patterns With Backreferences
by dimar (Curate) on Dec 06, 2004 at 07:04 UTC

    Sometimes the most "fun" code to ponder is actually the worst code to maintain, or try to figure out after you haven't seen it for a long time.

    I think this thread is a case in point. It is definitely *informative* to learn how to solve this problem using a RegEx, but that begs the question of whether you really *want* to use a RegEx.

    I wouldn't, because, as you said yourself, it looks a bit ugly. It seems easier to just recognize that the word "ABBCBD" is a simple substitution cypher for "SEEMED" ... and then just try to write a very simple subroutine that helps us find different ways to "decrypt" our pattern "ABBCBD" (or whatever your pattern is) using an English dictionary.

    It is less fun to do it the non RegEx way, but (arguably) easier to understand and debug (YMMV).

    Here's one way to do it ... TMTOWTDI.

    use strict; use warnings; my $sPattern; my @aPlainText = qw(looked seemed liked cooked booked baked balked); ### $sPattern = "ABBCBD"; ### try this one later $sPattern = "ABBCED"; while(<@aPlainText>){ print "$_\n" if IsEqual($sPattern,$_); } sub IsEqual { my $sPattern = shift || die "missing argument"; my $sPlain = shift || die "missing argument"; my $iLen = 0; my %hSubstit1 = (); my %hSubstit2 = (); if(length($sPattern) != length($sPlain)){return 0;} else{$iLen = length($sPattern)-1}; for (0 .. $iLen){ my $chrCiphr = lc(substr($sPattern,$_,1)); my $chrPlain = lc(substr($sPlain,$_,1)); ### require a 1 to 1 mapping between plain chars and pattern c +hars $hSubstit1{$chrPlain} = $chrCiphr unless exists $hSubstit1 +{$chrPlain}; $hSubstit2{$chrCiphr} = $chrPlain unless exists $hSubstit2 +{$chrCiphr}; if($chrCiphr ne $hSubstit1{$chrPlain }){return 0;} if($chrPlain ne $hSubstit2{$chrCiphr }){return 0;} } return(1); }###end_sub
Re: Matching Character Patterns With Backreferences
by Anonymous Monk on Dec 05, 2004 at 20:53 UTC

    I just realized I should probably paste the code I had that generated that.I should mention that $d is the path to the dictionary file (with the words to match) and $p is the input pattern (ABBCDE above).

    This is pretty ugly

    my %charhash; @letters = split(//,$p); $string = ""; $invert = ""; $count = 1; foreach $letter (@letters) { if(defined $charhash{$letter}) { $string.="(\\$charhash{$letter})"; } else { $charhash{$letter} = $count; if($invert) { $string .= "([^$invert])"; } else { $string .= "(\\w)"; } } $invert .= "\\$count"; $count++; } print "Searching for pattern /^$string\\s/\n"; open DICT, $d; while(<DICT>) { print if /^$string\s/i; }
•Re: Matching Character Patterns With Backreferences
by merlyn (Sage) on Dec 06, 2004 at 12:05 UTC
    Here's my implementation of that from 1999. Lowercase letters stood for themselves. Uppercase letters were the wild ones.