Matching Character Patterns With Backreferences

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Matching Character Patterns With Backreferences by Eimi Metamorphoumai (Deacon) on Dec 05, 2004 at 22:09 UTC
I believe the backreferences don't work inside character classes. But you can use them inside negative lookahead assertions, thus: `/^(\w)(?!.{0,4}\1)(\w)\2(?!.{0,2}\2)(\w)(?!.{0,1}\3)(\w)(?!\4)(\w)\s/;` [download] That is, find one letter and make sure it doesn't recur in the rest of the pattern. Then find the next, and make sure it's repeated immediately after, but not later on. And so on.	[reply] [d/l]
Re: Matching Character Patterns With Backreferences by Anonymous Monk on Dec 05, 2004 at 22:12 UTC
The problem is that `[^\1]` is interpreted as `[^\001]` and `\001` means ascii charachter 1 in octal format. A correct regex would be: (using look-aheads): m/^ # Match the start of the line (\w) # Match the first letter, capture it in $1 (or \1 if you +re-use it in the regex) ( # Open capturing group for $2 (?!\1) # Look ahead and verify that the next charachter isn't +the same as the first one \w # Capture a word-charachter (because of the look ahead +this can't be the same one as the first) ) # Close capture $2 \2 # Match the second charachter another time (no need to ca +pture this) ( # Open capture $3 (?! # Open look-ahead \1 # Charachter 1 \| # Or (don't use a charachter class, else you run in +the same trouble as before) \2 # Charachter 2 ) # Close look-ahed \w # Match a word-charachter ) # Close capture $3 ... /x; [download] I'm sure you see the pattern in this so I'm not going to go on explenaning. The complete regex would be (tested against LOOKED and SEEMED): `m/^ (\w) ((?!\1)\w)\2 ((?!\1\|\2)\w) ((?!\1\|\2\|\3)\w) ((?!\1\|\2\|\3\|\4)\w) \s$/x` x-modifier is used to make the groups a little bit more visible	[reply] [d/l] [select]
Re^2: Matching Character Patterns With Backreferences by Anonymous Monk on Dec 05, 2004 at 23:49 UTC
Beautiful, thank you fellow Anonymous Monk. I wish the behaviour of \1 inside a character class was better documented. Does anyone know of a good listing of all these kinds of gotchas?	[reply]
Re^3: Matching Character Patterns With Backreferences by dimar (Curate) on Dec 06, 2004 at 05:55 UTC
You might try `perldoc perltrap` which is a pretty good centralized source of 'gotchas' to avoid.	[reply] [d/l]
Re: Matching Character Patterns With Backreferences by davido (Cardinal) on Dec 06, 2004 at 03:41 UTC
You can construct character classes (and negated character classes) based on backreferences on the fly with the `(??{ .... })` regexp subexpression. Here's an example: `use strict; use warnings; my $string = "that bab"; print "Match: $1 => $2" if $string =~ m/(that).+((??{ '[' . $1 . ']' }))/; __OUTPUT__ Match: that => a` [download] The above example matches 'that', and then uses the characters in 'that' as characters in a character class constructed within the (??{...}) subexpression. Since 'a' was one of the characters within 'that', it is matched, and captured into $2. This gets a little tripped up if the first character held in the backreference is also a character class metacharacter, such as ^ (which would construct a negated character class), or if there is an embedded hyphen, which would attempt to construct a range, as in 'a-z'. And be careful with embedded backslashes because within a character class they might look like '\d' (the combination used for matching any digit character. So this technique is somewhat sensitive to the type of data you're looking at, but with reasonably sanitary strings it can be effective. Dave	[reply] [d/l] [select]
Re: Matching Character Patterns With Backreferences by gaal (Parson) on Dec 05, 2004 at 21:02 UTC
I'm not a regexp wizard, but playing around with this I've found that if you use `$1` and friends, they are populated until a submatch fails (this does not happen with `\1` form). No idea why this doesn't fail the whole expression, but at least it suggests a workaround until someone explains what's really going on: `my @matches = $word =~ /^(\w)([^$1])($2)([^$1$2$3])([^$1$2$3$4])([^$1$ +2$3$4$5])/; print $word if @matches == 6; # or length $word, whatever.` [download] BTW, what's that `\s` doing in your regexp?	[reply] [d/l] [select]
Re^2: Matching Character Patterns With Backreferences by Anonymous Monk on Dec 05, 2004 at 21:12 UTC
I'm having trouble getting this to work with the $1 form - it's simply not matching anything the moment I make that change. The \s is simply because each line of the dictionary file contains a word followed by whitespace. Checking for the space stops me from printing out "TESTING" when I only want "TEST" :-)	[reply]
Re^3: Matching Character Patterns With Backreferences by BUU (Prior) on Dec 06, 2004 at 00:42 UTC
You should be using `\1` and friends in the left hand side of the `s///` operator.	[reply] [d/l] [select]
Re^4: Matching Character Patterns With Backreferences by gaal (Parson) on Dec 06, 2004 at 17:37 UTC
Re^5: Matching Character Patterns With Backreferences by BUU (Prior) on Dec 07, 2004 at 04:13 UTC
Some notes below your chosen depth have not been shown here
Re: Matching Character Patterns With Backreferences by dimar (Curate) on Dec 06, 2004 at 07:04 UTC
Sometimes the most "fun" code to ponder is actually the worst code to maintain, or try to figure out after you haven't seen it for a long time. I think this thread is a case in point. It is definitely informative to learn how to solve this problem using a RegEx, but that begs the question of whether you really want to use a RegEx. I wouldn't, because, as you said yourself, it looks a bit ugly. It seems easier to just recognize that the word "ABBCBD" is a simple substitution cypher for "SEEMED" ... and then just try to write a very simple subroutine that helps us find different ways to "decrypt" our pattern "ABBCBD" (or whatever your pattern is) using an English dictionary. It is less fun to do it the non RegEx way, but (arguably) easier to understand and debug (YMMV). Here's one way to do it ... TMTOWTDI. use strict; use warnings; my $sPattern; my @aPlainText = qw(looked seemed liked cooked booked baked balked); ### $sPattern = "ABBCBD"; ### try this one later $sPattern = "ABBCED"; while(<@aPlainText>){ print "$_\n" if IsEqual($sPattern,$_); } sub IsEqual { my $sPattern = shift \|\| die "missing argument"; my $sPlain = shift \|\| die "missing argument"; my $iLen = 0; my %hSubstit1 = (); my %hSubstit2 = (); if(length($sPattern) != length($sPlain)){return 0;} else{$iLen = length($sPattern)-1}; for (0 .. $iLen){ my $chrCiphr = lc(substr($sPattern,$_,1)); my $chrPlain = lc(substr($sPlain,$_,1)); ### require a 1 to 1 mapping between plain chars and pattern c +hars $hSubstit1{$chrPlain} = $chrCiphr unless exists $hSubstit1 +{$chrPlain}; $hSubstit2{$chrCiphr} = $chrPlain unless exists $hSubstit2 +{$chrCiphr}; if($chrCiphr ne $hSubstit1{$chrPlain }){return 0;} if($chrPlain ne $hSubstit2{$chrCiphr }){return 0;} } return(1); }###end_sub [download]	[reply] [d/l]
Re: Matching Character Patterns With Backreferences by Anonymous Monk on Dec 05, 2004 at 20:53 UTC
I just realized I should probably paste the code I had that generated that.I should mention that $d is the path to the dictionary file (with the words to match) and $p is the input pattern (ABBCDE above). This is pretty ugly `my %charhash; @letters = split(//,$p); $string = ""; $invert = ""; $count = 1; foreach $letter (@letters) { if(defined $charhash{$letter}) { $string.="(\\$charhash{$letter})"; } else { $charhash{$letter} = $count; if($invert) { $string .= "([^$invert])"; } else { $string .= "(\\w)"; } } $invert .= "\\$count"; $count++; } print "Searching for pattern /^$string\\s/\n"; open DICT, $d; while(<DICT>) { print if /^$string\s/i; }` [download]	[reply] [d/l]
�Re: Matching Character Patterns With Backreferences by merlyn (Sage) on Dec 06, 2004 at 12:05 UTC
Here's my implementation of that from 1999. Lowercase letters stood for themselves. Uppercase letters were the wild ones. Read more... (1428 Bytes)	[reply] [d/l]