in reply to Understanding a script that looks for mismatches to a regex
You mentioned that you were having a hard time understanding the make_approximate() subroutine. I’ll try to walk you through it step-by-step. Hopefully, it will start to make sense:
We start by getting parameters passed from sub fuzzy_pattern():
01 my ( $pattern, $mismatches_allowed ) = @_;
If no mismatches are allowed, use the pattern passed to the subroutine and exit sub make_approximate():
02 if ( $mismatches_allowed == 0 ) { return $pattern }
If the original pattern is too short (i.e., as many or more mismatches allowed than the length), then change each character into a wildcard using the transliteration operator (tr///) on line 04. Use an all wildcard pattern and exit sub make_approximate():
03 elsif ( length($pattern) <= $mismatches_allowed ) { 04 $pattern =~ tr/ACTG/./; 05 return $pattern; 06 }
If we’ve reached line 07, then at least one mismatch is allowed and the pattern is longer than the # of mismatches.
07 else {
Line 08 matches a regex to the pattern and captures the pattern in two chunks: the first character (.) and all the other characters (.*). Surrounding a regex with parentheses creates what is called a backreference and allows you to capture the match. Here is is captured in a list containing two scalars. It really isn’t required in this situation, but the ^ symbol specifies a match to the beginning of a line (likewise, $ matches the end of a line).
08 my ( $first, $rest ) = $pattern =~ /^(.)(.*)/;
Line 09 might seem confusing at first; however, it is just that this subroutine is recursive (and gets more confusing as we go on). It calls itself inside itself. This is essentially the same as when the subroutine was called by sub fuzzy_pattern() except that the new pattern ($rest) is one character shorter than the original pattern. This subroutine will keep calling itself within itself until the deepest iteration hits one of the two return statements above on lines 02 or 05. At that point the second deepest iteration to the line after the one we are on now.
09 my $after_match = make_approximate( $rest, $mismatches_allowed );
Up above on ine 08, we pulled one character ($first) off the front of the the pattern. If this character was originally specified as a nucleotide [ACT(or)G], then pretend like you are changing it to a wildcard and do another call of this subroutine on line 11, but with $mismatches_allowed decreased by one, since we just coerced a nucleotide into a mismatch. At this point we return a pattern that essentially says match either A) the first character + the result of the recursively called subroutine on line 09 or B) a wildcard + the result from line 09 (my brain is starting to unravel…).
10 if ( $first =~ /[ACGT]/ ) { 11 my $after_miss = make_approximate( $rest, $mismatches_al +lowed - 1 ); 12 return (?:$first$after_match|.$after_miss); 13 }
If we didn’t enter the block after the if on line 10 (likely because some wildcards in our original pattern specified in the main script such as .{1,4}), then we return A) the first character + the result of subroutine on line 09 and we are done… Well, sort of. Really, we just pull ourselves out of the recursive hole we’ve dug and we eventually get out.
14 else { return $first$after_match } 15 }
Just a quick demo how to interpret the output:
(?:A(?:C.|.T)|.CT) <– Sample output from make_approximate(ACT, 1)
( A( C.|.T)|.CT) <– Remove ?: for ease of reading. These just indicate that the parentheses are not backreferences.
( A( C. ) ) == ACN
( A( .T) ) == ANT
( ( | )|.CT) == NCT
|
|---|