Understanding a script that looks for mismatches to a regex

onlyIDleft has asked for the wisdom of the Perl Monks concerning the following question:

Exalted monks!

Here is a script #1 that I do not understand 100% - I did not write it. But what it supposedly does is to look for DNA regions that match a certain query DNA sequence with and without a certain number of mismatches. Reminder: DNA sequences are composed of only A, T, G and C.

In my own script #2 below, I am trying to parse the output from a bioinformatics software to look for exact matches to a certain protein query sequence. the protein alphabet is composed of 20 letters - (ACDEFGHIKLMNPQRSTVWY). The script works OK to find exact matches. But I now want the modified script to also find inexact matches. So, I need to use parts of script #1 in my own script #2 so that I may be able to find partial matches to my protein query sequence defined as a regex

I've tried to modify script #2 to incorporate some elements of script #1, but this is a work in progress. One key few difference in how script #1 and script #2 will behave is that $sequence in my script #2 here is extracted from the parsing of my input whereas in script #1, it is obtained via the sub get_genome_sequence - so I think sub get_genome_sequence should be unnecessary for my script #2, right?

Importantly, I have to admit that I dont quite understand how sub make_approximate works, which is why I am having trouble modifying it for use in my script #2. Could one of you kindly help?

Many thanks

Script #1
#!/usr/bin/perl -w
use strict;

my $ntca_pattern = "GTA.{8}TAC.{20,24}TA.{3}T";    # Pattern to search
+ for
my $sequence =
    get_genome_sequence("NostocChromosome.nt");    # Sequence to searc
+h within

# Find and print out all the exact matches
#
my $exact_pattern = fuzzy_pattern($ntca_pattern, 0);
my @exact_matches = match_positions($exact_pattern, $sequence);
if (@exact_matches) {
   print "Exact matches:\n";
   print_matches(@exact_matches);
}

# Now do the same, but allow one mismatch
#
my $one_mismatch = fuzzy_pattern($ntca_pattern, 1);
my @approximate_matches = match_positions($one_mismatch, $sequence);
if (@approximate_matches) {
   print "Matches with one possible mismatch\n";
   print_matches(@approximate_matches);
}

sub print_matches {
   my (@matches) = @_;
   foreach my $match (@matches) {
      print $match, "\n";
   }
}

sub get_genome_sequence {
   my ($file_path) = @_;
   open GENOME, "<$file_path" or die "Can't open $file_path: $!\n";
   $_ = <GENOME>; # discard first line
   my $sequence = ""; 
   while (<GENOME>) {
      chomp($_);                  # Remove line break
      s/\r//;                     # And carriage return
      $_ = uc $_;                 # Make the line uppercase;
      $sequence = $sequence . $_; # Append the line to the sequence
   }
   return $sequence;
}

##### Black magic past this point
#
use re qw(eval);
use vars qw($matchStart);

# match_positions($pattern, $sequence) returns the start and end point
+s of all
# places in $sequence matching $pattern.
#
sub match_positions {
   my $pattern;
   local $_;
   ($pattern, $_) = @_;
   my @results;
   local $matchStart;
   my $instrumentedPattern = qr/(?{ $matchStart = pos() })$pattern/;
   while (/$instrumentedPattern/g) {
      my $nextStart = pos();
      push @results, "[$matchStart..$nextStart)";
      pos() = $matchStart+1;
   }
   return @results;
}

# fuzzy_pattern($pattern, $mismatches) returns a pattern that matches 
+whatever
# $pattern matches, except that bases may differ in at most $mismatche
+s
# positions.
#
sub fuzzy_pattern {
   my ($original_pattern, $mismatches_allowed) = @_;
   $mismatches_allowed >= 0
      or die "Number of mismatches must be greater than or equal to ze
+ro\n";
   my $new_pattern = make_approximate($original_pattern, $mismatches_a
+llowed);
   return qr/$new_pattern/;
}

# make_approximate("ACT", 1) returns "(?:A(?:C.|.T)|.CT)"
# make_approximate("ACT", 2) returns "(?:A..|.(?:C.|.T))"
# make_approximate("ACGT", 2) returns
#        "(?:A(?:C..|.(?:G.|.T))|.(?:C(?:G.|.T)|.GT))"
#
sub make_approximate {
   my ($pattern, $mismatches_allowed) = @_;
   if ($mismatches_allowed == 0) { return $pattern }
   elsif (length($pattern) <= $mismatches_allowed)
      { $pattern =~ tr/ACTG/./; return $pattern }
   else {
      my ($first, $rest) = $pattern =~ /^(.)(.*)/;
      my $after_match = make_approximate($rest, $mismatches_allowed);
      if ($first =~ /[ACGT]/) {
         my $after_miss = make_approximate($rest, $mismatches_allowed-
+1);
         return "(?:$first$after_match|.$after_miss)";
      }
      else { return "$first$after_match" }
   }
}
[download]

Script #2
#! usr/bin/perl

use strict;
use warnings;

my @filenames = @ARGV;
my (@input, @IDs_and_seqs, @splits, @second, @second_splits, @chrs, $h
+eader,@ref, @output, $exp_start_coord,$exp_end_coord,);
my $motif = "[IFVL][IFVL][AGST][AGST][AGST].[DE]..[IFVL].[IFVL][WY][DE
+]?[IFVL][RK]?";    # Pattern to search for, specific to DWD motif fro
+m journal publication


open(IN1, '<', $filenames[0]) or die "Can't read your genomic DNA mult
+ifasta as input file #1 $_: $! \n";
open(IN2, '<', $filenames[1]) or die "Can't read your fastBlockSearch 
+output text as input file #2 $_: $! \n";

################------------------------------------------------------
+------################
while (<IN1>)    {     # input1 needs to be a genome multifasta file t
+hat has been processed to remove \n from sequences using my custom Pe
+rl script
                    # fix_multifasta0.pl (written by AksR - with file 
+named appended with "cleaned-up"
chomp; push @IDs_and_seqs, $_;}
my %headers_IDs = @IDs_and_seqs;
################------------------------------------------------------
+------################
my $counter = 0;
my @IDs_and_lengths;
while ($counter <= $#IDs_and_seqs) { push @IDs_and_lengths, $IDs_and_s
+eqs[$counter]; push @IDs_and_lengths, length($IDs_and_seqs[($counter+
+1)]);$counter=$counter+2;}
my %seqs_lengths = @IDs_and_lengths; # note that the seqs IDs include 
+the '>' character preceeding the ID name
################------------------------------------------------------
+------################
while(<IN2>)    { 
    #chomp; 
push @input, $_;} #my @array = <IN>; alternative to iterating over whi
+le loop
my $splice = join ('',@input);
my @first_splits = split('Hits found in ',$splice);
shift @first_splits;
my $z = 0; my @cleaned_up;
while ($z<=$#first_splits) { if($first_splits[$z] =~ m/Score/) {push @
+cleaned_up,$first_splits[$z];} $z++;}
################------------------------------------------------------
+------################
foreach(@cleaned_up)    {
my @second_splits = split ('\n',$_,2);
push @second, @second_splits; 
}
################------------------------------------------------------
+------################
my $i = 0;
while ($i <= $#second)                         {
my $chr_name = '>'.$second[$i];
my @hits = split ('--',$second[($i+1)],);
pop @hits;

foreach(@hits)                            {
my $coord;

my ($start_coord,$end_coord,$seq);
my (@coords,@sorted_coords, $sequence);

my @line_splits = split '\n+',$_;

foreach(@line_splits)                {
if ($_ =~ m/unknown_/)        { 
    my @final_split = split '\s+',$_; 
    $coord = shift@final_split;
    $coord = int($coord);

    $sequence = pop@final_split;

################------------------------------------------------------
+------################

my $exact_pattern = fuzzy_pattern($motif, 0);
my @exact_matches = match_positions($exact_pattern, $sequence);
if (@exact_matches) {
   print "Exact matches:\n";
   print_matches(@exact_matches);
}

# Now do the same, but allow one mismatch
#
my $one_mismatch = fuzzy_pattern($motif, 1);
my @approximate_matches = match_positions($one_mismatch, $sequence); #
+$equence  is defined by Anand's Perl script, not by this add-on
if (@approximate_matches) {
   print "Matches with one possible mismatch\n";
   print_matches(@approximate_matches);
}

sub print_matches {
   my (@matches) = @_;
   foreach my $match (@matches) {
      print $match, "\n";
   }
}


use re qw(eval);
use vars qw($matchStart);

# match_positions($pattern, $sequence) returns the start and end point
+s of all
# places in $sequence matching $pattern.
#
sub match_positions {
   my $pattern;
   local $_;
   ($pattern, $_) = @_;
   my @results;
   local $matchStart;
   my $instrumentedPattern = qr/(?{ $matchStart = pos() })$pattern/;
   while (/$instrumentedPattern/g) {
      my $nextStart = pos();
      push @results, "[$matchStart..$nextStart)";
      pos() = $matchStart+1;
   }
   return @results;
}

# fuzzy_pattern($pattern, $mismatches) returns a pattern that matches 
+whatever
# $pattern matches, except that bases may differ in at most $mismatche
+s
# positions.
#
sub fuzzy_pattern {
   my ($original_pattern, $mismatches_allowed) = @_;
   $mismatches_allowed >= 0
      or die "Number of mismatches must be greater than or equal to ze
+ro\n";
   my $new_pattern = make_approximate($original_pattern, $mismatches_a
+llowed);
   return qr/$new_pattern/;
}

# make_approximate("ACT", 1) returns "(?:A(?:C.|.T)|.CT)"
# make_approximate("ACT", 2) returns "(?:A..|.(?:C.|.T))"
# make_approximate("ACGT", 2) returns
#        "(?:A(?:C..|.(?:G.|.T))|.(?:C(?:G.|.T)|.GT))"
#
sub make_approximate {
   my ($pattern, $mismatches_allowed) = @_;
   if ($mismatches_allowed == 0) { return $pattern }
   elsif (length($pattern) <= $mismatches_allowed)
      { $pattern =~ tr/ARNDCEQGHILKMFPSTWYV/./; return $pattern }
   else {
      my ($first, $rest) = $pattern =~ /^(.)(.*)/;
      my $after_match = make_approximate($rest, $mismatches_allowed);
      if ($first =~ /[ARNDCEQGHILKMFPSTWYV]/) {
         my $after_miss = make_approximate($rest, $mismatches_allowed-
+1);
         return "(?:$first$after_match|.$after_miss)";
      }
      else { return "$first$after_match" }
   }
}

################------------------------------------------------------
+------################

#if ($sequence =~ m/([IFVL][IFVL][AGST][AGST][AGST].[DE]..[IFVL].[IFVL
+][WY][DE]?[IFVL][RK]?)/g)
#                    {    
#        my $exp_start_coord = ($coord)-5000; # the start coord of the
+ fastBlastSearch hit has been extended by 5kbp upstream wrt the chrom
+osomal coords
#        my $exp_end_coord = ($coord)+5000; # the end coord of the fas
+tBlastSearch hit has been extended by 5kbp downstream wrt the chromos
+omal coords

#        $exp_start_coord = int($exp_start_coord);
#        $exp_end_coord = int($exp_end_coord);

#        if ($exp_start_coord < 0) {$exp_start_coord = 1;}
#        if ($exp_end_coord > int($seqs_lengths{$chr_name})) {$exp_end
+_coord = int($seqs_lengths{$chr_name});}

#        push @ref, $chr_name.":";
#        push @ref, ($exp_start_coord)."-"; 
#        push @ref, ($exp_end_coord)."_fastBlockSearch-hits"."\n"; 

#        $seq = substr($headers_IDs{$chr_name}, $exp_start_coord, ($ex
+p_end_coord-$exp_start_coord+1));
#        push @ref, $seq."\n";
#                    }
                            }
                                    }
                                            } #foreach(@hits) loop

$i = $i + 2;
                                                    } #for outer most 
+while loop
my $destination = $filenames[1]."_DWD-coding_DNAseqs.fasta";
open (OUT, '>', $destination) or die "Can't write to file $destination
+ : $!\n";
print OUT @ref;
close OUT;
[download]

Comment on Understanding a script that looks for mismatches to a regex Select or Download Code

Replies are listed 'Best First'.
Re: Understanding a script that looks for mismatches to a regex by AnomalousMonk (Archbishop) on Jul 07, 2012 at 20:06 UTC
I don't know if will help you, but Dominus's excellent Higher-Order Perl (freely downloadable here) discusses a solution to a problem similar to yours in section 4.3.2 "Genomic Sequence Generator".	[reply]
Re: Understanding a script that looks for mismatches to a regex by frozenwithjoy (Priest) on Jul 07, 2012 at 02:43 UTC
First off, congratulations on getting the onlyIDleft. Secondly, I think the changes you made in `sub make_approximate()` are the appropriate changes to make to convert it from nucleotides to amino acids. Fourthly, I think the problem arises when you use square brackets in your original motif for matching. I ran it using the following: `my $motif = [IFVL][RK][AGST]; my $answer = make_approximate($motif, 1); print $answer, \n;` [download] output: `[(?:I(?:F(?:V(?:L][(?:R(?:K][(?:A(?:G(?:S(?:T]\|.])\|.T])\|.ST])\|.GST])\|.][AGST])\|.K][AGST])\|.][RK][AGST])\|.L][RK][AGST])\|.VL][RK][AGST])\|.FVL][RK][AGST])` Since `[IFVL]` matches a single character of I or F or V or L, if you use square brackets`[IFVL]`, the subroutine breaks this unit apart. Instead of the output we get above, we would have wanted something like this: `(?:[IFVL](?:[RK].\|.[AGST])\|.[RK][AGST])` So, if you want to use square brackets, you will need to incorporate a step that checks for opening and closing of square brackets and acts differently if inside brackets. Alternatively, you could substitute in placeholders for groupings like `[IFVL]`. This might be better since you could have a hash key for polar amino acids, small and large hydrophobic, etc. Anyway, I’ll leave that for you to think about.	[reply] [d/l] [select]
Re^2: Understanding a script that looks for mismatches to a regex by onlyIDleft (Scribe) on Jul 08, 2012 at 05:11 UTC
Unfortunately, I dont think I have the know how to to modify the regex in order use 'placeholders' as you'd suggested. I looked it up on the internet and tried a few things, none successful Do you think you could help with some explicit code or guide me to some relevent source of info? Thanks frozenwithjoy	[reply]
Re: Understanding a script that looks for mismatches to a regex by frozenwithjoy (Priest) on Jul 07, 2012 at 02:54 UTC
You mentioned that you were having a hard time understanding the `make_approximate()` subroutine. I’ll try to walk you through it step-by-step. Hopefully, it will start to make sense: Read more... (5 kB) Just a quick demo how to interpret the output: `(?:A(?:C.\|.T)\|.CT)` <– Sample output from `make_approximate(ACT, 1)` `( A( C.\|.T)\|.CT)` <– Remove `?:` for ease of reading. These just indicate that the parentheses are not backreferences. `( A( C. ) )` == ACN `( A( .T) )` == ANT `( ( \| )\|.CT)` == NCT	[reply] [d/l] [select]