onlyIDleft has asked for the wisdom of the Perl Monks concerning the following question:

Exalted monks!

Here is a script #1 that I do not understand 100% - I did not write it. But what it supposedly does is to look for DNA regions that match a certain query DNA sequence with and without a certain number of mismatches. Reminder: DNA sequences are composed of only A, T, G and C.

In my own script #2 below, I am trying to parse the output from a bioinformatics software to look for exact matches to a certain protein query sequence. the protein alphabet is composed of 20 letters - (ACDEFGHIKLMNPQRSTVWY). The script works OK to find exact matches. But I now want the modified script to also find inexact matches. So, I need to use parts of script #1 in my own script #2 so that I may be able to find partial matches to my protein query sequence defined as a regex

I've tried to modify script #2 to incorporate some elements of script #1, but this is a work in progress. One key few difference in how script #1 and script #2 will behave is that $sequence in my script #2 here is extracted from the parsing of my input whereas in script #1, it is obtained via the sub get_genome_sequence - so I think sub get_genome_sequence should be unnecessary for my script #2, right?

Importantly, I have to admit that I dont quite understand how sub make_approximate works, which is why I am having trouble modifying it for use in my script #2. Could one of you kindly help?

Many thanks

Script #1 #!/usr/bin/perl -w use strict; my $ntca_pattern = "GTA.{8}TAC.{20,24}TA.{3}T"; # Pattern to search + for my $sequence = get_genome_sequence("NostocChromosome.nt"); # Sequence to searc +h within # Find and print out all the exact matches # my $exact_pattern = fuzzy_pattern($ntca_pattern, 0); my @exact_matches = match_positions($exact_pattern, $sequence); if (@exact_matches) { print "Exact matches:\n"; print_matches(@exact_matches); } # Now do the same, but allow one mismatch # my $one_mismatch = fuzzy_pattern($ntca_pattern, 1); my @approximate_matches = match_positions($one_mismatch, $sequence); if (@approximate_matches) { print "Matches with one possible mismatch\n"; print_matches(@approximate_matches); } sub print_matches { my (@matches) = @_; foreach my $match (@matches) { print $match, "\n"; } } sub get_genome_sequence { my ($file_path) = @_; open GENOME, "<$file_path" or die "Can't open $file_path: $!\n"; $_ = <GENOME>; # discard first line my $sequence = ""; while (<GENOME>) { chomp($_); # Remove line break s/\r//; # And carriage return $_ = uc $_; # Make the line uppercase; $sequence = $sequence . $_; # Append the line to the sequence } return $sequence; } ##### Black magic past this point # use re qw(eval); use vars qw($matchStart); # match_positions($pattern, $sequence) returns the start and end point +s of all # places in $sequence matching $pattern. # sub match_positions { my $pattern; local $_; ($pattern, $_) = @_; my @results; local $matchStart; my $instrumentedPattern = qr/(?{ $matchStart = pos() })$pattern/; while (/$instrumentedPattern/g) { my $nextStart = pos(); push @results, "[$matchStart..$nextStart)"; pos() = $matchStart+1; } return @results; } # fuzzy_pattern($pattern, $mismatches) returns a pattern that matches +whatever # $pattern matches, except that bases may differ in at most $mismatche +s # positions. # sub fuzzy_pattern { my ($original_pattern, $mismatches_allowed) = @_; $mismatches_allowed >= 0 or die "Number of mismatches must be greater than or equal to ze +ro\n"; my $new_pattern = make_approximate($original_pattern, $mismatches_a +llowed); return qr/$new_pattern/; } # make_approximate("ACT", 1) returns "(?:A(?:C.|.T)|.CT)" # make_approximate("ACT", 2) returns "(?:A..|.(?:C.|.T))" # make_approximate("ACGT", 2) returns # "(?:A(?:C..|.(?:G.|.T))|.(?:C(?:G.|.T)|.GT))" # sub make_approximate { my ($pattern, $mismatches_allowed) = @_; if ($mismatches_allowed == 0) { return $pattern } elsif (length($pattern) <= $mismatches_allowed) { $pattern =~ tr/ACTG/./; return $pattern } else { my ($first, $rest) = $pattern =~ /^(.)(.*)/; my $after_match = make_approximate($rest, $mismatches_allowed); if ($first =~ /[ACGT]/) { my $after_miss = make_approximate($rest, $mismatches_allowed- +1); return "(?:$first$after_match|.$after_miss)"; } else { return "$first$after_match" } } }

Script #2 #! usr/bin/perl use strict; use warnings; my @filenames = @ARGV; my (@input, @IDs_and_seqs, @splits, @second, @second_splits, @chrs, $h +eader,@ref, @output, $exp_start_coord,$exp_end_coord,); my $motif = "[IFVL][IFVL][AGST][AGST][AGST].[DE]..[IFVL].[IFVL][WY][DE +]?[IFVL][RK]?"; # Pattern to search for, specific to DWD motif fro +m journal publication open(IN1, '<', $filenames[0]) or die "Can't read your genomic DNA mult +ifasta as input file #1 $_: $! \n"; open(IN2, '<', $filenames[1]) or die "Can't read your fastBlockSearch +output text as input file #2 $_: $! \n"; ################------------------------------------------------------ +------################ while (<IN1>) { # input1 needs to be a genome multifasta file t +hat has been processed to remove \n from sequences using my custom Pe +rl script # fix_multifasta0.pl (written by AksR - with file +named appended with "cleaned-up" chomp; push @IDs_and_seqs, $_;} my %headers_IDs = @IDs_and_seqs; ################------------------------------------------------------ +------################ my $counter = 0; my @IDs_and_lengths; while ($counter <= $#IDs_and_seqs) { push @IDs_and_lengths, $IDs_and_s +eqs[$counter]; push @IDs_and_lengths, length($IDs_and_seqs[($counter+ +1)]);$counter=$counter+2;} my %seqs_lengths = @IDs_and_lengths; # note that the seqs IDs include +the '>' character preceeding the ID name ################------------------------------------------------------ +------################ while(<IN2>) { #chomp; push @input, $_;} #my @array = <IN>; alternative to iterating over whi +le loop my $splice = join ('',@input); my @first_splits = split('Hits found in ',$splice); shift @first_splits; my $z = 0; my @cleaned_up; while ($z<=$#first_splits) { if($first_splits[$z] =~ m/Score/) {push @ +cleaned_up,$first_splits[$z];} $z++;} ################------------------------------------------------------ +------################ foreach(@cleaned_up) { my @second_splits = split ('\n',$_,2); push @second, @second_splits; } ################------------------------------------------------------ +------################ my $i = 0; while ($i <= $#second) { my $chr_name = '>'.$second[$i]; my @hits = split ('--',$second[($i+1)],); pop @hits; foreach(@hits) { my $coord; my ($start_coord,$end_coord,$seq); my (@coords,@sorted_coords, $sequence); my @line_splits = split '\n+',$_; foreach(@line_splits) { if ($_ =~ m/unknown_/) { my @final_split = split '\s+',$_; $coord = shift@final_split; $coord = int($coord); $sequence = pop@final_split; ################------------------------------------------------------ +------################ my $exact_pattern = fuzzy_pattern($motif, 0); my @exact_matches = match_positions($exact_pattern, $sequence); if (@exact_matches) { print "Exact matches:\n"; print_matches(@exact_matches); } # Now do the same, but allow one mismatch # my $one_mismatch = fuzzy_pattern($motif, 1); my @approximate_matches = match_positions($one_mismatch, $sequence); # +$equence is defined by Anand's Perl script, not by this add-on if (@approximate_matches) { print "Matches with one possible mismatch\n"; print_matches(@approximate_matches); } sub print_matches { my (@matches) = @_; foreach my $match (@matches) { print $match, "\n"; } } use re qw(eval); use vars qw($matchStart); # match_positions($pattern, $sequence) returns the start and end point +s of all # places in $sequence matching $pattern. # sub match_positions { my $pattern; local $_; ($pattern, $_) = @_; my @results; local $matchStart; my $instrumentedPattern = qr/(?{ $matchStart = pos() })$pattern/; while (/$instrumentedPattern/g) { my $nextStart = pos(); push @results, "[$matchStart..$nextStart)"; pos() = $matchStart+1; } return @results; } # fuzzy_pattern($pattern, $mismatches) returns a pattern that matches +whatever # $pattern matches, except that bases may differ in at most $mismatche +s # positions. # sub fuzzy_pattern { my ($original_pattern, $mismatches_allowed) = @_; $mismatches_allowed >= 0 or die "Number of mismatches must be greater than or equal to ze +ro\n"; my $new_pattern = make_approximate($original_pattern, $mismatches_a +llowed); return qr/$new_pattern/; } # make_approximate("ACT", 1) returns "(?:A(?:C.|.T)|.CT)" # make_approximate("ACT", 2) returns "(?:A..|.(?:C.|.T))" # make_approximate("ACGT", 2) returns # "(?:A(?:C..|.(?:G.|.T))|.(?:C(?:G.|.T)|.GT))" # sub make_approximate { my ($pattern, $mismatches_allowed) = @_; if ($mismatches_allowed == 0) { return $pattern } elsif (length($pattern) <= $mismatches_allowed) { $pattern =~ tr/ARNDCEQGHILKMFPSTWYV/./; return $pattern } else { my ($first, $rest) = $pattern =~ /^(.)(.*)/; my $after_match = make_approximate($rest, $mismatches_allowed); if ($first =~ /[ARNDCEQGHILKMFPSTWYV]/) { my $after_miss = make_approximate($rest, $mismatches_allowed- +1); return "(?:$first$after_match|.$after_miss)"; } else { return "$first$after_match" } } } ################------------------------------------------------------ +------################ #if ($sequence =~ m/([IFVL][IFVL][AGST][AGST][AGST].[DE]..[IFVL].[IFVL +][WY][DE]?[IFVL][RK]?)/g) # { # my $exp_start_coord = ($coord)-5000; # the start coord of the + fastBlastSearch hit has been extended by 5kbp upstream wrt the chrom +osomal coords # my $exp_end_coord = ($coord)+5000; # the end coord of the fas +tBlastSearch hit has been extended by 5kbp downstream wrt the chromos +omal coords # $exp_start_coord = int($exp_start_coord); # $exp_end_coord = int($exp_end_coord); # if ($exp_start_coord < 0) {$exp_start_coord = 1;} # if ($exp_end_coord > int($seqs_lengths{$chr_name})) {$exp_end +_coord = int($seqs_lengths{$chr_name});} # push @ref, $chr_name.":"; # push @ref, ($exp_start_coord)."-"; # push @ref, ($exp_end_coord)."_fastBlockSearch-hits"."\n"; # $seq = substr($headers_IDs{$chr_name}, $exp_start_coord, ($ex +p_end_coord-$exp_start_coord+1)); # push @ref, $seq."\n"; # } } } } #foreach(@hits) loop $i = $i + 2; } #for outer most +while loop my $destination = $filenames[1]."_DWD-coding_DNAseqs.fasta"; open (OUT, '>', $destination) or die "Can't write to file $destination + : $!\n"; print OUT @ref; close OUT;

Replies are listed 'Best First'.
Re: Understanding a script that looks for mismatches to a regex
by AnomalousMonk (Archbishop) on Jul 07, 2012 at 20:06 UTC

    I don't know if will help you, but Dominus's excellent Higher-Order Perl (freely downloadable here) discusses a solution to a problem similar to yours in section 4.3.2 "Genomic Sequence Generator".

Re: Understanding a script that looks for mismatches to a regex
by frozenwithjoy (Priest) on Jul 07, 2012 at 02:43 UTC

    First off, congratulations on getting the onlyIDleft.

    Secondly, I think the changes you made in sub make_approximate() are the appropriate changes to make to convert it from nucleotides to amino acids.

    Fourthly, I think the problem arises when you use square brackets in your original motif for matching. I ran it using the following:

    my $motif = [IFVL][RK][AGST]; my $answer = make_approximate($motif, 1); print $answer, \n;

    output: [(?:I(?:F(?:V(?:L][(?:R(?:K][(?:A(?:G(?:S(?:T]|.])|.T])|.ST])|.GST])|.][AGST])|.K][AGST])|.][RK][AGST])|.L][RK][AGST])|.VL][RK][AGST])|.FVL][RK][AGST])

    Since [IFVL] matches a single character of I or F or V or L, if you use square brackets[IFVL], the subroutine breaks this unit apart. Instead of the output we get above, we would have wanted something like this:

    (?:[IFVL](?:[RK].|.[AGST])|.[RK][AGST])

    So, if you want to use square brackets, you will need to incorporate a step that checks for opening and closing of square brackets and acts differently if inside brackets. Alternatively, you could substitute in placeholders for groupings like [IFVL]. This might be better since you could have a hash key for polar amino acids, small and large hydrophobic, etc. Anyway, I’ll leave that for you to think about.

      Unfortunately, I dont think I have the know how to to modify the regex in order use 'placeholders' as you'd suggested. I looked it up on the internet and tried a few things, none successful Do you think you could help with some explicit code or guide me to some relevent source of info? Thanks frozenwithjoy

Re: Understanding a script that looks for mismatches to a regex
by frozenwithjoy (Priest) on Jul 07, 2012 at 02:54 UTC

    You mentioned that you were having a hard time understanding the make_approximate() subroutine. I’ll try to walk you through it step-by-step. Hopefully, it will start to make sense:


    Just a quick demo how to interpret the output:

    (?:A(?:C.|.T)|.CT) <– Sample output from make_approximate(ACT, 1)
    (  A(  C.|.T)|.CT) <– Remove ?: for ease of reading. These just indicate that the parentheses are not backreferences.
    (  A(  C.   )    ) == ACN
    (  A(     .T)    ) == ANT
    (   (    |  )|.CT) == NCT