onlyIDleft has asked for the wisdom of the Perl Monks concerning the following question:
Here is a script #1 that I do not understand 100% - I did not write it. But what it supposedly does is to look for DNA regions that match a certain query DNA sequence with and without a certain number of mismatches. Reminder: DNA sequences are composed of only A, T, G and C.
In my own script #2 below, I am trying to parse the output from a bioinformatics software to look for exact matches to a certain protein query sequence. the protein alphabet is composed of 20 letters - (ACDEFGHIKLMNPQRSTVWY). The script works OK to find exact matches. But I now want the modified script to also find inexact matches. So, I need to use parts of script #1 in my own script #2 so that I may be able to find partial matches to my protein query sequence defined as a regex
I've tried to modify script #2 to incorporate some elements of script #1, but this is a work in progress. One key few difference in how script #1 and script #2 will behave is that $sequence in my script #2 here is extracted from the parsing of my input whereas in script #1, it is obtained via the sub get_genome_sequence - so I think sub get_genome_sequence should be unnecessary for my script #2, right?
Importantly, I have to admit that I dont quite understand how sub make_approximate works, which is why I am having trouble modifying it for use in my script #2. Could one of you kindly help?
Many thanks
Script #1 #!/usr/bin/perl -w use strict; my $ntca_pattern = "GTA.{8}TAC.{20,24}TA.{3}T"; # Pattern to search + for my $sequence = get_genome_sequence("NostocChromosome.nt"); # Sequence to searc +h within # Find and print out all the exact matches # my $exact_pattern = fuzzy_pattern($ntca_pattern, 0); my @exact_matches = match_positions($exact_pattern, $sequence); if (@exact_matches) { print "Exact matches:\n"; print_matches(@exact_matches); } # Now do the same, but allow one mismatch # my $one_mismatch = fuzzy_pattern($ntca_pattern, 1); my @approximate_matches = match_positions($one_mismatch, $sequence); if (@approximate_matches) { print "Matches with one possible mismatch\n"; print_matches(@approximate_matches); } sub print_matches { my (@matches) = @_; foreach my $match (@matches) { print $match, "\n"; } } sub get_genome_sequence { my ($file_path) = @_; open GENOME, "<$file_path" or die "Can't open $file_path: $!\n"; $_ = <GENOME>; # discard first line my $sequence = ""; while (<GENOME>) { chomp($_); # Remove line break s/\r//; # And carriage return $_ = uc $_; # Make the line uppercase; $sequence = $sequence . $_; # Append the line to the sequence } return $sequence; } ##### Black magic past this point # use re qw(eval); use vars qw($matchStart); # match_positions($pattern, $sequence) returns the start and end point +s of all # places in $sequence matching $pattern. # sub match_positions { my $pattern; local $_; ($pattern, $_) = @_; my @results; local $matchStart; my $instrumentedPattern = qr/(?{ $matchStart = pos() })$pattern/; while (/$instrumentedPattern/g) { my $nextStart = pos(); push @results, "[$matchStart..$nextStart)"; pos() = $matchStart+1; } return @results; } # fuzzy_pattern($pattern, $mismatches) returns a pattern that matches +whatever # $pattern matches, except that bases may differ in at most $mismatche +s # positions. # sub fuzzy_pattern { my ($original_pattern, $mismatches_allowed) = @_; $mismatches_allowed >= 0 or die "Number of mismatches must be greater than or equal to ze +ro\n"; my $new_pattern = make_approximate($original_pattern, $mismatches_a +llowed); return qr/$new_pattern/; } # make_approximate("ACT", 1) returns "(?:A(?:C.|.T)|.CT)" # make_approximate("ACT", 2) returns "(?:A..|.(?:C.|.T))" # make_approximate("ACGT", 2) returns # "(?:A(?:C..|.(?:G.|.T))|.(?:C(?:G.|.T)|.GT))" # sub make_approximate { my ($pattern, $mismatches_allowed) = @_; if ($mismatches_allowed == 0) { return $pattern } elsif (length($pattern) <= $mismatches_allowed) { $pattern =~ tr/ACTG/./; return $pattern } else { my ($first, $rest) = $pattern =~ /^(.)(.*)/; my $after_match = make_approximate($rest, $mismatches_allowed); if ($first =~ /[ACGT]/) { my $after_miss = make_approximate($rest, $mismatches_allowed- +1); return "(?:$first$after_match|.$after_miss)"; } else { return "$first$after_match" } } }
Script #2 #! usr/bin/perl use strict; use warnings; my @filenames = @ARGV; my (@input, @IDs_and_seqs, @splits, @second, @second_splits, @chrs, $h +eader,@ref, @output, $exp_start_coord,$exp_end_coord,); my $motif = "[IFVL][IFVL][AGST][AGST][AGST].[DE]..[IFVL].[IFVL][WY][DE +]?[IFVL][RK]?"; # Pattern to search for, specific to DWD motif fro +m journal publication open(IN1, '<', $filenames[0]) or die "Can't read your genomic DNA mult +ifasta as input file #1 $_: $! \n"; open(IN2, '<', $filenames[1]) or die "Can't read your fastBlockSearch +output text as input file #2 $_: $! \n"; ################------------------------------------------------------ +------################ while (<IN1>) { # input1 needs to be a genome multifasta file t +hat has been processed to remove \n from sequences using my custom Pe +rl script # fix_multifasta0.pl (written by AksR - with file +named appended with "cleaned-up" chomp; push @IDs_and_seqs, $_;} my %headers_IDs = @IDs_and_seqs; ################------------------------------------------------------ +------################ my $counter = 0; my @IDs_and_lengths; while ($counter <= $#IDs_and_seqs) { push @IDs_and_lengths, $IDs_and_s +eqs[$counter]; push @IDs_and_lengths, length($IDs_and_seqs[($counter+ +1)]);$counter=$counter+2;} my %seqs_lengths = @IDs_and_lengths; # note that the seqs IDs include +the '>' character preceeding the ID name ################------------------------------------------------------ +------################ while(<IN2>) { #chomp; push @input, $_;} #my @array = <IN>; alternative to iterating over whi +le loop my $splice = join ('',@input); my @first_splits = split('Hits found in ',$splice); shift @first_splits; my $z = 0; my @cleaned_up; while ($z<=$#first_splits) { if($first_splits[$z] =~ m/Score/) {push @ +cleaned_up,$first_splits[$z];} $z++;} ################------------------------------------------------------ +------################ foreach(@cleaned_up) { my @second_splits = split ('\n',$_,2); push @second, @second_splits; } ################------------------------------------------------------ +------################ my $i = 0; while ($i <= $#second) { my $chr_name = '>'.$second[$i]; my @hits = split ('--',$second[($i+1)],); pop @hits; foreach(@hits) { my $coord; my ($start_coord,$end_coord,$seq); my (@coords,@sorted_coords, $sequence); my @line_splits = split '\n+',$_; foreach(@line_splits) { if ($_ =~ m/unknown_/) { my @final_split = split '\s+',$_; $coord = shift@final_split; $coord = int($coord); $sequence = pop@final_split; ################------------------------------------------------------ +------################ my $exact_pattern = fuzzy_pattern($motif, 0); my @exact_matches = match_positions($exact_pattern, $sequence); if (@exact_matches) { print "Exact matches:\n"; print_matches(@exact_matches); } # Now do the same, but allow one mismatch # my $one_mismatch = fuzzy_pattern($motif, 1); my @approximate_matches = match_positions($one_mismatch, $sequence); # +$equence is defined by Anand's Perl script, not by this add-on if (@approximate_matches) { print "Matches with one possible mismatch\n"; print_matches(@approximate_matches); } sub print_matches { my (@matches) = @_; foreach my $match (@matches) { print $match, "\n"; } } use re qw(eval); use vars qw($matchStart); # match_positions($pattern, $sequence) returns the start and end point +s of all # places in $sequence matching $pattern. # sub match_positions { my $pattern; local $_; ($pattern, $_) = @_; my @results; local $matchStart; my $instrumentedPattern = qr/(?{ $matchStart = pos() })$pattern/; while (/$instrumentedPattern/g) { my $nextStart = pos(); push @results, "[$matchStart..$nextStart)"; pos() = $matchStart+1; } return @results; } # fuzzy_pattern($pattern, $mismatches) returns a pattern that matches +whatever # $pattern matches, except that bases may differ in at most $mismatche +s # positions. # sub fuzzy_pattern { my ($original_pattern, $mismatches_allowed) = @_; $mismatches_allowed >= 0 or die "Number of mismatches must be greater than or equal to ze +ro\n"; my $new_pattern = make_approximate($original_pattern, $mismatches_a +llowed); return qr/$new_pattern/; } # make_approximate("ACT", 1) returns "(?:A(?:C.|.T)|.CT)" # make_approximate("ACT", 2) returns "(?:A..|.(?:C.|.T))" # make_approximate("ACGT", 2) returns # "(?:A(?:C..|.(?:G.|.T))|.(?:C(?:G.|.T)|.GT))" # sub make_approximate { my ($pattern, $mismatches_allowed) = @_; if ($mismatches_allowed == 0) { return $pattern } elsif (length($pattern) <= $mismatches_allowed) { $pattern =~ tr/ARNDCEQGHILKMFPSTWYV/./; return $pattern } else { my ($first, $rest) = $pattern =~ /^(.)(.*)/; my $after_match = make_approximate($rest, $mismatches_allowed); if ($first =~ /[ARNDCEQGHILKMFPSTWYV]/) { my $after_miss = make_approximate($rest, $mismatches_allowed- +1); return "(?:$first$after_match|.$after_miss)"; } else { return "$first$after_match" } } } ################------------------------------------------------------ +------################ #if ($sequence =~ m/([IFVL][IFVL][AGST][AGST][AGST].[DE]..[IFVL].[IFVL +][WY][DE]?[IFVL][RK]?)/g) # { # my $exp_start_coord = ($coord)-5000; # the start coord of the + fastBlastSearch hit has been extended by 5kbp upstream wrt the chrom +osomal coords # my $exp_end_coord = ($coord)+5000; # the end coord of the fas +tBlastSearch hit has been extended by 5kbp downstream wrt the chromos +omal coords # $exp_start_coord = int($exp_start_coord); # $exp_end_coord = int($exp_end_coord); # if ($exp_start_coord < 0) {$exp_start_coord = 1;} # if ($exp_end_coord > int($seqs_lengths{$chr_name})) {$exp_end +_coord = int($seqs_lengths{$chr_name});} # push @ref, $chr_name.":"; # push @ref, ($exp_start_coord)."-"; # push @ref, ($exp_end_coord)."_fastBlockSearch-hits"."\n"; # $seq = substr($headers_IDs{$chr_name}, $exp_start_coord, ($ex +p_end_coord-$exp_start_coord+1)); # push @ref, $seq."\n"; # } } } } #foreach(@hits) loop $i = $i + 2; } #for outer most +while loop my $destination = $filenames[1]."_DWD-coding_DNAseqs.fasta"; open (OUT, '>', $destination) or die "Can't write to file $destination + : $!\n"; print OUT @ref; close OUT;
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re: Understanding a script that looks for mismatches to a regex
by AnomalousMonk (Archbishop) on Jul 07, 2012 at 20:06 UTC | |
|
Re: Understanding a script that looks for mismatches to a regex
by frozenwithjoy (Priest) on Jul 07, 2012 at 02:43 UTC | |
by onlyIDleft (Scribe) on Jul 08, 2012 at 05:11 UTC | |
|
Re: Understanding a script that looks for mismatches to a regex
by frozenwithjoy (Priest) on Jul 07, 2012 at 02:54 UTC |