Find a sequence in a multifasta files and motifs

rebkirl has asked for the wisdom of the Perl Monks concerning the following question:

Dear all,

I am quite new to Perl scripting and I am trying to solve the following:

Given a list of mutlifasta file (file.txt):

>JJ57
MKIKLVTVGDAKEEYLIQGINEYLKRLNSYAKRETIEVPDEKAPEKLSDAEMLQVKEKEGEYVFVLAI
NGKQLSSEEFSKEIFQTGISGKSNLTFTTCFSLGLSDSVLQRIMKGEPYHKL
>JJ22
MDQNGASGSHPNRASTRKGAHARERGATVSAMSANRSNIIDEMAKICEADRQTFAIARRTRNESQ
FFGFRTASNKAIEITEAMEKRGAMFLTQSKATDQLNGWQPSDEPDKTSAESEPWFRGKQLSSEEFS
KEIFQTGISGKSNLTFTTCFSLGLSDSVLQRIMKGEPYHKL
>JJ41
MWKTVAPIFAAIFAVGLCGTFRTNTRKGEPTTKCFVFVHDTKARIYQCTFKTWSCPWLNNIVSAQF
QFVTGANYKIVVKLVGELFTETALFNWSSPTTIFTGLGTLITADKTLDCDSNML
[download]

Given a user input (i.e. JJ22) as argument, I want to check if the input is present in the file.txt and count all the motifs MT.KA (M or T, any letter (.) followed by K or A) in the JJ22 sequence.

I tried (unsuccessfully) the following code:

#!/usr/bin/perl
use warnings;
use strict;

if(!open(MY_HANDLE, "file.txt")){
    die "Cannot open the file\n";
    }
@content = <MY_HANDLE>;
close(MY_HANDLE);

print("Type the input ID: ");    
my $id = <STDIN>;
chomp($id);
 
foreach $row(@content){
  chomp($row);
     if (@id=$row =~ /([MT].[KA]+)/g)  {
     $numID=scalar(@id);
     print(“@id,$numID\n”);
     }
[download]

Thank you a lot for your help

Comment on Find a sequence in a multifasta files and motifs Select or Download Code

Replies are listed 'Best First'.
Re: Find a sequence in a multifasta files and motifs by choroba (Cardinal) on Jun 27, 2019 at 14:37 UTC
Use a flip-flop to detect you're inside the correct sequence. Accumulate the sequence in a variable, use matching in scalarised list context to get the number of occurrences (the Goatse/Saturn operator). If the patterns can overlap, use a look-ahead instead of a plain match. #! /usr/bin/perl use warnings; use strict; use feature qw{ say }; my $seq; while (<DATA>) { if (my $line_number = (/^>JJ22$/ ... /^>/)) { chomp; $seq .= $_ unless $line_number == 1 \|\| $line_number =~ /E0/; } } my $count = () = $seq =~ /[MT].[KA]/g; say $count; __DATA__ >JJ57 MKIKLVTVGDAKEEYLIQGINEYLKRLNSYAKRETIEVPDEKAPEKLSDAEMLQVKEKEGEYVFVLAI NGKQLSSEEFSKEIFQTGISGKSNLTFTTCFSLGLSDSVLQRIMKGEPYHKL >JJ22 MDQNGASGSHPNRASTRKGAHARERGATVSAMSANRSNIIDEMAKICEADRQTFAIARRTRNESQ FFGFRTASNKAIEITEAMEKRGAMFLTQSKATDQLNGWQPSDEPDKTSAESEPWFRGKQLSSEEFS KEIFQTGISGKSNLTFTTCFSLGLSDSVLQRIMKGEPYHKL >JJ41 MWKTVAPIFAAIFAVGLCGTFRTNTRKGEPTTKCFVFVHDTKARIYQCTFKTWSCPWLNNIVSAQF QFVTGANYKIVVKLVGELFTETALFNWSSPTTIFTGLGTLITADKTLDCDSNML [download] The first check in the `unless` part excludes the header, the `E0` is appended to the last line number, so it excludes the next header. `map{substr$_->[0],$_->[1]\|\|0,1}[\\|\|{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^ARGV,3]`	[reply] [d/l] [select]
Re^2: Find a sequence in a multifasta files and motifs by rebkirl (Acolyte) on Jun 28, 2019 at 10:12 UTC
Thanks...I'll try to fix the code	[reply]
Re: Find a sequence in a multifasta files and motifs by tybalt89 (Monsignor) on Jun 27, 2019 at 17:12 UTC
#!/usr/bin/perl # https://perlmonks.org/?node_id=11102026 use strict; use warnings; my $data = do { local $/; <DATA> }; my $sequence = 'JJ22'; $data =~ /^>$sequence\n((?:\w+\n)*)/m or die "$sequence not found";; my $count = () = $1 =~ s/\W+//gr =~ /(?=[MT].[KA])/g; print "$sequence motif count: $count\n"; __DATA__ >JJ57 MKIKLVTVGDAKEEYLIQGINEYLKRLNSYAKRETIEVPDEKAPEKLSDAEMLQVKEKEGEYVFVLAI NGKQLSSEEFSKEIFQTGISGKSNLTFTTCFSLGLSDSVLQRIMKGEPYHKL >JJ22 MDQNGASGSHPNRASTRKGAHARERGATVSAMSANRSNIIDEMAKICEADRQTFAIARRTRNESQ FFGFRTASNKAIEITEAMEKRGAMFLTQSKATDQLNGWQPSDEPDKTSAESEPWFRGKQLSSEEFS KEIFQTGISGKSNLTFTTCFSLGLSDSVLQRIMKGEPYHKL >JJ41 MWKTVAPIFAAIFAVGLCGTFRTNTRKGEPTTKCFVFVHDTKARIYQCTFKTWSCPWLNNIVSAQF QFVTGANYKIVVKLVGELFTETALFNWSSPTTIFTGLGTLITADKTLDCDSNML [download]	[reply] [d/l]
Re: Find a sequence in a multifasta files and motifs by betmatt (Scribe) on Jun 28, 2019 at 01:37 UTC
First things first. Generate a separate file for each amino acid sequence. Put these flat files into a separate folder. Label each file by the FASTA code. You may then be in a position to put each line of each sequence into the elements of an array. I'm assuming that the sequences correspond to particular proteins. Looking at your code. I have never seen the following before: `if(!open(MY_HANDLE, "file.txt")){` [download] Try to keep to what you might find in a text book. Your taking a single argument as an id and then worked with an array `@id`. That needs to be fixed. ~~I am assuming that you know that your regular expression does not match what you have said in the text.~~The standard input is the protein code ID but the regular expression that is search for is a protein motif. You are jumping ahead of yourself in the code. I expect that you will modify the program to then allow for alternative protein motifs to be inputed by the user.	[reply] [d/l] [select]
Re^2: Find a sequence in a multifasta files and motifs by haukex (Archbishop) on Jun 30, 2019 at 10:53 UTC
I have never seen the following before: `if(!open(MY_HANDLE, "file.txt")){` Try to keep to what you might find in a text book. TIMTOWTDI - `if(!open(MY_HANDLE, "file.txt")){ die "Cannot open the file\n"; }` is pretty much exactly* the same as `open MY_HANDLE, "file.txt" or die "Cannot open the file\n";`, but the former might be much more familiar to someone coming from a language such as C. Personally I'd just have pointed out that generally the three-argument open and lexical filehandles have several advantages over the two-argument open and bareword filehandles, and that it's usually nicer to include the filename and $! in the error message. * Update: One difference is that `if` introduces a new scope, which would become relevant when using lexical filehandles. But as currently written, they're the same.	[reply] [d/l] [select]
Re^2: Find a sequence in a multifasta files and motifs by rebkirl (Acolyte) on Jun 28, 2019 at 10:08 UTC
Thanks for your suggestions. I fixed the regex mistake	[reply]
Re: Find a sequence in a multifasta files and motifs by holli (Abbot) on Jun 27, 2019 at 21:28 UTC
Are you working in a government lab on alien DNA? I mean, shouldnt a fasta file contain A;G;C;T instead of A;C;D;E;F;G;H;I;K;L;M;N;P;Q;R;S;T;V;W;Y? Edit: Thanks for the explanations, but you kinda ruined the fun. holli You can lead your users to water, but alas, you cannot drown them.	[reply] [d/l]
Re^2: Find a sequence in a multifasta files and motifs by kcott (Archbishop) on Jun 28, 2019 at 06:41 UTC
G'day holli, Acually, those look like amino acids. `MKIKLV...` = Methionine-Lysine-Isoleucine-Lysine-Leucine-Valine-... See "FASTA format: Sequence representation". — Ken	[reply] [d/l]
Re^2: Find a sequence in a multifasta files and motifs by choroba (Cardinal) on Jun 28, 2019 at 08:18 UTC
Translating RNA into Protein `map{substr$_->[0],$_->[1]\|\|0,1}[\\|\|{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^ARGV,3]`	[reply] [d/l]
Re^2: Find a sequence in a multifasta files and motifs by AnomalousMonk (Archbishop) on Jun 27, 2019 at 21:42 UTC
I'm not a bio-guy, but I think this is IUPAC notation for Earthian DNA. Give a man a fish: `<%-{-{-{-<`	[reply] [d/l]
Re^2: Find a sequence in a multifasta files and motifs by rebkirl (Acolyte) on Jun 28, 2019 at 10:05 UTC
https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=BlastHelp	[reply]