rebkirl has asked for the wisdom of the Perl Monks concerning the following question:

Dear all,

I am quite new to Perl scripting and I am trying to solve the following:

Given a list of mutlifasta file (file.txt):

>JJ57 MKIKLVTVGDAKEEYLIQGINEYLKRLNSYAKRETIEVPDEKAPEKLSDAEMLQVKEKEGEYVFVLAI NGKQLSSEEFSKEIFQTGISGKSNLTFTTCFSLGLSDSVLQRIMKGEPYHKL >JJ22 MDQNGASGSHPNRASTRKGAHARERGATVSAMSANRSNIIDEMAKICEADRQTFAIARRTRNESQ FFGFRTASNKAIEITEAMEKRGAMFLTQSKATDQLNGWQPSDEPDKTSAESEPWFRGKQLSSEEFS KEIFQTGISGKSNLTFTTCFSLGLSDSVLQRIMKGEPYHKL >JJ41 MWKTVAPIFAAIFAVGLCGTFRTNTRKGEPTTKCFVFVHDTKARIYQCTFKTWSCPWLNNIVSAQF QFVTGANYKIVVKLVGELFTETALFNWSSPTTIFTGLGTLITADKTLDCDSNML

Given a user input (i.e. JJ22) as argument, I want to check if the input is present in the file.txt and count all the motifs MT.KA (M or T, any letter (.) followed by K or A) in the JJ22 sequence.

I tried (unsuccessfully) the following code:

#!/usr/bin/perl use warnings; use strict; if(!open(MY_HANDLE, "file.txt")){ die "Cannot open the file\n"; } @content = <MY_HANDLE>; close(MY_HANDLE); print("Type the input ID: "); my $id = <STDIN>; chomp($id); foreach $row(@content){ chomp($row); if (@id=$row =~ /([MT].[KA]+)/g) { $numID=scalar(@id); print(“@id,$numID\n”); }

Thank you a lot for your help

Replies are listed 'Best First'.
Re: Find a sequence in a multifasta files and motifs
by choroba (Cardinal) on Jun 27, 2019 at 14:37 UTC
    Use a flip-flop to detect you're inside the correct sequence. Accumulate the sequence in a variable, use matching in scalarised list context to get the number of occurrences (the Goatse/Saturn operator). If the patterns can overlap, use a look-ahead instead of a plain match.
    #! /usr/bin/perl use warnings; use strict; use feature qw{ say }; my $seq; while (<DATA>) { if (my $line_number = (/^>JJ22$/ ... /^>/)) { chomp; $seq .= $_ unless $line_number == 1 || $line_number =~ /E0/; } } my $count = () = $seq =~ /[MT].[KA]/g; say $count; __DATA__ >JJ57 MKIKLVTVGDAKEEYLIQGINEYLKRLNSYAKRETIEVPDEKAPEKLSDAEMLQVKEKEGEYVFVLAI NGKQLSSEEFSKEIFQTGISGKSNLTFTTCFSLGLSDSVLQRIMKGEPYHKL >JJ22 MDQNGASGSHPNRASTRKGAHARERGATVSAMSANRSNIIDEMAKICEADRQTFAIARRTRNESQ FFGFRTASNKAIEITEAMEKRGAMFLTQSKATDQLNGWQPSDEPDKTSAESEPWFRGKQLSSEEFS KEIFQTGISGKSNLTFTTCFSLGLSDSVLQRIMKGEPYHKL >JJ41 MWKTVAPIFAAIFAVGLCGTFRTNTRKGEPTTKCFVFVHDTKARIYQCTFKTWSCPWLNNIVSAQF QFVTGANYKIVVKLVGELFTETALFNWSSPTTIFTGLGTLITADKTLDCDSNML

    The first check in the unless part excludes the header, the E0 is appended to the last line number, so it excludes the next header.

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
      Thanks...I'll try to fix the code
Re: Find a sequence in a multifasta files and motifs
by tybalt89 (Monsignor) on Jun 27, 2019 at 17:12 UTC
    #!/usr/bin/perl # https://perlmonks.org/?node_id=11102026 use strict; use warnings; my $data = do { local $/; <DATA> }; my $sequence = 'JJ22'; $data =~ /^>$sequence\n((?:\w+\n)*)/m or die "$sequence not found";; my $count = () = $1 =~ s/\W+//gr =~ /(?=[MT].[KA])/g; print "$sequence motif count: $count\n"; __DATA__ >JJ57 MKIKLVTVGDAKEEYLIQGINEYLKRLNSYAKRETIEVPDEKAPEKLSDAEMLQVKEKEGEYVFVLAI NGKQLSSEEFSKEIFQTGISGKSNLTFTTCFSLGLSDSVLQRIMKGEPYHKL >JJ22 MDQNGASGSHPNRASTRKGAHARERGATVSAMSANRSNIIDEMAKICEADRQTFAIARRTRNESQ FFGFRTASNKAIEITEAMEKRGAMFLTQSKATDQLNGWQPSDEPDKTSAESEPWFRGKQLSSEEFS KEIFQTGISGKSNLTFTTCFSLGLSDSVLQRIMKGEPYHKL >JJ41 MWKTVAPIFAAIFAVGLCGTFRTNTRKGEPTTKCFVFVHDTKARIYQCTFKTWSCPWLNNIVSAQF QFVTGANYKIVVKLVGELFTETALFNWSSPTTIFTGLGTLITADKTLDCDSNML
Re: Find a sequence in a multifasta files and motifs
by betmatt (Scribe) on Jun 28, 2019 at 01:37 UTC
    First things first. Generate a separate file for each amino acid sequence. Put these flat files into a separate folder. Label each file by the FASTA code. You may then be in a position to put each line of each sequence into the elements of an array. I'm assuming that the sequences correspond to particular proteins.


    Looking at your code. I have never seen the following before:

    if(!open(MY_HANDLE, "file.txt")){
    Try to keep to what you might find in a text book.

    Your taking a single argument as an id and then worked with an array @id. That needs to be fixed.

    I am assuming that you know that your regular expression does not match what you have said in the text.The standard input is the protein code ID but the regular expression that is search for is a protein motif. You are jumping ahead of yourself in the code. I expect that you will modify the program to then allow for alternative protein motifs to be inputed by the user.
      I have never seen the following before: if(!open(MY_HANDLE, "file.txt")){ Try to keep to what you might find in a text book.

      TIMTOWTDI - if(!open(MY_HANDLE, "file.txt")){ die "Cannot open the file\n"; } is pretty much exactly* the same as open MY_HANDLE, "file.txt" or die "Cannot open the file\n";, but the former might be much more familiar to someone coming from a language such as C.

      Personally I'd just have pointed out that generally the three-argument open and lexical filehandles have several advantages over the two-argument open and bareword filehandles, and that it's usually nicer to include the filename and $! in the error message.

      * Update: One difference is that if introduces a new scope, which would become relevant when using lexical filehandles. But as currently written, they're the same.

      Thanks for your suggestions. I fixed the regex mistake
Re: Find a sequence in a multifasta files and motifs
by holli (Abbot) on Jun 27, 2019 at 21:28 UTC
    Are you working in a government lab on alien DNA? I mean, shouldnt a fasta file contain A;G;C;T instead of A;C;D;E;F;G;H;I;K;L;M;N;P;Q;R;S;T;V;W;Y?

    Edit: Thanks for the explanations, but you kinda ruined the fun.


    holli

    You can lead your users to water, but alas, you cannot drown them.

      I'm not a bio-guy, but I think this is IUPAC notation for Earthian DNA.


      Give a man a fish:  <%-{-{-{-<

      https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=BlastHelp