in reply to calculation of charged amino acids

Something like this should work (file is a text file containing all the fasta sequences):

#!/usr/bin/perl -w use strict; use warnings; my $file = $ARGV[0]; my @currentProtein; my %protein; open (FASTA, "<", $file) || die "Can't open $file\n"; my $fastaLine; my $protSequence; while (<FASTA>) { #print STDERR "$_"; chomp; if (($_ !~ /^>/) && ($_ =~ /\w/)) { push (@currentProtein, $_); } if (/^>/) { $fastaLine = $_; } if ((($_ !~ /\w/) || (eof)) && (@currentProtein > 0)) { $protSequence = join("", @currentProtein); $protein{$fastaLine} = $protSequence; @currentProtein = (); } } close FASTA; foreach my $key (sort(keys %protein)) { my $count_of_acidic = 0; my $count_of_basic = 0; my $count_of_neutral = 0; my $aa; my $sequence = "$protein{$key}"; $sequence =~ s/\s//g; my @prot=split("",$sequence); #splits string into an array #print " \nThe original PROTEIN file is:\n$sequence \n"; while(@prot) { $aa = shift (@prot); if($aa =~/[DNEQ]/ig) { $count_of_acidic++; } if($aa=~/[KRH]/ig) { $count_of_basic++; } if($aa=~/[DNEQKRH]/ig) { $count_of_neutral++; } } print "\nName: $key\n"; print "Number of acidic amino acids:".$count_of_acidic."\n"; print "Number of basic amino acids:".$count_of_basic."\n"; print "Number of neutral amino acids:".$count_of_neutral."\n"; }

I hope that helps.

Replies are listed 'Best First'.
Re^2: calculation of charged amino acids
by Cristoforo (Curate) on Jul 23, 2013 at 20:24 UTC
    The 'g' switch on the three regular expressions shouldn't be there.

    If $aa =~/[DNEQ]/ig matches, then the following match, $aa=~/[KRH]/ig, fails (and the pos is reset so that the last expression, $aa=~/[DNEQKRH]/ig will match.

    However, if $aa=~/[KRH]/ig matches, then pos will be '1' and the following match, $aa=~/[DNEQKRH]/ig will fail because it will attempt to match beginning at pos 1 instead of pos 0.

    You can see this in the following code snippet.

    #!/usr/bin/perl use strict; use warnings; use 5.014; my @prot = qw/ D K /; my ($acid_cnt, $base_cnt, $neutral_cnt); while(@prot) { my $aa = shift (@prot); if($aa =~/[DNEQ]/gi) { ++$acid_cnt; } if($aa=~/[KRH]/gi) { ++$base_cnt; } say "$aa pos: ", pos($aa) // 'pos reset'; if($aa=~/[DNEQKRH]/gi) { ++$neutral_cnt; } }
    This prints

    C:\Old_Data\perlp>perl t7.pl aa D pos: pos reset aa K pos: 1

    This, in effect, counts the acid base in the neutral count but not the base.

    He probably wants the neutral count to be other than the acid or base, in which case, that regular expression should be, $aa=~/[^DNEQKRH]/i, negating the class.

    Without the 'g' switch, and negating the neutral class, the output would look like:

    C:\Old_Data\perlp>perl t9.pl test.fas Name: >DROME_HH_Q02937 Number of acidic amino acids:33 Number of basic amino acids:35 Number of neutral amino acids:136 Name: >DROME_HH_Q02938 Number of acidic amino acids:18 Number of basic amino acids:17 Number of neutral amino acids:69 Name: >DROTME_HH_Q02936 Number of acidic amino acids:14 Number of basic amino acids:18 Number of neutral amino acids:67
    Chris
Re^2: calculation of charged amino acids
by yuvraj_ghaly (Sexton) on Jul 23, 2013 at 10:31 UTC

    somehow it is calculating only one sequence.

      When I run that program on these fasta sequences:

      >DROTME_HH_Q02936 
      MRHIAHTQRCLSRLTSLVALLLIVLPMVFSPAHSCGPGRGLGRHRARNLY
      PLVLKQTIPNLSEYTNSASGPLEGVIRRDSPKFKDLVPNYNRDILFRDE
      
      >DROME_HH_Q02937 
      MRHIAHTQRCLSRLTSLVALLLIVLPMVFSPAHSCGPGRGLGRHRARNLY
      PLVLKQTIPNLSEYTNSASGPLEGVIRRDSPKFKDLVPNYNRDILFRDEE
      GTGADRLMSKRCKEKLNVLAYSVMNEWPGIRLLTTTTTTESWDEDYHHGQ
      YEGRAVTIATSDRDQSKYGMLARLAVEAGFDWVSYVSRRHIYCSVKSDSS
      ESLH
      
      >DROME_HH_Q02938 
      GTGADRLMSKRCKEKLNVLAYSVMNEWPGIRLLTTTTTTESWDEDYHHGQ
      YEGRAVTIATSDRDQSKYGMLARLAVEAGFDWVSYVSRRHIYCSVKSDSS
      ESLH
      

      I get this output:

      Name: >DROME_HH_Q02937 
      Number of acidic amino acids:33
      Number of basic amino acids:35
      Number of neutral amino acids:33
      
      Name: >DROME_HH_Q02938 
      Number of acidic amino acids:18
      Number of basic amino acids:17
      Number of neutral amino acids:18
      
      Name: >DROTME_HH_Q02936 
      Number of acidic amino acids:14
      Number of basic amino acids:18
      Number of neutral amino acids:14
      

      The fasta tags for each protein must be unique.

        are these fasta sequences in one file file or different???

        Paste all these sequences in one file, save in fasta format and then run the program. You will see the problem

          A reply falls below the community's threshold of quality. You may see it by logging in.

      so u used the same code which I input