Re^3: calculation of charged amino acids

sample input is:

perl count_charge_aa.pl <filename.extension>

this is the code i'm suggested with

 
#!/usr/bin/perl -w
use strict;
use warnings;

my $file = $ARGV[0];

my @currentProtein;
my %protein;

open (FASTA, "<", $file) || die "Can't open $file\n";
my $fastaLine; 

my $protSequence;


while (<FASTA>) 

  {
    #print STDERR "$_";
    chomp;
    if (($_ !~ /^>/) && ($_ =~ /\w/)) 
      {
      #@currentProtein= $_;
      push (@currentProtein, $_);
      }
    if (/^>/) 
      {
          $fastaLine = $_;
      }

    if ((($_ !~ /\w/) || (eof)) && (@currentProtein > 0))
      {
        $protSequence = join("", @currentProtein);
        
        $protein{$fastaLine} = $protSequence;
        @currentProtein = ();
      }
   
    }
close FASTA;
foreach my $key (sort(keys %protein)) 
    {
      my $count_of_acidic = 0;
      my $count_of_basic = 0;
      my $count_of_neutral = 0;
      my $aa;
      my $sequence = "$protein{$key}";
      $sequence =~ s/\s//g;
      my @prot=split("",$sequence); #splits string into an array
      #print " \nThe original PROTEIN file is:\n$sequence \n";

      while(@prot)
        {
          $aa = shift (@prot);
          if($aa =~/[DNEQ]/ig)
            {
              $count_of_acidic++;
            }
          if($aa=~/[KRH]/ig)
            {
               $count_of_basic++;
            }
          if($aa=~/[DNEQKRH]/ig)
            {
              $count_of_neutral++;
            }
      }
      print "\nName: $key\n";
      print "Number of acidic amino acids:".$count_of_acidic."\n";
      print "Number of basic amino acids:".$count_of_basic."\n";
      print "Number of neutral amino acids:".$count_of_neutral."\n";
    }
[download]

Problem description

I want to calculate the number of charges amino acids (acidic, basic and neutral) in each protein sequences which is present in fasta file. Note: I have a fasta file which have numerous protein sequences

OUTPUT

This script is calculating all the amino acids throughout the file and show the result in last protein sequence entry. Its something like that:

>gi|1064567454|ref|YP_67854|lipoprotein Myxococcous strain DK6476
Number of acidic amino acids: 312454
Number of basic amino acids: 313534
Number of neutral amino acids: 54454

It is not calculating charged amino acids sequence by sequence but as a whole

Comment on Re^3: calculation of charged amino acids Select or Download Code

Replies are listed 'Best First'.
Re^4: calculation of charged amino acids by kcott (Archbishop) on Jul 24, 2013 at 12:46 UTC
I expect that's supposed to be a response to me: you've actually responded to yourself. Here's the sort of thing I think you want: #!/usr/bin/env perl use 5.010; use strict; use warnings; use autodie; my $header = ''; my (@headers_found, %header_data); open my $fasta_fh, '<', $ARGV[0]; while (<$fasta_fh>) { chomp; next if /^\s($\|#)/; if (/^>(.)$/) { push @headers_found, ($header = $1); } else { die 'Sequence data found without a header!' unless $header; for (split '', "\U$_") { /(?<a>[DNEQ])\|(?<b>[KRH])\|(?<n>.)/; ++$header_data{$header}{(keys %+)[0]}; } } } close $fasta_fh; for (@headers_found) { say "Header: $_"; say "\tAcidic: $header_data{$_}{a}"; say "\tBasic: $header_data{$_}{b}"; say "\tNeutral: $header_data{$_}{n}"; } [download] I made a slight modification to ++mtmcc's sample data: just added some comments with and without leading whitespace for testing purposes. Do note that's what sample data should look like! It's not the command you type. Here's a sample run: `$ pm_fasta_amino_charge_calc.pl pm_mtmccs_data_modified.fasta Header: DROTME_HH_Q02936 Acidic: 14 Basic: 18 Neutral: 67 Header: DROME_HH_Q02937 Acidic: 33 Basic: 35 Neutral: 136 Header: DROME_HH_Q02938 Acidic: 18 Basic: 17 Neutral: 69` [download] Notes: The way you're identifying neutral amino acids looks wrong. I note ++Cristoforo has already addressed this issue. My code takes much the same approach: if it's not acidic or basic, then it's neutral. You're splitting the sequence into individal characters then attempting to match a pattern against a single character: there's no possibility of a repeat match so the '`g`' modifier is redundant. (I see Cristoforo has discussed this also.) You're attempting a match on every charcter three times. This is not only pointless when the matches are mutually exclusive but also wastes processing and slows down your program. I've used an alternation in my code; an if/elsif/else construct would also be more efficient. Doing a case-insensitive match ('`i`' modifier) is slower than a case-sensitive match. You're doing this three times on every character individually: I would expect a noticeable slowness with the amount of data you indicate (~700,000 characters). Change the case of the entire string once before splitting and matching. -- Ken	[reply] [d/l] [select]

Replies are listed 'Best First'.

Re^4: calculation of charged amino acids
by kcott (Archbishop) on Jul 24, 2013 at 12:46 UTC

I expect that's supposed to be a response to me: you've actually responded to yourself.

Here's the sort of thing I think you want:

#!/usr/bin/env perl

use 5.010;
use strict;
use warnings;
use autodie;

my $header = '';
my (@headers_found, %header_data);

open my $fasta_fh, '<', $ARGV[0];

while (<$fasta_fh>) {
    chomp;
    next if /^\s*($|#)/;
    
    if (/^>(.*)$/) {
        push @headers_found, ($header = $1);
    }
    else {
        die 'Sequence data found without a header!' unless $header; 
        
        for (split '', "\U$_") {
            /(?<a>[DNEQ])|(?<b>[KRH])|(?<n>.)/;
            ++$header_data{$header}{(keys %+)[0]};
        }
    }
}

close $fasta_fh;

for (@headers_found) {
    say "Header: $_";
    say "\tAcidic:  $header_data{$_}{a}";
    say "\tBasic:   $header_data{$_}{b}";
    say "\tNeutral: $header_data{$_}{n}";
}
[download]

I made a slight modification to ++mtmcc's sample data: just added some comments with and without leading whitespace for testing purposes. Do note that's what sample data should look like! It's not the command you type. Here's a sample run:

$ pm_fasta_amino_charge_calc.pl pm_mtmccs_data_modified.fasta
Header: DROTME_HH_Q02936 
        Acidic:  14
        Basic:   18
        Neutral: 67
Header: DROME_HH_Q02937 
        Acidic:  33
        Basic:   35
        Neutral: 136
Header: DROME_HH_Q02938 
        Acidic:  18
        Basic:   17
        Neutral: 69
[download]

Notes:

The way you're identifying neutral amino acids looks wrong. I note ++Cristoforo has already addressed this issue. My code takes much the same approach: if it's not acidic or basic, then it's neutral.
You're splitting the sequence into individal characters then attempting to match a pattern against a single character: there's no possibility of a repeat match so the 'g' modifier is redundant. (I see Cristoforo has discussed this also.)
You're attempting a match on every charcter three times. This is not only pointless when the matches are mutually exclusive but also wastes processing and slows down your program. I've used an alternation in my code; an if/elsif/else construct would also be more efficient.
Doing a case-insensitive match ('i' modifier) is slower than a case-sensitive match. You're doing this three times on every character individually: I would expect a noticeable slowness with the amount of data you indicate (~700,000 characters). Change the case of the entire string once before splitting and matching.

-- Ken

[reply]
[d/l]
[select]