I have a text file containing many protein sequences in FASTA format. I want to write a Perl script that parses the file and, for every sequence, counts how many amino acids of each type it contains (e.g. prot1: so many of these, so many of those, etc.); this part is achieved below. Then I want to calculate (1) the mean hydrophobicity and (2) the number of hydrophobic and hydrophilic amino acids, based on this table:

   1.800   A  Ala
  -3.500   BB Asx
   2.500   C  Cys   Note: Columns 1-8 must contain 1 numeric value only
  -3.500   D  Asp
  -3.500   E  Glu   Note: This file is required for amphpathic helic
   2.800   F  Phe
  -0.400   G  Gly
  -3.200   H  His
   4.500   I  Ile
  -3.900   K  Lys
   3.800   L  Leu
   1.900   M  Met
  -3.500   N  Asn
  -1.600   P  Pro
  -3.500   Q  Gln
  -4.500   R  Arg
  -0.800   S  Ser
  -0.700   T  Thr
   4.200   V  Val
  -0.900   W  Trp
  -0.490   X- Unk
  -1.300   Y  Tyr
  -3.500   ZZ Glx
  -0.490   ** ***
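As a starting point, the table above can be loaded into a hash keyed by one-letter code. This is only a sketch: `parse_scale` is a made-up helper name, only a few rows of the table are repeated to keep it short, and it assumes that entries with non-single-letter codes (BB, ZZ, X-, **) should be skipped.

```perl
use strict;
use warnings;

# A few rows copied from the table in the post (value, code, name).
# In a real script the whole table would be read from its file instead.
my $table = <<'END';
   1.800   A  Ala
  -3.500   BB Asx
   2.500   C  Cys   Note: Columns 1-8 must contain 1 numeric value only
  -4.500   R  Arg
   4.200   V  Val
  -0.490   X- Unk
END

# parse_scale is a hypothetical helper: it returns a hash mapping
# one-letter residue codes to their hydrophobicity values.
sub parse_scale {
    my ($text) = @_;
    my %scale;
    for my $line (split /\n/, $text) {
        # Leading number, then the residue code; trailing notes are ignored.
        next unless $line =~ /^\s*(-?\d+(?:\.\d+)?)\s+(\S+)/;
        my ($value, $code) = ($1, $2);
        # Assumption: keep only plain one-letter codes (skips BB, ZZ, X-, **).
        next unless $code =~ /^[A-Z]$/;
        $scale{$code} = $value;
    }
    return %scale;
}

my %hydro = parse_scale($table);
printf "%s\t%.3f\n", $_, $hydro{$_} for sort keys %hydro;
```

Matching the number with a regex rather than fixed columns is a design choice here; it tolerates the slightly uneven leading whitespace in the pasted table.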

So far I have:

my $filename = 'all.fasta.txt';
open (my $fh, "<", $filename) or die $!;

my %s;# a hash of arrays, to hold each line of sequence
my %seq; #a hash to hold the AA sequences.
my $key;

while (<$fh>){ #Read the FASTA file.
    chomp;

    if (/^>/){ # header line: starts with ">"
        s/^>//;
        $key= $_;
    }else{
        push (@{$s{$key}}, $_);
    }

}

foreach my $id (keys %s){ # $id rather than $a, which sort() uses
    my $s= join("", @{$s{$id}});
    $seq{$id}=$s;
    #print("$id\t$s\n");
}

my @aa= qw(A R N D C Q E G H I L K M F P S T W Y V);
my $aa= join("\t", @aa);
print ("Sequence\t$aa\n");

foreach my $k (keys %seq){
    my %count; # a hash to hold the count for each amino acid in the protein.
    my @seq= split(//, $seq{$k});
    foreach my $r (@seq){
        $count{$r}++;
    }
    my @row;
    push(@row, $k);
    foreach my $res (@aa){
        $count{$res}||=0;
        # as a percentage instead: sprintf("%0.1f",($count{$res}/length($seq{$k}))*100)
        push(@row,$count{$res});
    }
    my $row= join("\t",@row);
    print("\n$row\n");
}

This code produces a table in which every protein is described by its amino acid counts (how many A's, and so on). How do I go from here? I think I need to parse this new table and, for each protein, multiply each count by the corresponding value from the hydrophobicity table. The protein file looks like this (only one protein is shown here, but the file contains many):

>gi|6103257|emb|CAB07737.2| glycoprotein [Viral hemorrhagic septicemia virus]
MEWNTFFLVILIIIIKSTTPQITQRPPVENISTYHADWDTPLYTHPSNCREDSFVPIRPAQLRCPHEFED
INKGLVSVPTQIIHLPLSVTSVSAVASGHYLHRVTYRVTCSTSFFGGQTIEKTILEAKLSRQEATNEASK
DHEYPFFPEPSCIWMKNNVHKDITHYYKTPKTVSVDLYSRKFLNPDFIEGVCTTSPCQTHWQGVYWVGAT
.....
...and then the next one...
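One possible way to get both figures is to work directly on each joined sequence rather than re-parsing the count table. The sketch below is not a definitive answer: `hydro_stats` is a made-up helper name, the hash repeats the 20 standard residues from the table in the post, a residue is assumed to be hydrophobic when its table value is positive and hydrophilic otherwise, and residues missing from the hash (B, Z, X, gaps) are assumed to be skipped and left out of the mean.

```perl
use strict;
use warnings;

# Values copied from the table in the post, standard 20 residues only.
my %hydro = (
    A =>  1.8, R => -4.5, N => -3.5, D => -3.5, C =>  2.5,
    Q => -3.5, E => -3.5, G => -0.4, H => -3.2, I =>  4.5,
    L =>  3.8, K => -3.9, M =>  1.9, F =>  2.8, P => -1.6,
    S => -0.8, T => -0.7, W => -0.9, Y => -1.3, V =>  4.2,
);

# hydro_stats is a hypothetical helper. For one sequence it returns
# (mean hydrophobicity, hydrophobic count, hydrophilic count).
sub hydro_stats {
    my ($sequence) = @_;
    my ($sum, $scored, $phobic, $philic) = (0, 0, 0, 0);
    for my $residue (split //, uc $sequence) {
        # Assumption: residues without a table value are ignored entirely.
        next unless exists $hydro{$residue};
        my $value = $hydro{$residue};
        $sum += $value;
        $scored++;
        # Assumption: the sign of the table value decides the class.
        if ($value > 0) { $phobic++ } else { $philic++ }
    }
    # Avoid dividing by zero for an empty or unscorable sequence.
    my $mean = $scored ? $sum / $scored : 0;
    return ($mean, $phobic, $philic);
}

# First ten residues of the example protein from the post.
my ($mean, $phobic, $philic) = hydro_stats('MEWNTFFLVI');
printf "mean=%.3f hydrophobic=%d hydrophilic=%d\n", $mean, $phobic, $philic;
```

In the posted script this could be called as `hydro_stats($seq{$k})` inside the final loop. Working from the per-residue counts gives the same mean: sum count times table value over the 20 residues, then divide by the number of scored residues.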

In reply to perl mean hydrophobicity protein fasta by Megiddo
