I have a text file containing many protein sequences in FASTA format. I want to write a Perl script that parses the file and, for every sequence, counts how many amino acids of each type it contains. E.g. prot1: so many of these, so many of those, etc. (the code below already does this). Then I want to calculate (1) the mean hydrophobicity and (2) the number of hydrophobic and hydrophilic amino acids, based on this table:
     1.800  A   Ala
    -3.500  BB  Asx
     2.500  C   Cys    Note: Columns 1-8 must contain 1 numeric value only
    -3.500  D   Asp
    -3.500  E   Glu    Note: This file is required for amphpathic helic
     2.800  F   Phe
    -0.400  G   Gly
    -3.200  H   His
     4.500  I   Ile
    -3.900  K   Lys
     3.800  L   Leu
     1.900  M   Met
    -3.500  N   Asn
    -1.600  P   Pro
    -3.500  Q   Gln
    -4.500  R   Arg
    -0.800  S   Ser
    -0.700  T   Thr
     4.200  V   Val
    -0.900  W   Trp
    -0.490  X-  Unk
    -1.300  Y   Tyr
    -3.500  ZZ  Glx
    -0.490  **  ***
So far I got:
    use strict;
    use warnings;

    my $filename = 'all.fasta.txt';
    open(my $fh, '<', $filename) or die "Cannot open $filename: $!";

    my %s;      # a hash of arrays, to hold each line of sequence
    my %seq;    # a hash to hold the AA sequences
    my $key;

    while (<$fh>) {    # read the FASTA file
        chomp;
        if (/^>/) {    # header line: use the text after '>' as the key
            s/^>//;
            $key = $_;
        }
        else {
            push @{ $s{$key} }, $_;
        }
    }
    close $fh;

    # join the lines of each record into one sequence string
    foreach my $id (keys %s) {
        $seq{$id} = join '', @{ $s{$id} };
    }

    my @aa = qw(A R N D C Q E G H I L K M F P S T W Y V);
    print "Sequence\t", join("\t", @aa), "\n";

    foreach my $k (keys %seq) {
        my %count;    # a hash to hold the count for each amino acid in the protein
        $count{$_}++ for split //, $seq{$k};

        my @row = ($k);
        foreach my $res (@aa) {
            # as a percentage instead:
            # sprintf("%0.1f", (($count{$res} || 0) / length($seq{$k})) * 100)
            push @row, $count{$res} || 0;
        }
        print "\n", join("\t", @row), "\n";
    }
This code produces a table in which every protein is described by its amino acid counts (how many A's, and so on for each type). How do I go from here? I think I need to parse this new output and, for each protein, multiply each count by the corresponding value from the table. The protein file looks like this (only one protein is shown here, but the file contains many):
    >gi|6103257|emb|CAB07737.2| glycoprotein [Viral hemorrhagic septicemia virus]
    MEWNTFFLVILIIIIKSTTPQITQRPPVENISTYHADWDTPLYTHPSNCREDSFVPIRPAQLRCPHEFED
    INKGLVSVPTQIIHLPLSVTSVSAVASGHYLHRVTYRVTCSTSFFGGQTIEKTILEAKLSRQEATNEASK
    DHEYPFFPEPSCIWMKNNVHKDITHYYKTPKTVSVDLYSRKFLNPDFIEGVCTTSPCQTHWQGVYWVGAT
    .....

...and then the next one...
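One possible way forward, as a rough sketch rather than a definitive answer: instead of re-parsing the printed count table, the hydrophobicity values can be kept in a hash and summed directly over each sequence string, which is equivalent to multiplying each count by its table value and adding the products. The helper name `hydro_stats` is hypothetical. The sketch also makes two assumptions that are not stated in the question: a residue counts as hydrophobic when its table value is positive and as hydrophilic otherwise, and residues outside the 20 standard letters (B, Z, X, *) are skipped.

```perl
use strict;
use warnings;

# Hydrophobicity values for the 20 standard residues, copied from the table above.
my %hydro = (
    A =>  1.8, R => -4.5, N => -3.5, D => -3.5, C =>  2.5,
    Q => -3.5, E => -3.5, G => -0.4, H => -3.2, I =>  4.5,
    L =>  3.8, K => -3.9, M =>  1.9, F =>  2.8, P => -1.6,
    S => -0.8, T => -0.7, W => -0.9, Y => -1.3, V =>  4.2,
);

# Hypothetical helper: returns (mean hydrophobicity, number of hydrophobic
# residues, number of hydrophilic residues) for one sequence string.
# Assumption: value > 0 means hydrophobic, anything else hydrophilic.
# Assumption: residues not present in %hydro are ignored entirely.
sub hydro_stats {
    my ($sequence) = @_;
    my ($sum, $n, $phobic, $philic) = (0, 0, 0, 0);

    for my $res (split //, uc $sequence) {
        next unless exists $hydro{$res};
        my $h = $hydro{$res};
        $sum += $h;
        $n++;
        if ($h > 0) { $phobic++ } else { $philic++ }
    }

    # Avoid division by zero for empty or all-unknown sequences.
    my $mean = $n ? $sum / $n : 0;
    return ($mean, $phobic, $philic);
}

# Example call on the first residues of the sequence shown above.
my ($mean, $phobic, $philic) = hydro_stats('MEWNTFFLVILIIIIKSTTPQ');
printf "mean=%.3f hydrophobic=%d hydrophilic=%d\n", $mean, $phobic, $philic;
```

In the original script this could be called inside the `foreach my $k (keys %seq)` loop as `hydro_stats($seq{$k})`, with the three results pushed onto the output row after the counts.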
In reply to perl mean hydrophobicity protein fasta by Megiddo