I have a text file containing many protein sequences in FASTA format. I want to write a Perl script that parses the file and, for every sequence, counts how many amino acids of each type it contains, e.g. prot1: so many of these, so many of those, etc. (this is achieved below). Then I want to calculate (1) the mean hydrophobicity and (2) the number of hydrophobic and hydrophilic amino acids, based on this table:

     1.800 A  Ala
    -3.500 BB Asx
     2.500 C  Cys   Note: Columns 1-8 must contain 1 numeric value only
    -3.500 D  Asp
    -3.500 E  Glu   Note: This file is required for amphpathic helic
     2.800 F  Phe
    -0.400 G  Gly
    -3.200 H  His
     4.500 I  Ile
    -3.900 K  Lys
     3.800 L  Leu
     1.900 M  Met
    -3.500 N  Asn
    -1.600 P  Pro
    -3.500 Q  Gln
    -4.500 R  Arg
    -0.800 S  Ser
    -0.700 T  Thr
     4.200 V  Val
    -0.900 W  Trp
    -0.490 X- Unk
    -1.300 Y  Tyr
    -3.500 ZZ Glx
    -0.490 ** ***
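A minimal sketch of how a table in this layout could be loaded into a Perl hash. The sub name `parse_scale` is hypothetical, and the sketch assumes each row is "value, code, name, optional note" separated by whitespace, with the first character of the code column taken as the one-letter code (so `BB` becomes `B`). That last reading is a guess about the file format, not something the table itself states.

```perl
use strict;
use warnings;

# Parse hydropathy table text into a hash: one-letter code => value.
# Assumption: rows look like "value code name [optional note]" and the
# first character of the code column is the one-letter amino acid code.
sub parse_scale {
    my ($text) = @_;
    my %scale;
    for my $line (split /\n/, $text) {
        # Skip anything that does not start with a number and two more fields.
        next unless $line =~ /^\s*([-+]?\d+(?:\.\d+)?)\s+(\S+)\s+(\S+)/;
        my ($value, $code) = ($1, $2);
        $scale{ substr($code, 0, 1) } = $value + 0;   # "+ 0" forces a number
    }
    return %scale;
}

# A few rows copied from the table above, including one with a trailing note.
my $sample = <<'END';
 1.800 A  Ala
-3.500 BB Asx
 2.500 C  Cys   Note: Columns 1-8 must contain 1 numeric value only
 4.500 I  Ile
END

my %scale = parse_scale($sample);
printf "%s\t%.3f\n", $_, $scale{$_} for sort keys %scale;
```

Trailing notes are ignored because the pattern only captures the first three fields of each row.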

So far I got:

    use strict;
    use warnings;

    my $filename = 'all.fasta.txt';
    open(my $fh, '<', $filename) or die "Cannot open $filename: $!";

    my %s;    # a hash of arrays, to hold each line of sequence
    my %seq;  # a hash to hold the AA sequences
    my $key;

    # Read the FASTA file.
    while (<$fh>) {
        chomp;
        if (/^>/) {               # header line: only match '>' at line start
            s/^>//;
            $key = $_;
        }
        elsif (defined $key) {    # ignore anything before the first header
            push @{ $s{$key} }, $_;
        }
    }
    close $fh;

    # Join the lines of each record into one sequence string.
    foreach my $id (keys %s) {
        $seq{$id} = join('', @{ $s{$id} });
        #print("$id\t$seq{$id}\n");
    }

    my @aa = qw(A R N D C Q E G H I L K M F P S T W Y V);
    my $aa = join("\t", @aa);
    print("Sequence\t$aa\n");

    foreach my $k (keys %seq) {
        my %count;  # a hash to hold the count for each amino acid in the protein
        my @seq = split(//, $seq{$k});
        foreach my $r (@seq) {
            $count{$r}++;
        }
        my @row;
        push(@row, $k);
        foreach my $res (@aa) {
            $count{$res} ||= 0;
            # For percentages instead of raw counts:
            # $count{$res} = sprintf("%0.1f", ($count{$res} / length($seq{$k})) * 100);
            push(@row, $count{$res});
        }
        my $row = join("\t", @row);
        print("\n$row\n");
    }

This code produces a table in which every protein is described by its amino acid counts (how many A's, and so on). How do I go on from here? I think I need to parse this new table and, for each protein, multiply each count by the corresponding value from the hydrophobicity table. The protein file looks like this (only one protein is shown here, but the file contains many):

    >gi|6103257|emb|CAB07737.2| glycoprotein [Viral hemorrhagic septicemia virus]
    MEWNTFFLVILIIIIKSTTPQITQRPPVENISTYHADWDTPLYTHPSNCREDSFVPIRPAQLRCPHEFED
    INKGLVSVPTQIIHLPLSVTSVSAVASGHYLHRVTYRVTCSTSFFGGQTIEKTILEAKLSRQEATNEASK
    DHEYPFFPEPSCIWMKNNVHKDITHYYKTPKTVSVDLYSRKFLNPDFIEGVCTTSPCQTHWQGVYWVGAT
    .....

...and then the next one...
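One possible way forward, sketched under stated assumptions: rather than re-parsing the printed count table, the two quantities can be computed directly from each sequence string. The sub name `hydro_stats` is hypothetical. The question does not say where to draw the line between hydrophobic and hydrophilic, so this sketch assumes a residue is hydrophobic when its table value is above zero and hydrophilic when it is below zero. Residues not among the 20 standard amino acids are skipped.

```perl
use strict;
use warnings;

# Values for the 20 standard amino acids, copied from the table in the question.
my %hydro = (
    A =>  1.8, R => -4.5, N => -3.5, D => -3.5, C =>  2.5,
    Q => -3.5, E => -3.5, G => -0.4, H => -3.2, I =>  4.5,
    L =>  3.8, K => -3.9, M =>  1.9, F =>  2.8, P => -1.6,
    S => -0.8, T => -0.7, W => -0.9, Y => -1.3, V =>  4.2,
);

# Returns (mean hydrophobicity, hydrophobic count, hydrophilic count)
# for one sequence string.
# Assumption (not from the question): value > 0 means hydrophobic,
# value < 0 means hydrophilic. Residues missing from %hydro are skipped.
sub hydro_stats {
    my ($seq) = @_;
    my ($sum, $n, $phobic, $philic) = (0, 0, 0, 0);
    for my $r (split //, uc $seq) {
        next unless exists $hydro{$r};
        my $v = $hydro{$r};
        $sum += $v;
        $n++;
        $phobic++ if $v > 0;
        $philic++ if $v < 0;
    }
    my $mean = $n ? $sum / $n : 0;    # avoid dividing by zero on empty input
    return ($mean, $phobic, $philic);
}

# Demo on the first ten residues of the example protein.
my ($mean, $phobic, $philic) = hydro_stats('MEWNTFFLVI');
printf "mean=%.3f hydrophobic=%d hydrophilic=%d\n", $mean, $phobic, $philic;
```

The same sum could also be built from the per-protein counts, by adding up each count multiplied by its table value and dividing by the number of residues counted. Inside the existing loop, a call such as `hydro_stats($seq{$k})` would give one result per protein.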

In reply to perl mean hydrophobicity protein fasta by Megiddo
