Megiddo has asked for the wisdom of the Perl Monks concerning the following question:

I have a textfile containing many fasta protein sequences; I want to write a perl script to parse the text and for every sequence to count how many aminoacids of which type. Ex: prot1: so many of these and so many of those etc... (this is achieved below). Then I want to calculate (1)the mean hydrophobicity and (2)the number of hydrophobic and hydrophilic acids, based on this info:

1.800 A Ala -3.500 BB Asx 2.500 C Cys Note: Columns 1-8 must contain 1 numeric value onl +y -3.500 D Asp -3.500 E Glu Note: This file is required for amphpathic helic 2.800 F Phe -0.400 G Gly -3.200 H His 4.500 I Ile -3.900 K Lys 3.800 L Leu 1.900 M Met -3.500 N Asn -1.600 P Pro -3.500 Q Gln -4.500 R Arg -0.800 S Ser -0.700 T Thr 4.200 V Val -0.900 W Trp -0.490 X- Unk -1.300 Y Tyr -3.500 ZZ Glx -0.490 ** ***

So far I got:

my $filename = 'all.fasta.txt'; open (my $fh, "<", $filename) or die $!; my %s;# a hash of arrays, to hold each line of sequence my %seq; #a hash to hold the AA sequences. my $key; while (<$fh>){ #Read the FASTA file. chomp; if (/>/){ s/>//; $key= $_; }else{ push (@{$s{$key}}, $_); } } foreach my $a (keys %s){ my $s= join("", @{$s{$a}}); $seq{$a}=$s; #print("$a\t$s\n"); } my @aa= qw(A R N D C Q E G H I L K M F P S T W Y V); my $aa= join("\t", @aa); print ("Sequence\t$aa\n"); foreach my $k (keys %seq){ my %count; # a hash to hold the count for each amino acid in the p +rotein. my @seq= split(//, $seq{$k}); foreach my $r(@seq){ $count{$r}++; } my @row; push(@row, $k); foreach my $a (@aa){ $count{$a}||=0; $count{$a}; #= sprintf("%0.1f",($count{$a}/length($seq{$k})) +*100); push(@row,$count{$a}); } my $row= join("\t",@row); print("\n$row\n"); }

This code gives a file in which every protein is described in terms of aa`s (how many A`s, how many etc.) How do I go from here ? I need to parse the new text in which every protein is described in terms of how many aa`s of what type and for each to multiply the values from the table (I think). protein file looks like this:(here is just on protein but the file contains many)

>gi|6103257|emb|CAB07737.2| glycoprotein [Viral hemorrhagic septicemia + virus] MEWNTFFLVILIIIIKSTTPQITQRPPVENISTYHADWDTPLYTHPSNCREDSFVPIRPAQ +LRCPHEFED INKGLVSVPTQIIHLPLSVTSVSAVASGHYLHRVTYRVTCSTSFFGGQTIEKTILEAKL +SRQEATNEASK DHEYPFFPEPSCIWMKNNVHKDITHYYKTPKTVSVDLYSRKFLNPDFIEGVCTTSPC +QTHWQGVYWVGAT ..... ...and then the next one...

Replies are listed 'Best First'.
Re: perl mean hydrophobicity protein fasta
by Eily (Monsignor) on Sep 09, 2016 at 16:38 UTC

    You forgot to translate your problem into a perl problem, this means that even if we know how to solve whatever perl problem you're having, it's hard to even understand the question.

    Your code doesn't help. You should have more explicit names for your variables (avoid @aa, even if you know it means aminoacids, it kind of looks like you didn't want to give a proper name and couldn't use @a because it was already taken). Avoid names with a single letter, especially $a which is a special variable (so some functions and modules may change its value). And most of all, avoid having several variables with the same name; even if the different types is enough to tell them appart, sigil invariance (the @ and % that sometimes become $) just makes the whole thing confusing.

    Since all your data is on a single line, which you do not split while reading the input file, it all gets into $key, and nothing is pushed into the array. So either the sample data you gave us doesn't formatted as in your input file, or there's something wrong with your code.

    Here's a little something, from what I guessed you are trying to do:

    use strict; use warnings; use Data::Dumper; my %reference; $\ = "\n"; while (<DATA>) { # Fill an array with the information from the reference table $reference{$2} = {Value => $1, Name => $3} if /(\S+)\s+(\S+)\s+(\S+) +/; # Three groups of non space characters separated by blanks } # Show the content of the hash print Dumper \%reference; # quotemeta adds a \ in front of special chars (here *) so that they l +ose their special meaning in the regex # map applies the expression on all the elements in the list, so here +this is a quotemeta applied on all the keys my $pattern = join "|", map quotemeta, keys %reference; my $sequence = 'ABBAPERL**'; # With this method, BB and ** become a single element, not two as in s +plit // my @acids = $sequence =~ /($pattern)/g; print "Splitted sequence is: @acids\n"; my %count; my $sum = 0; for my $acid (@acids) { # Translate the name with the reference table print "Found $reference{$acid}{Name}"; $count{$acid}++; $sum += $reference{$acid}{Value}; } print "Sum: $sum\n"; # Bonus: a quick way to translate all the acids into their longer name +, using map to apply the translation on the whole list print join " ", map $reference{$_}{Name}, @acids; __DATA__ 1.800 A Ala -3.500 BB Asx 2.500 C Cys Note: Columns 1-8 must contain 1 numeric value only -3.500 D Asp -3.500 E Glu Note: This file is required for amphpathic helic 2.800 F Phe -0.400 G Gly -3.200 H His 4.500 I Ile -3.900 K Lys 3.800 L Leu 1.900 M Met -3.500 N Asn -1.600 P Pro -3.500 Q Gln -4.500 R Arg -0.800 S Ser -0.700 T Thr 4.200 V Val -0.900 W Trp -0.490 X- Unk -1.300 Y Tyr -3.500 ZZ Glx -0.490 ** ***
    My variable names may not be that good, but you understand what everything means better than me.

      Thank you. It helps. I think I`ll get to the bottom of this. Sorry for not being clear.