345qwerty has asked for the wisdom of the Perl Monks concerning the following question:

Hello!

I have two files with thousands of proteins: 1 file = protein ID + sequence of amino acids; 2 file = protein ID+sequence of nucleotides. And my third file is usual file with position of domains which are related with my proteins and nucleotides files. I related these three files with this code:

acids.txt file contains:

>ENST00000274849|Q9ULW3 MEAEESEKAATEQEPLEGTEQTLDAEEEQEESEEAACGSKKRVVPGIVYLGHIPPRFRPL HVRNLLSAYGEVGRVFFQAEDRFVRRKKKAAAAAGGKKRSYTKDYTEGWVEFRDKRIAKR VAASLHNTPMGARRRSPFRYDLWNLKYLHRFTWSH*

FOR EXAMPLE

nucleotides.txt file contains:

>ENST00000274849|Q9ULW3 ATGGAGGCAGAGGAATCGGAGAAGGCCGCAACGGAGCAAGAGCCGCTGGAAGGGACAGAA CAGACACTAGATGCGGAGGAGGAGCAGGAGGAATCCGAAGAAGCGGCCTGTGGCAGCAAG AAACGGGTAGTGCCAGGTATTGTGTACCTGGGCCATATCCCGCCGCGCTTCCGGCCCCTG CACGTCCGCAACCTTCTCAGCGCCTATGGCGAGGTCGGACGCGTCTTCTTTCAGGCTGAG GACCGGTTCGTGAGACGCAAGAAGAAGGCAGCAGCAGCTGCCGGAGGAAAAAAGCGGTCC TACACCAAGGACTACACCGAGGGATGGGTGGAGTTCCGTGACAAGCGCATAGCCAAGCGC GTGGCGGCCAGTCTACACAACACGCCTATGGGTGCCCGCAGGCGCAGCCCCTTCCGTTAT GATCTTTGGAACCTCAAGTACTTGCACCGTTTCACCTGGTCCCACTGA

domain.txt file contains:

Q9ULW3;    46    142

NOTE: this numbers mean the position domain in my sequence

use strict; use Bio::SeqIO; #################################################### #MODULE 1: read protein file, and save it in a hash# #################################################### my %hash1; my $sequence = "acids.txt"; my $multifasta = Bio::SeqIO ->new (-file => "<$sequence",-format=> "fa +sta"); while (my $seq= $multifasta->next_seq()) { my $na= $seq->display_id; #Saves the ID in $na my $des=$seq->description; my $ss = $seq->seq; $hash1{$na} = $ss; } ############################################################# #MODULE 2: read nucleotide file, and save it in another hash# ############################################################# my %hash2; my $genes = "nucleotides.txt"; my $multifasta = Bio::SeqIO ->new (-file => "<$genes",-format=> "fasta +"); while (my $seq= $multifasta->next_seq()) { my $na= $seq->display_id; #Saves the ID in $na my $des=$seq->description; my $ss = $seq->seq; $hash2{$na} = $ss; } ##################### #MODULE 3: my $name;# ##################### my $name; # Read from standard input chomp $name; ###################################################################### +#### #MODULE 4: DOMAIN ANNOTATION + RELATED AMINO ACIDS AND NUCLEOTIDES IN +COLUMNS# ###################################################################### +#### foreach my $name (keys %hash1) { my $ac = (split(/\s*\|/, $name))[1]; #print "$ac\n" ; ################################################# #MODULE 4.1: DOMAIN ANNOTATION: POSITION OF DOMAINS# ################################################# open(FILE, "<" ,"domain.txt"); my @array = (<FILE>); my @lines = grep (/$ac/, @array); print for @lines; close (FILE); ####################################################### #MODULE 4.2: RELATED AMINO ACIDS AND NUCLEOTIDES IN COLUMNS# ######################################################## my @array1 = split(//, $hash1{$name}, $hash2{$name}); #CUT SEQUEN +CE my @array2 = unpack("a3" x (length($hash1{$name})),$hash2{$name}); + #CUT NUCLEOTIDE SEQUENCE my $number = "$#array1+1"; foreach (my $count = 0; $count <= $number; $count++) { print "$count\t@array1[$count]\t@array2[$count]\n"; } }

And here is my FILE which I got after running the script:

Q9ULW3; 46 142 0 M ATG 1 E GAG 2 A GCA 3 E GAG 4 E GAA 5 S TCG 6 E GAG 7 K AAG 8 A GCC 9 A GCA 10 T ACG 11 E GAG 12 Q CAA 13 E GAG 14 P CCG 15 L CTG 16 E GAA 17 G GGG 18 T ACA 19 E GAA 20 Q CAG 21 T ACA 22 L CTA 23 D GAT 24 A GCG 25 E GAG 26 E GAG 27 E GAG 28 Q CAG 29 E GAG 30 E GAA 31 S TCC 32 E GAA 33 E GAA 34 A GCG 35 A GCC 36 C TGT 37 G GGC 38 S AGC 39 K AAG 40 K AAA 41 R CGG 42 V GTA 43 V GTG 44 P CCA 45 G GGT 46 I ATT 47 V GTG 48 Y TAC 49 L CTG 50 G GGC 51 H CAT 52 I ATC 53 P CCG 54 P CCG 55 R CGC 56 F TTC 57 R CGG 58 P CCC 59 L CTG 60 H CAC 61 V GTC 62 R CGC 63 N AAC 64 L CTT 65 L CTC 66 S AGC 67 A GCC 68 Y TAT 69 G GGC 70 E GAG 71 V GTC 72 G GGA 73 R CGC 74 V GTC 75 F TTC 76 F TTT 77 Q CAG 78 A GCT 79 E GAG 80 D GAC 81 R CGG 82 F TTC 83 V GTG 84 R AGA 85 R CGC 86 K AAG 87 K AAG 88 K AAG 89 A GCA 90 A GCA 91 A GCA 92 A GCT 93 A GCC 94 G GGA 95 G GGA 96 K AAA 97 K AAG 98 R CGG 99 S TCC 100 Y TAC 101 T ACC 102 K AAG 103 D GAC 104 Y TAC 105 T ACC 106 E GAG 107 G GGA 108 W TGG 109 V GTG 110 E GAG 111 F TTC 112 R CGT 113 D GAC 114 K AAG 115 R CGC 116 I ATA 117 A GCC 118 K AAG 119 R CGC 120 V GTG 121 A GCG 122 A GCC 123 S AGT 124 L CTA 125 H CAC 126 N AAC 127 T ACG 128 P CCT 129 M ATG 130 G GGT 131 A GCC 132 R CGC 133 R AGG 134 R CGC 135 S AGC 136 P CCC 137 F TTC 138 R CGT 139 Y TAT 140 D GAT 141 L CTT 142 W TGG 143 N AAC 144 L CTC 145 K AAG 146 Y TAC 147 L TTG 148 H CAC 149 R CGT 150 F TTC 151 T ACC 152 W TGG 153 S TCC 154 H CAC 155 L CTC

Now I should add a new fourth column which will contain 'YES' or 'NOT' - it depends on which codons are in domain - YES, which are not in domain - NOT. So, here is domains in the positions from 46 till 142. I would like to get this OUTPUT FILE:

Q9ULW3; 46 142 0 M ATG NOT 1 E GAG NOT 2 A GCA NOT 3 E GAG NOT 4 E GAA NOT 5 S TCG NOT 6 E GAG NOT 7 K AAG NOT 8 A GCC NOT 9 A GCA NOT 10 T ACG NOT 11 E GAG NOT 12 Q CAA NOT 13 E GAG NOT 14 P CCG NOT 15 L CTG NOT 16 E GAA NOT 17 G GGG NOT 18 T ACA NOT 19 E GAA NOT 20 Q CAG NOT 21 T ACA NOT 22 L CTA NOT 23 D GAT NOT 24 A GCG NOT 25 E GAG NOT 26 E GAG NOT 27 E GAG NOT 28 Q CAG NOT 29 E GAG NOT 30 E GAA NOT 31 S TCC NOT 32 E GAA NOT 33 E GAA NOT 34 A GCG NOT 35 A GCC NOT 36 C TGT NOT 37 G GGC NOT 38 S AGC NOT 39 K AAG NOT 40 K AAA NOT 41 R CGG NOT 42 V GTA NOT 43 V GTG NOT 44 P CCA NOT 45 G GGT NOT 46 I ATT YES 47 V GTG YES 48 Y TAC YES 49 L CTG YES 50 G GGC YES 51 H CAT YES 52 I ATC YES 53 P CCG YES 54 P CCG YES 55 R CGC YES 56 F TTC YES 57 R CGG YES 58 P CCC YES 59 L CTG YES 60 H CAC YES 61 V GTC YES 62 R CGC YES 63 N AAC YES 64 L CTT YES 65 L CTC YES 66 S AGC YES 67 A GCC YES 68 Y TAT YES 69 G GGC YES 70 E GAG YES 71 V GTC YES 72 G GGA YES 73 R CGC YES 74 V GTC YES 75 F TTC YES 76 F TTT YES 77 Q CAG YES 78 A GCT YES 79 E GAG YES 80 D GAC YES 81 R CGG YES 82 F TTC YES 83 V GTG YES 84 R AGA YES 85 R CGC YES 86 K AAG YES 87 K AAG YES 88 K AAG YES 89 A GCA YES 90 A GCA YES 91 A GCA YES 92 A GCT YES 93 A GCC YES 94 G GGA YES 95 G GGA YES 96 K AAA YES 97 K AAG YES 98 R CGG YES 99 S TCC YES 100 Y TAC YES 101 T ACC YES 102 K AAG YES 103 D GAC YES 104 Y TAC YES 105 T ACC YES 106 E GAG YES 107 G GGA YES 108 W TGG YES 109 V GTG YES 110 E GAG YES 111 F TTC YES 112 R CGT YES 113 D GAC YES 114 K AAG YES 115 R CGC YES 116 I ATA YES 117 A GCC YES 118 K AAG YES 119 R CGC YES 120 V GTG YES 121 A GCG YES 122 A GCC YES 123 S AGT YES 124 L CTA YES 125 H CAC YES 126 N AAC YES 127 T ACG YES 128 P CCT YES 129 M ATG YES 130 G GGT YES 131 A GCC YES 132 R CGC YES 133 R AGG YES 134 R CGC YES 135 S AGC YES 136 P CCC YES 137 F TTC YES 138 R CGT YES 139 Y TAT YES 140 D GAT YES 141 L CTT YES 142 W TGG YES 143 N AAC NOT 144 L CTC NOT 145 K AAG NOT 146 Y TAC NOT 147 L TTG NOT 148 H CAC NOT 149 R CGT NOT 150 F TTC NOT 151 T ACC NOT 152 W TGG NOT 153 S TCC NOT 154 H CAC NOT 155 L CTC NOT

This is example for one protein, I have to do it for thousands proteins. Please, do you have any suggestions?

Thank you!

Replies are listed 'Best First'.
Re: Find codon usage in domain based on domain's position from a file
by choroba (Cardinal) on Aug 17, 2017 at 09:30 UTC
    I tried this. There might be an easier BioPerl solution, but I don't know BioPerl.

    open my $IN, '<', 'domain.txt' or die $!; while (<$IN>) { my ($id, $from, $to) = split /[;\s]+/; print "$id; $from $to\n"; for my $pos (0 .. length($hash1{$id})) { print join ' ', $pos, substr($hash1{$id}, $pos, 1), substr($hash2{$id}, $pos * 3, 3), $from <= $pos && $pos <= $to ? 'YES' : 'NOT', "\n"; } }

    Update: The output is now almost the same as yours. The problem was caused by wrong formatting of the post - please use code tags for data, too.

    Update 2: This is how I populated the hashes. Again, no BioPerl.

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

      Now I added <code> in my data

Re: Find codon usage in domain based on domain's position from a file
by LanX (Saint) on Aug 17, 2017 at 09:19 UTC
    Welcome to biomonks ;)

    Please explain

    • codon
    • domain
    • multifasta
    • amino acids,
    • nucleotides.
    • protwins

    Otherwise please try to phrase your question in a generally understood language?

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!

Re: Find codon usage in domain based on domain's position from a file
by marto (Cardinal) on Aug 17, 2017 at 09:16 UTC

    My first suggestion would be to put code tags around your data, like it tells you every time you post:

    "Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!"

Re: Find codon usage in domain based on domain's position from a file
by choroba (Cardinal) on Aug 18, 2017 at 08:29 UTC
    Crossposted to StackOverflow. Crossposting is fine, but it's considered polite to inform about it to prevent unnecessary work from people who don't attend both the sites.

    Also, you probably posted the question there because my solution here didn't work for you. What was the problem?

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,