Alignment and matrix generation

newtoperlprog has asked for the wisdom of the Perl Monks concerning the following question:

Dear PerlMonks,

I am working on an alignment file. I am trying to print the number of occurrence of individual bases (a, c, g, t) at each position. I seek some help to print it for the multiple alignment lines.

Test file (test.dat) has single alignment column and data file has 2 alignment column.

Output from test.dat is given below.

Test.dat file

CLUSTAL O(1.2.1) multiple sequence alignment


gnl|hbvcds|AB014370_PreC_P-A       -----------------------------------
+-------------------------
gnl|hbvcds|AB064314_PreC_P-A       -----------------------------------
+-------------------------
gnl|hbvcds|AB014384_C_P-C          -----------------------------------
+-------------------------
gnl|hbvcds|AB014385_C_P-C          -----------------------------------
+-------------------------
gnl|hbvcds|AB048701_PreS1_P-D      atggggcagaatctttccaccagcaatcctctggg
+attctttcccgaccatcagttggat
gnl|hbvcds|AB078031_PreS1_P-D      atggggcagaatctttccaccagcaaccctctggg
+attctttcccgaccaccagttggat
gnl|hbvcds|AB030513_S_P-A          -----------------------------------
+-------------------------
gnl|hbvcds|AB064314_S_P-A          -----------------------------------
+-------------------------
gnl|hbvcds|AB194947_PreS2_P-E      -----------------------------------
+-------------------------
gnl|hbvcds|AB194948_PreS2_P-E      -----------------------------------
+-------------------------
[download]

#!/usr/bin/perl
use strict;
use warnings;

my ($rna,$rnalen,$posi, $pos, $base, $dibase,$percent);
my (%bases, %occur, %count); my($max_position, $sum, $tot) = 0;
my @dinucleotide = qw(A C G T);#change the number of bases

my $file = $ARGV[0] or die "Please provide the input file\n";
my $column = $ARGV[1];
open (X, $file) || die "Check the input $file\n";
while (<X>) {
#    chomp $_;
    if ($_=~/^#/) { next; }
    if ($_=~/^\s+/) { next; }
    if ($_=~/^CLUSTAL/) { next; }
    if ($_=~/\*/) { next; }
    my @temp = split (/\s+/, $_);
    $rna = $temp[$column];
    $rna = uc $rna;
    $rnalen = length($rna);
    my @nucleotides = split (//,$rna);
    $tot++;    
    foreach my $key(@dinucleotide) {
    $bases{$key}=1;
    $sum++;    
    }
    $max_position = $rnalen if($rnalen > $max_position);
    for (my $position=0;$position<=$rnalen-1;$position++) {
    $dibase = substr($rna, $position, 1); #change the splitting number
    foreach my $dinucleotide(@dinucleotide)    {
        if ($dibase eq $dinucleotide){
        $posi = $position+1;        
        $occur{$posi,$dibase}++;
#        print "$posi\t$dibase\n";        
        }
    }
    }
}
for($pos=1;$pos<=$max_position;$pos++) {
    print "$pos\t";
    foreach $base(sort keys %bases) {
    print "$base\t";            
    if(exists $occur{$pos,$base}) {    
        my $percent = sprintf ("%0.1f", (($occur{$pos,$base})/$tot)*10
+0);
#        print "$occur{$pos,$base}\t";
        print "$percent\t";
    }
    else {
        print "0\t";
    }
    }
    print "\n";
}
close (X);
[download]

Result from test.dat

1       A       20.0    C       0       G       0       T       0
2       A       0       C       0       G       0       T       20.0
3       A       0       C       0       G       20.0    T       0
4       A       0       C       0       G       20.0    T       0
5       A       0       C       0       G       20.0    T       0
6       A       0       C       0       G       20.0    T       0
7       A       0       C       20.0    G       0       T       0
8       A       20.0    C       0       G       0       T       0
9       A       0       C       0       G       20.0    T       0
10      A       20.0    C       0       G       0       T       0
11      A       20.0    C       0       G       0       T       0
12      A       0       C       0       G       0       T       20.0
13      A       0       C       20.0    G       0       T       0
14      A       0       C       0       G       0       T       20.0
15      A       0       C       0       G       0       T       20.0
16      A       0       C       0       G       0       T       20.0
17      A       0       C       20.0    G       0       T       0
18      A       0       C       20.0    G       0       T       0
19      A       20.0    C       0       G       0       T       0
20      A       0       C       20.0    G       0       T       0
21      A       0       C       20.0    G       0       T       0
22      A       20.0    C       0       G       0       T       0
23      A       0       C       0       G       20.0    T       0
24      A       0       C       20.0    G       0       T       0
25      A       20.0    C       0       G       0       T       0
26      A       20.0    C       0       G       0       T       0
27      A       0       C       10.0    G       0       T       10.0
28      A       0       C       20.0    G       0       T       0
29      A       0       C       20.0    G       0       T       0
30      A       0       C       0       G       0       T       20.0
31      A       0       C       20.0    G       0       T       0
32      A       0       C       0       G       0       T       20.0
33      A       0       C       0       G       20.0    T       0
34      A       0       C       0       G       20.0    T       0
35      A       0       C       0       G       20.0    T       0
36      A       20.0    C       0       G       0       T       0
37      A       0       C       0       G       0       T       20.0
38      A       0       C       0       G       0       T       20.0
39      A       0       C       20.0    G       0       T       0
40      A       0       C       0       G       0       T       20.0
41      A       0       C       0       G       0       T       20.0
42      A       0       C       0       G       0       T       20.0
43      A       0       C       20.0    G       0       T       0
44      A       0       C       20.0    G       0       T       0
45      A       0       C       20.0    G       0       T       0
46      A       0       C       0       G       20.0    T       0
47      A       20.0    C       0       G       0       T       0
48      A       0       C       20.0    G       0       T       0
49      A       0       C       20.0    G       0       T       0
50      A       20.0    C       0       G       0       T       0
51      A       0       C       10.0    G       0       T       10.0
52      A       0       C       20.0    G       0       T       0
53      A       20.0    C       0       G       0       T       0
54      A       0       C       0       G       20.0    T       0
55      A       0       C       0       G       0       T       20.0
56      A       0       C       0       G       0       T       20.0
57      A       0       C       0       G       20.0    T       0
58      A       0       C       0       G       20.0    T       0
59      A       20.0    C       0       G       0       T       0
60      A       0       C       0       G       0       T       20.0
[download]

Data file

CLUSTAL O(1.2.1) multiple sequence alignment

                                   1                                  
+                        60 
gnl|hbvcds|AB014370_PreC_P-A       -----------------------------------
+-------------------------
gnl|hbvcds|AB064314_PreC_P-A       -----------------------------------
+-------------------------
gnl|hbvcds|AB014384_C_P-C          -----------------------------------
+-------------------------
gnl|hbvcds|AB014385_C_P-C          -----------------------------------
+-------------------------
gnl|hbvcds|AB048701_PreS1_P-D      atggggcagaatctttccaccagcaatcctctggg
+attctttcccgaccatcagttggat
gnl|hbvcds|AB078031_PreS1_P-D      atggggcagaatctttccaccagcaaccctctggg
+attctttcccgaccaccagttggat
gnl|hbvcds|AB030513_S_P-A          -----------------------------------
+-------------------------
gnl|hbvcds|AB064314_S_P-A          -----------------------------------
+-------------------------
gnl|hbvcds|AB194947_PreS2_P-E      -----------------------------------
+-------------------------
gnl|hbvcds|AB194948_PreS2_P-E      -----------------------------------
+-------------------------

                                   61                                 
+                        120
gnl|hbvcds|AB014370_PreC_P-A       tagagtctcctgagcattgctcacctcaccatact
+gcactcaggcaagccattctctgct
gnl|hbvcds|AB064314_PreC_P-A       tagagtctcctgagcattgctcacctcaccatacg
+gcactcaggcaagccattctctgct
gnl|hbvcds|AB014384_C_P-C          tagagtctccggaacattgttcacctcaccataca
+gcactcaggcaagctattctgtgtt
gnl|hbvcds|AB014385_C_P-C          tagagtctccggaacattgttcacctcaccataca
+gcactcaggcaagctattctgtgtt
gnl|hbvcds|AB048701_PreS1_P-D      gggtttttcttgttgacaagaatcctcacaatacc
+gcagagtctagactcgtggtggact
gnl|hbvcds|AB078031_PreS1_P-D      gggtttttcttgttgacaagaatcctcacaatacc
+gcagagtctagactcgtggtggact
gnl|hbvcds|AB030513_S_P-A          gggtttttcttgttgacaagaatcctcacaatacc
+gcagagtctagactcgtggtggact
gnl|hbvcds|AB064314_S_P-A          gggtttttcttgttgacaagaatcctcacaatacc
+gcagagtctagactcgtggtggact
gnl|hbvcds|AB194947_PreS2_P-E      gggtttttcttgttgacaaaaatcctcacaatacc
+gcagagtctagactcgtggtggact
gnl|hbvcds|AB194948_PreS2_P-E      gggtttttcttgttgacaaaaatcctcacaatacc
+gcagagtctagactcgtggtggact
                                     *  * **  *   *       ****** **** 
+***        *    *  *    *
[download]

I am trying to get an result with all the position and occurrence of nucleotides similar to the test.dat result. Any help is greatly acknowledged

Regards

Comment on Alignment and matrix generation Select or Download Code

Replies are listed 'Best First'.
Re: Alignment and matrix generation by choroba (Cardinal) on Mar 28, 2015 at 01:19 UTC
Do you expect the output like the following? `1 A 20.0 C 0.0 G 0.0 T 0.0 2 A 0.0 C 0.0 G 0.0 T 20.0 3 A 0.0 C 0.0 G 20.0 T 0.0 4 A 0.0 C 0.0 G 20.0 T 0.0 5 A 0.0 C 0.0 G 20.0 T 0.0 ... 115 A 0.0 C 0.0 G 0.0 T 100.0 116 A 0.0 C 20.0 G 80.0 T 0.0 117 A 0.0 C 0.0 G 60.0 T 40.0 118 A 60.0 C 0.0 G 40.0 T 0.0 119 A 0.0 C 80.0 G 0.0 T 20.0 120 A 0.0 C 0.0 G 0.0 T 100.0` [download] It was generated by the following script: Read more... (5 kB) لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ	[reply] [d/l] [select]
Re^2: Alignment and matrix generation by newtoperlprog (Sexton) on Mar 30, 2015 at 16:44 UTC
Dear Choroba, Thank you for your time and helping me with the code to develop matrix for the data. However, the data file should not be having number 1 , 60 , this was provided by me for the clarification that its 60 column wide. Actual data file is like this: CLUSTAL O(1.2.1) multiple sequence alignment + gnl\|hbvcds\|AB014370_PreC_P-A ----------------------------------- +------------------------- gnl\|hbvcds\|AB064314_PreC_P-A ----------------------------------- +------------------------- gnl\|hbvcds\|AB014384_C_P-C ----------------------------------- +------------------------- gnl\|hbvcds\|AB014385_C_P-C ----------------------------------- +------------------------- gnl\|hbvcds\|AB048701_PreS1_P-D atggggcagaatctttccaccagcaatcctctggg +attctttcccgaccatcagttggat gnl\|hbvcds\|AB078031_PreS1_P-D atggggcagaatctttccaccagcaaccctctggg +attctttcccgaccaccagttggat gnl\|hbvcds\|AB030513_S_P-A ----------------------------------- +------------------------- gnl\|hbvcds\|AB064314_S_P-A ----------------------------------- +------------------------- gnl\|hbvcds\|AB194947_PreS2_P-E ----------------------------------- +------------------------- gnl\|hbvcds\|AB194948_PreS2_P-E ----------------------------------- +------------------------- + + gnl\|hbvcds\|AB014370_PreC_P-A tagagtctcctgagcattgctcacctcaccatact +gcactcaggcaagccattctctgct gnl\|hbvcds\|AB064314_PreC_P-A tagagtctcctgagcattgctcacctcaccatacg +gcactcaggcaagccattctctgct gnl\|hbvcds\|AB014384_C_P-C tagagtctccggaacattgttcacctcaccataca +gcactcaggcaagctattctgtgtt gnl\|hbvcds\|AB014385_C_P-C tagagtctccggaacattgttcacctcaccataca +gcactcaggcaagctattctgtgtt gnl\|hbvcds\|AB048701_PreS1_P-D gggtttttcttgttgacaagaatcctcacaatacc +gcagagtctagactcgtggtggact gnl\|hbvcds\|AB078031_PreS1_P-D gggtttttcttgttgacaagaatcctcacaatacc +gcagagtctagactcgtggtggact gnl\|hbvcds\|AB030513_S_P-A gggtttttcttgttgacaagaatcctcacaatacc +gcagagtctagactcgtggtggact gnl\|hbvcds\|AB064314_S_P-A gggtttttcttgttgacaagaatcctcacaatacc +gcagagtctagactcgtggtggact gnl\|hbvcds\|AB194947_PreS2_P-E gggtttttcttgttgacaaaaatcctcacaatacc +gcagagtctagactcgtggtggact gnl\|hbvcds\|AB194948_PreS2_P-E gggtttttcttgttgacaaaaatcctcacaatacc +gcagagtctagactcgtggtggact * * ** * * **** +* * * * * [download] Please let me know where I can modify the code to incorporate the changes. Regards	[reply] [d/l]
Re^3: Alignment and matrix generation by choroba (Cardinal) on Mar 30, 2015 at 20:59 UTC
Just few changes: --- 1.pl 2015-03-30 22:58:28.434954333 +0200 +++ 2.pl 2015-03-30 22:58:11.261833250 +0200 @@ -3,16 +3,18 @@ use warnings; use Syntax::Construct qw{ // }; -my ($startpos, $endpos, $count); +my $endpos = 0; +my ($startpos, $count); my %occurrences; while (<DATA>) { - if (/^\s+ ([0-9]+) \s+ ([0-9]+) \s$/x) { - ($startpos, $endpos) = ($1, $2); + if (/^ +$/) { + $startpos = $endpos + 1; $count = 0; } elsif (/\s+ ([-actg]+) \s$/x) { ++$count; my @nucleotides = split //, $1; + $endpos = $endpos + length $1 if $startpos == $endpos + 1; for my $pos (0 .. $#nucleotides) { ++$occurrences{ $nucleotides[$pos] }[$startpos + $pos] unless '-' eq $nucleotides[$pos]; [download] لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ	[reply] [d/l]
Re^4: Alignment and matrix generation by newtoperlprog (Sexton) on Mar 31, 2015 at 15:45 UTC
Re^5: Alignment and matrix generation by choroba (Cardinal) on Mar 31, 2015 at 15:57 UTC
Some notes below your chosen depth have not been shown here