Re: To count letters (%identity) in DNA alignment

Build a hash of arrays, one array per letter, where each array holds that letter's count at every column of the alignment:

use strict;
use warnings;

my %frequencies;
my $maxCol = 0;

while (<DATA>) {
    chomp;

    my ($name, $seq) = split;
    
    next unless defined $seq;
    
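    # One element per alignment column.
    my @letters = split '', $seq;
    
    # Count each letter at its column index.
    ++$frequencies{$letters[$_]}[$_] for 0 .. $#letters;
    # Remember the last column index of the longest sequence seen so far.
    $maxCol = $#letters if $maxCol < $#letters;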
}

# Fill columns where a letter never appeared with 0, then print one row per letter.
for my $letter (qw(A T G C)) {
    $frequencies{$letter}[$_] ||= 0 for 0 .. $maxCol;
    print "$letter @{$frequencies{$letter}}\n";
}

__DATA__
fred  ATGTTGTAT
fred1 ATCTTATAT
fred2 ATCTTATAT

Prints:

A 3 0 0 0 0 2 0 3 0
T 0 3 0 3 3 0 3 0 3
G 0 0 1 0 0 1 0 0 0
C 0 0 2 0 0 0 0 0 0
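
If what you are after is a per-column %identity, one way to get there from these counts is sketched below. It assumes "identity" means the share of sequences that carry the most common letter in a column, which may not be the definition you need. The `column_identity` helper is a hypothetical name, and the counts are copied by hand from the output above.

```perl
use strict;
use warnings;

# Per-column letter counts, copied from the printed table above.
my %frequencies = (
    A => [3, 0, 0, 0, 0, 2, 0, 3, 0],
    T => [0, 3, 0, 3, 3, 0, 3, 0, 3],
    G => [0, 0, 1, 0, 0, 1, 0, 0, 0],
    C => [0, 0, 2, 0, 0, 0, 0, 0, 0],
);

# Hypothetical helper: for each column, return the percentage of sequences
# that share the most common letter (an assumed definition of %identity).
sub column_identity {
    my ($freq) = @_;
    my @letters = sort keys %$freq;
    my $last    = $#{ $freq->{ $letters[0] } };
    my @identity;

    for my $col (0 .. $last) {
        my ($total, $max) = (0, 0);
        for my $letter (@letters) {
            my $n = $freq->{$letter}[$col] || 0;
            $total += $n;
            $max = $n if $n > $max;
        }
        # Guard against an empty column to avoid dividing by zero.
        push @identity, $total ? 100 * $max / $total : 0;
    }
    return @identity;
}

my @identity = column_identity(\%frequencies);
print join(' ', map { sprintf '%.1f', $_ } @identity), "\n";
```

For the three sample sequences this reports 100.0 for every column except the third and sixth, which come out at 66.7.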

Perl's payment curve coincides with its learning curve.

Re^2: To count letters (%identity) in DNA alignment
by graff (Chancellor) on Jan 27, 2009 at 01:32 UTC

Minor nit-pick: I'd want to include some defensive programming, because with 10K+ records to go through, there's (almost) no such thing as being too careful. And while we're at it, if we're ready to deal with unsuitable data, we might as well point out when it happens:

...
my $bad_count = 0;
while (<...>) {
    chomp;
    my ( $name, $seq ) = split;
    unless ( $seq and $seq =~ /^[ACGT]+$/ ) {
        $bad_count++;
        next;
    }
    ...
}
warn "Input file had $bad_count unusable lines\n";
...

For that matter, if all the records are supposed to have the same number of letters, add that as part of the conditional.
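
A minimal sketch of that length check folded into the same conditional might look like this. It assumes the first usable record defines the expected length; `$expected_len`, `@lines` and `@good` are hypothetical names, and the sample records are invented to exercise each failure case.

```perl
use strict;
use warnings;

# Invented sample records: the third is too short, the fourth contains a
# non-ACGT letter, and the last is an empty line.
my @lines = (
    "fred  ATGTTGTAT",
    "fred1 ATCTTATAT",
    "fred2 ATCTT",
    "fred3 ATNTTATAT",
    "",
);

my $expected_len;    # taken from the first usable record (an assumption)
my $bad_count = 0;
my @good;

for my $line (@lines) {
    my ( $name, $seq ) = split ' ', $line;

    # Reject missing sequences, unexpected letters, and length mismatches.
    # The regex guarantees a non-empty sequence, so ||= is safe for the length.
    unless ( $seq
        and $seq =~ /^[ACGT]+$/
        and length($seq) == ( $expected_len ||= length $seq ) )
    {
        $bad_count++;
        next;
    }
    push @good, $name;
}

warn "Input had $bad_count unusable lines\n" if $bad_count;
print "Usable records: @good\n";
```

With these records, fred and fred1 are kept and the other three lines are counted as unusable. If the first record itself might be the odd one out, picking the expected length some other way (for example, the most common length) would be more robust.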
