in reply to To count letters (%identity) in DNA alignment

Build a hash of arrays - one array for each letter:

use strict;
use warnings;

my %frequencies;    # letter => array of per-column counts
my $maxCol = 0;     # highest column index seen so far

while (<DATA>) {
    chomp;
    my ($name, $seq) = split;
    next unless defined $seq;

    my @letters = split '', $seq;

    # Count each letter at its column position
    ++$frequencies{$letters[$_]}[$_] for 0 .. $#letters;
    $maxCol = $#letters if $maxCol < $#letters;
}

for my $letter (qw"A T G C") {
    # Fill in zeros for columns where this letter never appeared
    $frequencies{$letter}[$_] ||= 0 for 0 .. $maxCol;
    print "$letter @{$frequencies{$letter}}\n";
}

__DATA__
fred ATGTTGTAT
fred1 ATCTTATAT
fred2 ATCTTATAT

Prints:

A 3 0 0 0 0 2 0 3 0
T 0 3 0 3 3 0 3 0 3
G 0 0 1 0 0 1 0 0 0
C 0 0 2 0 0 0 0 0 0

Perl's payment curve coincides with its learning curve.

Re^2: To count letters (%identity) in DNA alignment
by graff (Chancellor) on Jan 27, 2009 at 01:32 UTC
    Minor nit-pick -- I'd want to include some defensive programming, because with 10K+ records to go through, there's (almost) no such thing as being too careful.

    And while we're at it, if we're ready to deal with unsuitable data, we might as well point out when it happens:

        ...
        my $bad_count = 0;
        while (<...>) {
            chomp;
            my ( $name, $seq ) = split;
            unless ( $seq and $seq =~ /^[ACGT]+$/ ) {
                $bad_count++;
                next;
            }
            ...
        }
        warn "Input file had $bad_count unusable lines\n";
        ...
    For that matter, if all the records are supposed to have the same number of letters, add that as part of the conditional.
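
    A minimal sketch of what that combined check might look like, folded into the counting loop from the parent post. The helper name count_columns is made up for illustration, and treating the first usable record as the one that defines the expected length is an assumption; a real script might instead take the expected length as a parameter.

```perl
use strict;
use warnings;

# Hypothetical helper: tally per-column letter counts, skipping bad records.
# Assumption: the first usable record defines the expected alignment width.
sub count_columns {
    my @lines = @_;
    my ( %frequencies, $width );
    my $bad_count = 0;

    for my $line (@lines) {
        chomp $line;
        my ( $name, $seq ) = split ' ', $line;

        # Reject records with a missing sequence or letters other than ACGT
        unless ( defined $seq and $seq =~ /^[ACGT]+$/ ) {
            $bad_count++;
            next;
        }

        # Reject records whose length differs from the first usable one
        $width = length($seq) unless defined $width;
        if ( length($seq) != $width ) {
            $bad_count++;
            next;
        }

        my @letters = split //, $seq;
        ++$frequencies{ $letters[$_] }[$_] for 0 .. $#letters;
    }

    # Fill in zeros for letters never seen at a given column
    if ( defined $width ) {
        for my $letter (qw(A T G C)) {
            $frequencies{$letter}[$_] ||= 0 for 0 .. $width - 1;
        }
    }
    return ( \%frequencies, $bad_count );
}

# Three good records plus three that should be rejected:
# wrong length, an unexpected letter, and a missing sequence.
my ( $freq, $bad ) = count_columns(
    "fred ATGTTGTAT",
    "fred1 ATCTTATAT",
    "fred2 ATCTTATAT",
    "short ATCT",
    "odd ATNTTATAT",
    "nameonly",
);

print "$_ @{ $freq->{$_} }\n" for qw(A T G C);
warn "Input had $bad unusable lines\n" if $bad;
```

    With the sample records above, the three malformed lines are counted as unusable and the printed table matches the one in the parent post.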