Count # of occurrences per column

Renyulb28 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, I have DNA data for a number of individuals at a few thousand markers. The data is in a matrix format with the rows being individuals and columns being markers. For example:

3851 A A G G T T A A C C ...
3854 A A G G T T A A C C ...
3871 A A G G T T A A G G ...

The first column is the individual ID, and each column after that is a marker observation. I need to count the number of occurrences of each character per column, and then output a text file such as:

column 2
3 A
column 10
2 C 1 G

I only know how to do this one column at a time using the Unix grep command, but that would be extremely time-consuming and unnecessary. The output does not need to be as specific as I have put, as long as it gives the column number and how many occurrences of each character. Thank you for any advice.


Re: Count # of occurrences per column by ikegami (Patriarch) on May 11, 2011 at 17:58 UTC
You want to count the number of instances of each character (a hash) for each column (an array), so you want an AoH (array of hashes).

    my @counts;
    while (<>) {
        my ($id, @fields) = split;
        for my $col_num (0..$#fields) {
            ++$counts[$col_num]{ $fields[$col_num] };
        }
    }

Outputting in the desired format (sorted by descending count) is tricky, though.

    use feature 'say';

    for my $col_num (0..$#counts) {   # for my $col_num (0, 8) { ??
        my $col_counts = $counts[$col_num];
        say "column ", $col_num + 2;

        # Group the characters by their count.
        my %by_count;
        for (keys %$col_counts) {
            my $count = $col_counts->{$_};
            push @{ $by_count{$count} }, $_;
        }

        # Highest count first; characters with equal counts sorted by name.
        say join ' ',
            map { my $n = $_; map { "$n $_" } sort @{ $by_count{$n} } }
            sort { $b <=> $a } keys %by_count;
    }

Update: Fixed off-by-one in column number.
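For anyone who wants to try the AoH idea end to end, here is a minimal self-contained sketch. It is not the code from the reply above: the sub names `count_columns` and `format_column` are made up for illustration, the sample rows are the three from the question, and breaking ties alphabetically is an assumption the OP never stated. It sorts the hash keys directly by their counts instead of building a second hash keyed by count.

```perl
use strict;
use warnings;

# Count how often each character occurs in each marker column.
# Returns an array of hashes: $counts[$i]{$char} = number of occurrences.
sub count_columns {
    my ($fh) = @_;
    my @counts;
    while (my $line = <$fh>) {
        my ($id, @fields) = split ' ', $line;
        next unless defined $id;            # skip blank lines
        $counts[$_]{ $fields[$_] }++ for 0 .. $#fields;
    }
    return @counts;
}

# Format one column's counts as "3 A" or "2 C 1 G":
# highest count first, ties broken alphabetically (an assumption).
sub format_column {
    my ($col_counts) = @_;
    return join ' ',
        map  { "$col_counts->{$_} $_" }
        sort { $col_counts->{$b} <=> $col_counts->{$a} or $a cmp $b }
        keys %$col_counts;
}

# Demo on the three sample rows from the question.
my $sample = <<'END';
3851 A A G G T T A A C C
3854 A A G G T T A A C C
3871 A A G G T T A A G G
END
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my @counts = count_columns($fh);
close $fh;

for my $i (0 .. $#counts) {
    print "column ", $i + 2, "\n";          # +2: the ID is column 1
    print format_column($counts[$i]), "\n";
}
```

To run it on a real file, open that file instead of the in-memory string and pass the handle to `count_columns`. Memory use grows with the number of columns and distinct characters, not with the number of individuals.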
Re: Count # of occurrences per column by wind (Priest) on May 11, 2011 at 17:48 UTC
    my @cols;
    while (<DATA>) {
        chomp;
        my ($id, @markers) = split;
        for my $i (0..$#markers) {
            $cols[$i]{$markers[$i]}++;
        }
    }

    for my $i (0..$#cols) {
        print "column $i\n";
        print "$_ $cols[$i]{$_} " for sort keys %{$cols[$i]};
        print "\n";
    }

    __DATA__
    3851 A A G G T T A A C C
    3854 A A G G T T A A C C
    3871 A A G G T T A A G G
Re^2: Count # of occurrences per column by ikegami (Patriarch) on May 11, 2011 at 18:07 UTC
The column number is off by two in your output. The order of count and character is reversed in your output. And while it's not clear from the OP, it looks to me like the output should be sorted by descending count.
Re^3: Count # of occurrences per column by wind (Priest) on May 11, 2011 at 18:40 UTC
True, but ... from the OP: "The output does not need to be as specific as I have put, as long as it gives the column number and how many occurrences of each character."

I prefer to simply demonstrate general concepts and not distract with needless or potentially implied minutiae. Obviously it's easy enough for the OP to add 2 to the column number if that's what he desires, along with adding sorting or pretty printing.

Shrug. Good thing I'm not a perfectionist ;)
Re: Count # of occurrences per column by LanX (Saint) on May 12, 2011 at 09:53 UTC
just a meditation about printing in descending order...

    use strict;
    use warnings;

    my $ncol;
    while (<DATA>){
        my ($id,@acgt)=split;
        my (%count,@reports);

        # -- count ACGT
        $count{$_}++ for (@acgt) ;

        # -- print in descending order
        while( my ($marker,$number) = each %count ) {
            push @reports, "$number $marker";
        }
        print "Column :",$ncol++,"\n";
        no warnings "numeric";
        print join ("\t",sort {$b <=>$a} @reports),"\n";
    }

    __DATA__
    3851 A A G G T T A A C C
    3854 A A G G T T A A C C
    3871 A A G G T T A A G G

OUTPUT:

    Column :0
    4 A    2 T    2 C    2 G
    Column :1
    4 A    2 T    2 C    2 G
    Column :2
    4 A    4 G    2 T

Cheers Rolf
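The descending sort above works by comparing strings like "4 A" numerically, which is why it needs `no warnings "numeric"`. A possible variation, sketched here under my own assumptions, keeps each count and marker as a pair, sorts on the numeric count, and only builds the strings afterwards. The sub name `descending_report` is hypothetical, and breaking ties alphabetically by marker is my choice; the code above leaves tie order to the hash.

```perl
use strict;
use warnings;

# Return "count marker" strings, highest count first.
# Ties are broken alphabetically by marker (an assumption; the
# original leaves tie order to the hash).
sub descending_report {
    my ($count) = @_;                       # hashref: marker => count
    my @pairs = map { [ $count->{$_}, $_ ] } keys %$count;
    return map  { "$_->[0] $_->[1]" }
           sort { $b->[0] <=> $a->[0] or $a->[1] cmp $b->[1] }
           @pairs;
}

my %count = (A => 4, C => 2, G => 2, T => 2);   # counts for the first sample row
print join("\t", descending_report(\%count)), "\n";
```

With the counts shown, this prints 4 A, 2 C, 2 G and 2 T separated by tabs, and no warnings need to be switched off.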