Renyulb28 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, I have DNA data for a number of individuals at a few thousand markers. The data is in a matrix format with the rows being individuals and columns being markers. For example:

3851 A A G G T T A A C C ...
3854 A A G G T T A A C C ...
3871 A A G G T T A A G G ...

The first column is the individual ID, and each column after that is a marker observation. I need to count the number of occurrences of each character per column, and then output a text file such as:

column 2
3 A
column 10
2 C 1 G

I only know how to do this one column at a time using the Unix grep command, but that would be extremely time-consuming and unnecessary. The output does not need to be as specific as I have put, as long as it gives the column number and how many occurrences of each character. Thank you for any advice.

Replies are listed 'Best First'.
Re: Count # of occurrences per column
by ikegami (Patriarch) on May 11, 2011 at 17:58 UTC

    You want to count the number of instances of each character (hash) for each column (array), so you want an AoH (array of hashes).

    my @counts;
    while (<>) {
        my ($id, @fields) = split;
        for my $col_num (0..$#fields) {
            ++$counts[$col_num]{ $fields[$col_num] };
        }
    }
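    To make the array-of-hashes shape concrete, here is a minimal sketch that wraps the same counting loop in a hypothetical helper, count_columns (the name is not from the reply above), and inspects the result for the three sample rows from the question. It assumes whitespace-separated fields with the ID first, so array index 0 corresponds to column 2 of the file.

```perl
use strict;
use warnings;

# Hypothetical helper: takes input lines and returns a reference to an
# array of hashes, one hash of character => count per marker column.
sub count_columns {
    my @lines = @_;
    my @counts;
    for my $line (@lines) {
        my ($id, @fields) = split ' ', $line;
        for my $col_num (0..$#fields) {
            ++$counts[$col_num]{ $fields[$col_num] };
        }
    }
    return \@counts;
}

# Sample rows from the question (cut off where the question has "...").
my @sample = (
    '3851 A A G G T T A A C C',
    '3854 A A G G T T A A C C',
    '3871 A A G G T T A A G G',
);

my $counts = count_columns(@sample);

# Index 0 is file column 2; index 8 is file column 10.
for my $col_num (0, 8) {
    my $col = $counts->[$col_num];
    print "column ", $col_num + 2, ": ",
        join(', ', map { "$_ => $col->{$_}" } sort keys %$col), "\n";
}
```

    For the sample rows this prints "column 2: A => 3" and "column 10: C => 2, G => 1", which matches the tallies asked for in the question.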

    Outputting in the desired format (sorted by descending count) is tricky, though.

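    use feature 'say';

    for my $col_num (0..$#counts) {
    # for my $col_num (0, 8) { ??
        my $col_counts = $counts[$col_num];
        say("column ", $col_num+2);   # index 0 is file column 2

        # Group the characters by count so that ties share one entry.
        my %by_count;
        for (keys(%$col_counts)) {
            my $count = $col_counts->{$_};
            push @{ $by_count{$count} }, $_;
        }

        # Highest count first, each count followed by its characters.
        say join ' ',
            map { "$_ " . join(',', sort @{ $by_count{$_} }) }
            sort { $b <=> $a } keys(%by_count);
    }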

    Update: Fixed off-by-one in column number.

Re: Count # of occurrences per column
by wind (Priest) on May 11, 2011 at 17:48 UTC
    my @cols;

    while (<DATA>) {
        chomp;
        my ($id, @markers) = split;
        for my $i (0..$#markers) {
            $cols[$i]{$markers[$i]}++;
        }
    }

    for my $i (0..$#cols) {
        print "column $i\n";
        print "$_ $cols[$i]{$_} " for sort keys %{$cols[$i]};
        print "\n";
    }

    __DATA__
    3851 A A G G T T A A C C
    3854 A A G G T T A A C C
    3871 A A G G T T A A G G

      The column number is off by two in your output, and the count and character are in the reverse order. And while it's not clear from the OP, it looks to me like the output should be sorted by descending count.

        True, but ... from OP:

        The output does not need to be as specific as I have put, as long as it gives the column number and how many occurrences of each character.

        I prefer to simply demonstrate general concepts and not distract with needless or potentially implied minutiae. Obviously it's easy enough for the OP to add 2 to the column number if that's what he desires, along with adding sorting or pretty printing.

        *Shrug* Good thing I'm not a perfectionist ;)
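        For readers who do want those adjustments, here is a small sketch of one way to make them: add 2 to the array index, print the count before the character, and sort by descending count. The helper name report_column is hypothetical, and breaking ties alphabetically is an assumption, since the question does not say how ties should be ordered.

```perl
use strict;
use warnings;

# Hypothetical helper: format one column's tallies as "count char" pairs,
# highest count first; ties are broken alphabetically (an assumption).
sub report_column {
    my ($index, $tally) = @_;
    my @chars = sort { $tally->{$b} <=> $tally->{$a} or $a cmp $b }
                keys %$tally;
    return "column " . ($index + 2) . "\n"
         . join(' ', map { "$tally->{$_} $_" } @chars) . "\n";
}

# Tallies as the counting loop leaves them for array index 8
# (file column 10) of the sample data.
print report_column(8, { C => 2, G => 1 });
```

        With the sample tallies this prints "column 10" followed by "2 C 1 G". In the script above, the body of the output loop could call it as report_column($i, $cols[$i]).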

Re: Count # of occurrences per column
by LanX (Saint) on May 12, 2011 at 09:53 UTC
    just a meditation about printing in descending order...

    use strict;
    use warnings;

    my $ncol;
    while (<DATA>){
        my ($id,@acgt)=split;
        my (%count,@reports);

        # -- count ACGT
        $count{$_}++ for (@acgt) ;

        # -- print in descending order
        while( my ($marker,$number) = each %count ) {
            push @reports, "$number $marker";
        }

        print "Column :",$ncol++,"\n";
        no warnings "numeric";
        print join ("\t",sort {$b <=>$a} @reports),"\n";
    }

    __DATA__
    3851 A A G G T T A A C C
    3854 A A G G T T A A C C
    3871 A A G G T T A A G G

    OUTPUT:

    Column :0
    4 A 2 T 2 C 2 G
    Column :1
    4 A 2 T 2 C 2 G
    Column :2
    4 A 4 G 2 T

    Cheers Rolf
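    The numeric sort in the script above works because Perl converts a string such as "4 A" to the number 4 by reading its leading digits, which is also why the "numeric" warning has to be switched off. A minimal sketch of that behaviour, with the pragma confined to a small block; the report strings here are made up for illustration and include a two-digit count to show why a numeric comparison matters:

```perl
use strict;
use warnings;

# Illustrative report strings in "count char" form.
my @reports = ('2 T', '4 A', '10 G', '2 C');

my @sorted;
{
    # Lexically scoped: the warning is silenced only inside this block.
    no warnings 'numeric';
    # Each string numifies to its leading number: 2, 4, 10, 2.
    @sorted = sort { $b <=> $a } @reports;
}

print join("\t", @sorted), "\n";
```

    The "10 G" entry comes first and "4 A" second; the relative order of the two entries with a count of 2 is left to the sort implementation. A string comparison (cmp) would instead treat "10 G" as smaller than "2 C".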