Finding Consensus String

monkfan has asked for the wisdom of the Perl Monks concerning the following question:

My most respected monks,
I am looking a way to find consensus string from given set of strings (length is assumed to be the same within the set). Basically consensus string mean single string that represent most common occurence from the given set. Here is the example of it:

           Set 1:                         Set 2:
           AAAAA                          AAAAA
           ATCGA                          AACGA
           ATAAA                          ATAAA
                                          ATAAA
Consensus: ATAAA             Consensus:  A[AT]AAA

Note: like Set 2, there can be other cases that has
more than two consensus in a particular position.
They need to be captured within square bracket [].
[download]

Below is my incredibly naive, inefficient and inflexible code. There are two issues which I still unable to solve here:

the need to have a fast implementation of it
the need to capture those cases with multiple consensus, those to be captured by square bracket [ATCG]

So here I am again, humbly seeking for your wisdom.

My current code:

use strict;
use warnings;

my @instances = qw ( 
                     AAAAA
                     ATCGA
                     ATAAA
                 );
my @instances2 = qw ( 
                     AAAAA
                     AACGA
                     ATAAA
                     ATAAA
                 );

print consensus(@instances),"\n";

sub consensus{
 my @mi = @_;
 my $motif_count=0;
 my @words =();
 my $L = undef;

  my  @A = ();
  my  @T = ();
  my  @C = ();
  my  @G = ();

  my  $a = 0;
  my  $c = 0;
  my  $g = 0;
  my  $t = 0;
  my  $w = 0;

    foreach my $mi ( @mi ) {
        chomp($mi);
        $mi =~ s/\s//g;

        $w = length($mi);

        #for motif instances count
        @words = split( /\W+/, $mi );
        $motif_count += @words;
      }

         for ( my $j = 0 ; $j < $w ; $j++ ) {

        # Initialize the base counts.
        my $L =0;
        foreach my $mi ( @mi ) {
            chomp($mi);
            my $L = $mi;
            my $sb = substr( $L, $j, 1 );
            while ( $sb =~ /a/ig ) { $a++ }
            while ( $sb =~ /t/ig ) { $t++ }
            while ( $sb =~ /c/ig ) { $c++ }
            while ( $sb =~ /g/ig ) { $g++ }

        }
        push( @A, $a );
        push( @T, $t );
        push( @C, $c );
        push( @G, $g );

        $a = 0;
        $c = 0;
        $g = 0;
        $t = 0;

    }

    my @cons = ();
    #print "$w\n";
    for (my $b=0; $b <$w ;$b++) {

        if ( $A[$b] > $T[$b] && $A[$b] > $C[$b] && $A[$b] > $G[$b] ) {
            push( @cons, 'A');
        }

        elsif ( $T[$b] > $C[$b] && 
                $T[$b] > $A[$b] && 
                $T[$b] > $G[$b] ) {
            push( @cons, 'T' );
        }

        elsif ( $C[$b] > $G[$b] && 
                $C[$b] > $A[$b] && 
                $C[$b] > $T[$b] ) {
            push( @cons, 'C' );
        }
      elsif ( $G[$b] > $A[$b] && 
                $G[$b] > $C[$b] && 
                $G[$b] > $T[$b] ) {
            push( @cons, 'G' );
        }
        elsif ( $A[$b] = $T[$b] ) {
            push( @cons, 'T' );
        }
        elsif ( $A[$b] = $G[$b] ) {
            push( @cons, 'G' );
        }
        elsif ( $A[$b] = $C[$b] ) {
            push( @cons, 'C' );
        }
        elsif ( $T[$b] = $C[$b] ) {
            push( @cons, 'C' );
        }
        elsif ( $T[$b] = $G[$b] ) {
            push( @cons, 'G' );
        }
        else { push @cons, 'G'; }


    }

    return @cons;
}
[download]

Regards,
Edward

Comment on Finding Consensus String Select or Download Code

Replies are listed 'Best First'.
Re: Finding Consensus String by davidrw (Prior) on Oct 18, 2005 at 13:11 UTC
before addressing the core issue, i see a couple immediate code items: avoid using `$a` and `$b` as variable names -- they have special meaning (see sort) In all of these: `elsif ( $A[$b] = $T[$b] ) {` you are using the assignment operator instead of the equality operator ==, so this code will not do what you want.. not wrong as-is, but note it's "more perl-ish" to write loops like (multiple ways to write this): `foreach my $b ( 0 .. $w-1 ){` i'll update shortly after looking at the actual issue.. my initial reaction is that the solution probably involves hashes instead of arrays (and maybe a function for "find-the-max" instead of that if/else ladder).	[reply] [d/l] [select]
Re: Finding Consensus String by japhy (Canon) on Oct 18, 2005 at 13:39 UTC
Since you're dealing with frequency, I'd use a hash to store the occurrences of A, C, G, and T in each position of the word. I'd loop over the DNA sequences for each character, looking at character 1 first in all strands, then character 2, etc. `sub consensus { my $len = length($_[0]); # all strands are of one length, right? my @cons; # loop over the positions in the strands for my $i (0 .. $len - 1) { my %freq; my $max = 0; # loop over character $i in each of the strands for (map substr($_, $i, 1), @_) { $freq{$_}++; # save the max value $max = $freq{$_} if $freq{$_} > $max; } # store only those who occur the most times push @cons, [grep $freq{$_} == $max, keys %freq]; } return \@cons; }` [download] Then it's your task to turn the array of array references into the string you want. Jeff `japhy` Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and `perl` hacker How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart	[reply] [d/l]
Re^2: Finding Consensus String by monkfan (Curate) on Oct 19, 2005 at 16:31 UTC
Dear japhy, Thanks so much for your answer to my posting. I just have a question. Your code works well in generalize version of array like this: `my @instances = ( 'AAAAA ATCGA', 'ATCGA AAAAA', 'ATAAA AAAAA', );` [download] But how can I make it more generalize for cases like this? `my @instances2 = ( 'AAAAA ATCGA', 'ATCGA', 'ATAAA AAAAA', ); # consensus : ATAAA A[AT][CA][GA]A` [download] Or this: `my @instances3 = ( 'AAAAA ATCGA TTTT', 'ATCGA TATA ', 'ATAAA AAAAA ATTA ', ); # consensus : ATAAA A[AT][CA][GA]A TTTA` [download] Cause, currently it gives warning when encountering two cases like that. Regards, Edward	[reply] [d/l] [select]
Re^3: Finding Consensus String by japhy (Canon) on Oct 19, 2005 at 19:18 UTC
I don't get any warnings. What I do notice is that space is included in the `[...]` parts. So here's a small fix: `sub consensus { ... # code from before, until: # store only those who occur the most times my @most = grep $freq{$_} == $max, keys %freq; @most = grep $_ ne " ", @most if @most > 1; push @cons, \@most; } return \@cons; }` [download] And to get the string version, I use: `sub consensus_to_string { my $con = shift; my $str; for (@$con) { if (@$_ > 1) { $str .= "[" . join("", @$_) . "]" } else { $str .= $_->[0] } } return $str; }` [download] Jeff `japhy` Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and `perl` hacker How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart	[reply] [d/l] [select]
Re: Finding Consensus String by davidrw (Prior) on Oct 18, 2005 at 13:48 UTC
This doesn't change your functionality at all (i don't think), but does shorten all the code up and hopefully makes it much easier to modify from here.. Basically it makes use of hashes instead of having to declare a bunch of variables (e.g. $a, $g, $t, $c) so that everything is easier to work with. One thing i didn't understand was the `$w` usage -- i assumed in my code that you want $w to be the length of the longest item, but your code was setting it to be the length of the last item... use strict; use warnings; my @instances = qw ( AAAAA ATCGA ATAAA ); my @instances2 = qw ( AAAAA AACGA ATAAA ATAAA ); print consensus(@instances),"\n"; # ATAAA print consensus(@instances2),"\n"; # ATAAA exit; sub consensus{ my @mi = @_; chomp(@mi); my $motif_count=0; my @words =(); my %H = ( A=>[], T=>[], C=>[], G=>[] ); s/\s//g for @mi; my ($w) = sort {$b <=> $a} map {length} @mi; # set w to the lengt +h of the longest element foreach my $j ( 0 .. $w-1 ){ # Initialize the base counts. my %h = ( a=>0, t=>0, c=>0, g=>0 ); foreach my $mi ( @mi ) { $h{ lc substr($mi,$j,1) }++; # example: $h{g}++ +; } push @{$H{ uc $_ }}, $h{$_} for keys %h; # example: push @{ +$H{G}}, $g; } my @cons = (); my %prefOrder = ( A=>1, T=>2, C=>3, G=>4 ); foreach my $B ( 0 .. $w-1 ){ push @cons, [ sort { $H{$b}->[$B] <=> $H{$a}->[$B] \|\| $prefOrder +{$b} <=> $prefOrder{$a} } qw/A T G C/ ]->[0]; } return @cons; } [download] Update: more efficient way to set `%H` would be: `my %H = ( A=>[], T=>[], C=>[], G=>[] ); my @mi_letters = map { [split '', uc $_] } @mi; foreach my $j ( 0 .. $w-1 ){ $H{ $_->[$j] }->[$j]++ for @mi_letters; } # note: to avoid warnings, change the sort further down to include t +his (the '\|\|0'): # ($H{$b}->[$B]\|\|0) <=> ($H{$a}->[$B]\|\|0)` [download]	[reply] [d/l] [select]
Re: Finding Consensus String by Fletch (Bishop) on Oct 18, 2005 at 13:50 UTC
Just a WAG, but given the problem domain I'd be surprised if BioPerl didn't have something available already to do this. You might check its docs first before going further on your own.	[reply]
Re^2: Finding Consensus String by PreferredUserName (Pilgrim) on Oct 18, 2005 at 14:27 UTC
And if you do go looking for this, is the word "consensus" your own coinage, or does it have a technical meaning for this operation? Because "consensus" usually suggests "unanimity", while what you're looking for here is "most popular", like the mathematical mode. "Mode strings", perhaps?	[reply]
Re^3: Finding Consensus String by herveus (Prior) on Oct 18, 2005 at 17:12 UTC
Howdy! Consensus does not imply unanimity. It does imply that the group as a whole agrees with the position, and therefore that no one strenuously objects. It only takes one person to destroy consensus. I might not agree with the consensus, but I might not feel strongly enough about it to block the consensus. yours, Michael	[reply]
Re^3: Finding Consensus String by rvosa (Curate) on Oct 20, 2005 at 15:20 UTC
Consensus DNA sequences are a technical term. Your suggestions are better, but 'consensus' has already stuck.	[reply]
Re: Finding Consensus String by BrowserUk (Patriarch) on Oct 18, 2005 at 15:18 UTC
This appears to meet the spec? #! perl -slw use strict; sub consensus { my( $aref ) = @_; my $result; my $xor = $aref->[0]; $xor ^= $_ for @{ $aref }[ 1 .. $#$aref ]; for my $o ( 0 .. length( $xor ) -1 ) { if( substr $xor, $o, 1 eq chr(0) ) { $result .= substr $aref->[0], $o, 1; } else { my %c; $c{ substr $_, $o, 1 }++ for @{ $aref }; my @sorted = sort{ $c{ $b } <=> $c{ $a } } keys %c; my $add = join '', grep{ $c{ $_ } == $c{ $sorted[ 0 ] } } +@sorted; $add = "[$add]" if length( $add ) > 1; $result .= $add; } } return $result; } my @sets = ( [ qw[AAAAA ATCGA ATAAA] ], [ qw[AAAAA AACGA ATAAA ATAAA] ], [ qw[ACCGTA ATCGTA ACTGGA ATCCGA ] ], ); print "[@$_] : ", consensus( $_ ) for @sets; __END__ P:\test>500962 [AAAAA ATCGA ATAAA] : ATAAA [AAAAA AACGA ATAAA ATAAA] : A[AT]AAA [ACCGTA ATCGTA ACTGGA ATCCGA] : A[TC]CG[TG]A [download] Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal? "Science is about questioning the status quo. Questioning authority". The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.	[reply] [d/l]
Re: Finding Consensus String by Moron (Curate) on Oct 21, 2005 at 08:57 UTC
In this example, having chosen to sort descending by count, you can take the opportunity to exit the loop as soon as the count drops. I couldn't find any additional tricks that ended up faster other than the fact that apart from that this code was as simply translated from the requirement as I could manage: my @instances = qw ( AAAAA ATCGA ATAAA ); my @instances2 = qw ( AAAAA AACGA ATAAA ATAAA ); for my $iref ( \@instances, \@instances2 ) { print Concensus( @$iref ) . "\n"; } sub Concensus { my $result = ''; my $length = length( $_[0] ); for (my $i = 0; $i < $length; $i++ ) { my %count = (); for my $input ( @_ ) { $count{ substr( $input, $i, 1 ) }++ } my $winners = ''; my $wincount = 0; for my $try ( sort { $count{ $b } <=> $count{ $a } } keys %cou +nt ) { if ( $wincount ) { ( $count{ $try } == $wincount ) or last; } else { $wincount = $count{ $try }; } $winners .= $try; } ( length( $winners ) > 1 ) and $winners = "[$winners]"; $result .= $winners; } return $result; } [download] -M Free your mind	[reply] [d/l]