monkfan has asked for the wisdom of the Perl Monks concerning the following question:

My most respected monks,
I am looking a way to find consensus string from given set of strings (length is assumed to be the same within the set). Basically consensus string mean single string that represent most common occurence from the given set. Here is the example of it:
Set 1: Set 2: AAAAA AAAAA ATCGA AACGA ATAAA ATAAA ATAAA Consensus: ATAAA Consensus: A[AT]AAA Note: like Set 2, there can be other cases that has more than two consensus in a particular position. They need to be captured within square bracket [].
Below is my incredibly naive, inefficient and inflexible code. There are two issues which I still unable to solve here:
So here I am again, humbly seeking for your wisdom.

My current code:
use strict; use warnings; my @instances = qw ( AAAAA ATCGA ATAAA ); my @instances2 = qw ( AAAAA AACGA ATAAA ATAAA ); print consensus(@instances),"\n"; sub consensus{ my @mi = @_; my $motif_count=0; my @words =(); my $L = undef; my @A = (); my @T = (); my @C = (); my @G = (); my $a = 0; my $c = 0; my $g = 0; my $t = 0; my $w = 0; foreach my $mi ( @mi ) { chomp($mi); $mi =~ s/\s//g; $w = length($mi); #for motif instances count @words = split( /\W+/, $mi ); $motif_count += @words; } for ( my $j = 0 ; $j < $w ; $j++ ) { # Initialize the base counts. my $L =0; foreach my $mi ( @mi ) { chomp($mi); my $L = $mi; my $sb = substr( $L, $j, 1 ); while ( $sb =~ /a/ig ) { $a++ } while ( $sb =~ /t/ig ) { $t++ } while ( $sb =~ /c/ig ) { $c++ } while ( $sb =~ /g/ig ) { $g++ } } push( @A, $a ); push( @T, $t ); push( @C, $c ); push( @G, $g ); $a = 0; $c = 0; $g = 0; $t = 0; } my @cons = (); #print "$w\n"; for (my $b=0; $b <$w ;$b++) { if ( $A[$b] > $T[$b] && $A[$b] > $C[$b] && $A[$b] > $G[$b] ) { push( @cons, 'A'); } elsif ( $T[$b] > $C[$b] && $T[$b] > $A[$b] && $T[$b] > $G[$b] ) { push( @cons, 'T' ); } elsif ( $C[$b] > $G[$b] && $C[$b] > $A[$b] && $C[$b] > $T[$b] ) { push( @cons, 'C' ); } elsif ( $G[$b] > $A[$b] && $G[$b] > $C[$b] && $G[$b] > $T[$b] ) { push( @cons, 'G' ); } elsif ( $A[$b] = $T[$b] ) { push( @cons, 'T' ); } elsif ( $A[$b] = $G[$b] ) { push( @cons, 'G' ); } elsif ( $A[$b] = $C[$b] ) { push( @cons, 'C' ); } elsif ( $T[$b] = $C[$b] ) { push( @cons, 'C' ); } elsif ( $T[$b] = $G[$b] ) { push( @cons, 'G' ); } else { push @cons, 'G'; } } return @cons; }

Regards,
Edward

Replies are listed 'Best First'.
Re: Finding Consensus String
by davidrw (Prior) on Oct 18, 2005 at 13:11 UTC
    before addressing the core issue, i see a couple immediate code items:
    • avoid using $a and $b as variable names -- they have special meaning (see sort)
    • In all of these: elsif ( $A[$b] = $T[$b] ) { you are using the assignment operator instead of the equality operator ==, so this code will not do what you want..
    • not wrong as-is, but note it's "more perl-ish" to write loops like (multiple ways to write this): foreach my $b ( 0 .. $w-1 ){
    i'll update shortly after looking at the actual issue.. my initial reaction is that the solution probably involves hashes instead of arrays (and maybe a function for "find-the-max" instead of that if/else ladder).
Re: Finding Consensus String
by japhy (Canon) on Oct 18, 2005 at 13:39 UTC
    Since you're dealing with frequency, I'd use a hash to store the occurrences of A, C, G, and T in each position of the word. I'd loop over the DNA sequences for each character, looking at character 1 first in all strands, then character 2, etc.
    sub consensus { my $len = length($_[0]); # all strands are of one length, right? my @cons; # loop over the positions in the strands for my $i (0 .. $len - 1) { my %freq; my $max = 0; # loop over character $i in each of the strands for (map substr($_, $i, 1), @_) { $freq{$_}++; # save the max value $max = $freq{$_} if $freq{$_} > $max; } # store only those who occur the most times push @cons, [grep $freq{$_} == $max, keys %freq]; } return \@cons; }
    Then it's your task to turn the array of array references into the string you want.

    Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
    How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart
      Dear japhy,
      Thanks so much for your answer to my posting. I just have a question. Your code works well in generalize version of array like this:
      my @instances = ( 'AAAAA ATCGA', 'ATCGA AAAAA', 'ATAAA AAAAA', );
      But how can I make it more generalize for cases like this?
      my @instances2 = ( 'AAAAA ATCGA', 'ATCGA', 'ATAAA AAAAA', ); # consensus : ATAAA A[AT][CA][GA]A
      Or this:
      my @instances3 = ( 'AAAAA ATCGA TTTT', 'ATCGA TATA ', 'ATAAA AAAAA ATTA ', ); # consensus : ATAAA A[AT][CA][GA]A TTTA
      Cause, currently it gives warning when encountering two cases like that.

      Regards,
      Edward
        I don't get any warnings. What I do notice is that space is included in the [...] parts. So here's a small fix:
        sub consensus { ... # code from before, until: # store only those who occur the most times my @most = grep $freq{$_} == $max, keys %freq; @most = grep $_ ne " ", @most if @most > 1; push @cons, \@most; } return \@cons; }
        And to get the string version, I use:
        sub consensus_to_string { my $con = shift; my $str; for (@$con) { if (@$_ > 1) { $str .= "[" . join("", @$_) . "]" } else { $str .= $_->[0] } } return $str; }

        Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
        How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart
Re: Finding Consensus String
by davidrw (Prior) on Oct 18, 2005 at 13:48 UTC
    This doesn't change your functionality at all (i don't think), but does shorten all the code up and hopefully makes it much easier to modify from here.. Basically it makes use of hashes instead of having to declare a bunch of variables (e.g. $a, $g, $t, $c) so that everything is easier to work with.

    One thing i didn't understand was the $w usage -- i assumed in my code that you want $w to be the length of the longest item, but your code was setting it to be the length of the last item...
    use strict; use warnings; my @instances = qw ( AAAAA ATCGA ATAAA ); my @instances2 = qw ( AAAAA AACGA ATAAA ATAAA ); print consensus(@instances),"\n"; # ATAAA print consensus(@instances2),"\n"; # ATAAA exit; sub consensus{ my @mi = @_; chomp(@mi); my $motif_count=0; my @words =(); my %H = ( A=>[], T=>[], C=>[], G=>[] ); s/\s//g for @mi; my ($w) = sort {$b <=> $a} map {length} @mi; # set w to the lengt +h of the longest element foreach my $j ( 0 .. $w-1 ){ # Initialize the base counts. my %h = ( a=>0, t=>0, c=>0, g=>0 ); foreach my $mi ( @mi ) { $h{ lc substr($mi,$j,1) }++; # example: $h{g}++ +; } push @{$H{ uc $_ }}, $h{$_} for keys %h; # example: push @{ +$H{G}}, $g; } my @cons = (); my %prefOrder = ( A=>1, T=>2, C=>3, G=>4 ); foreach my $B ( 0 .. $w-1 ){ push @cons, [ sort { $H{$b}->[$B] <=> $H{$a}->[$B] || $prefOrder +{$b} <=> $prefOrder{$a} } qw/A T G C/ ]->[0]; } return @cons; }
    Update: more efficient way to set %H would be:
    my %H = ( A=>[], T=>[], C=>[], G=>[] ); my @mi_letters = map { [split '', uc $_] } @mi; foreach my $j ( 0 .. $w-1 ){ $H{ $_->[$j] }->[$j]++ for @mi_letters; } # note: to avoid warnings, change the sort further down to include t +his (the '||0'): # ($H{$b}->[$B]||0) <=> ($H{$a}->[$B]||0)
Re: Finding Consensus String
by Fletch (Bishop) on Oct 18, 2005 at 13:50 UTC

    Just a WAG, but given the problem domain I'd be surprised if BioPerl didn't have something available already to do this. You might check its docs first before going further on your own.

      And if you do go looking for this, is the word "consensus" your own coinage, or does it have a technical meaning for this operation?

      Because "consensus" usually suggests "unanimity", while what you're looking for here is "most popular", like the mathematical mode. "Mode strings", perhaps?

        Howdy!

        Consensus does *not* imply unanimity. It does imply that the group as a whole agrees with the position, and therefore that no one strenuously objects. It only takes one person to destroy consensus. I might not agree with the consensus, but I might not feel strongly enough about it to block the consensus.

        yours,
        Michael
        Consensus DNA sequences are a technical term. Your suggestions are better, but 'consensus' has already stuck.
Re: Finding Consensus String
by BrowserUk (Patriarch) on Oct 18, 2005 at 15:18 UTC

    This appears to meet the spec?

    #! perl -slw use strict; sub consensus { my( $aref ) = @_; my $result; my $xor = $aref->[0]; $xor ^= $_ for @{ $aref }[ 1 .. $#$aref ]; for my $o ( 0 .. length( $xor ) -1 ) { if( substr $xor, $o, 1 eq chr(0) ) { $result .= substr $aref->[0], $o, 1; } else { my %c; $c{ substr $_, $o, 1 }++ for @{ $aref }; my @sorted = sort{ $c{ $b } <=> $c{ $a } } keys %c; my $add = join '', grep{ $c{ $_ } == $c{ $sorted[ 0 ] } } +@sorted; $add = "[$add]" if length( $add ) > 1; $result .= $add; } } return $result; } my @sets = ( [ qw[AAAAA ATCGA ATAAA] ], [ qw[AAAAA AACGA ATAAA ATAAA] ], [ qw[ACCGTA ATCGTA ACTGGA ATCCGA ] ], ); print "[@$_] : ", consensus( $_ ) for @sets; __END__ P:\test>500962 [AAAAA ATCGA ATAAA] : ATAAA [AAAAA AACGA ATAAA ATAAA] : A[AT]AAA [ACCGTA ATCGTA ACTGGA ATCCGA] : A[TC]CG[TG]A

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
Re: Finding Consensus String
by Moron (Curate) on Oct 21, 2005 at 08:57 UTC
    In this example, having chosen to sort descending by count, you can take the opportunity to exit the loop as soon as the count drops. I couldn't find any additional tricks that ended up faster other than the fact that apart from that this code was as simply translated from the requirement as I could manage:
    my @instances = qw ( AAAAA ATCGA ATAAA ); my @instances2 = qw ( AAAAA AACGA ATAAA ATAAA ); for my $iref ( \@instances, \@instances2 ) { print Concensus( @$iref ) . "\n"; } sub Concensus { my $result = ''; my $length = length( $_[0] ); for (my $i = 0; $i < $length; $i++ ) { my %count = (); for my $input ( @_ ) { $count{ substr( $input, $i, 1 ) }++ } my $winners = ''; my $wincount = 0; for my $try ( sort { $count{ $b } <=> $count{ $a } } keys %cou +nt ) { if ( $wincount ) { ( $count{ $try } == $wincount ) or last; } else { $wincount = $count{ $try }; } $winners .= $try; } ( length( $winners ) > 1 ) and $winners = "[$winners]"; $result .= $winners; } return $result; }

    -M

    Free your mind