Re: Finding Consensus String
by davidrw (Prior) on Oct 18, 2005 at 13:11 UTC
|
before addressing the core issue, i see a couple immediate code items:
- avoid using $a and $b as variable names -- they have special meaning (see sort)
- In all of these: elsif ( $A[$b] = $T[$b] ) { you are using the assignment operator instead of the equality operator ==, so this code will not do what you want..
- not wrong as-is, but note it's "more perl-ish" to write loops like (multiple ways to write this): foreach my $b ( 0 .. $w-1 ){
i'll update shortly after looking at the actual issue.. my initial reaction is that the solution probably involves hashes instead of arrays (and maybe a function for "find-the-max" instead of that if/else ladder). | [reply] [d/l] [select] |
Re: Finding Consensus String
by japhy (Canon) on Oct 18, 2005 at 13:39 UTC
|
Since you're dealing with frequency, I'd use a hash to store the occurrences of A, C, G, and T in each position of the word. I'd loop over the DNA sequences for each character, looking at character 1 first in all strands, then character 2, etc.
sub consensus {
my $len = length($_[0]); # all strands are of one length, right?
my @cons;
# loop over the positions in the strands
for my $i (0 .. $len - 1) {
my %freq;
my $max = 0;
# loop over character $i in each of the strands
for (map substr($_, $i, 1), @_) {
$freq{$_}++;
# save the max value
$max = $freq{$_} if $freq{$_} > $max;
}
# store only those who occur the most times
push @cons, [grep $freq{$_} == $max, keys %freq];
}
return \@cons;
}
Then it's your task to turn the array of array references into the string you want.
| [reply] [d/l] |
|
|
Dear japhy,
Thanks so much for your answer to my posting. I just have
a question. Your code works well in generalize version
of array like this:
my @instances = ( 'AAAAA ATCGA',
'ATCGA AAAAA',
'ATAAA AAAAA',
);
But how can I make it more generalize for cases like
this?
my @instances2 = ( 'AAAAA ATCGA',
'ATCGA',
'ATAAA AAAAA',
);
# consensus : ATAAA A[AT][CA][GA]A
Or this:
my @instances3 = ( 'AAAAA ATCGA TTTT',
'ATCGA TATA ',
'ATAAA AAAAA ATTA ',
);
# consensus : ATAAA A[AT][CA][GA]A TTTA
Cause, currently it gives warning when encountering
two cases like that.
| [reply] [d/l] [select] |
|
|
I don't get any warnings. What I do notice is that space is included in the [...] parts. So here's a small fix:
sub consensus {
... # code from before, until:
# store only those who occur the most times
my @most = grep $freq{$_} == $max, keys %freq;
@most = grep $_ ne " ", @most if @most > 1;
push @cons, \@most;
}
return \@cons;
}
And to get the string version, I use:
sub consensus_to_string {
my $con = shift;
my $str;
for (@$con) {
if (@$_ > 1) { $str .= "[" . join("", @$_) . "]" }
else { $str .= $_->[0] }
}
return $str;
}
| [reply] [d/l] [select] |
Re: Finding Consensus String
by davidrw (Prior) on Oct 18, 2005 at 13:48 UTC
|
This doesn't change your functionality at all (i don't think), but does shorten all the code up and hopefully makes it much easier to modify from here.. Basically it makes use of hashes instead of having to declare a bunch of variables (e.g. $a, $g, $t, $c) so that everything is easier to work with.
One thing i didn't understand was the $w usage -- i assumed in my code that you want $w to be the length of the longest item, but your code was setting it to be the length of the last item...
use strict;
use warnings;
my @instances = qw ( AAAAA ATCGA ATAAA );
my @instances2 = qw ( AAAAA AACGA ATAAA ATAAA );
print consensus(@instances),"\n"; # ATAAA
print consensus(@instances2),"\n"; # ATAAA
exit;
sub consensus{
my @mi = @_;
chomp(@mi);
my $motif_count=0;
my @words =();
my %H = ( A=>[], T=>[], C=>[], G=>[] );
s/\s//g for @mi;
my ($w) = sort {$b <=> $a} map {length} @mi; # set w to the lengt
+h of the longest element
foreach my $j ( 0 .. $w-1 ){
# Initialize the base counts.
my %h = ( a=>0, t=>0, c=>0, g=>0 );
foreach my $mi ( @mi ) {
$h{ lc substr($mi,$j,1) }++; # example: $h{g}++
+;
}
push @{$H{ uc $_ }}, $h{$_} for keys %h; # example: push @{
+$H{G}}, $g;
}
my @cons = ();
my %prefOrder = ( A=>1, T=>2, C=>3, G=>4 );
foreach my $B ( 0 .. $w-1 ){
push @cons, [ sort { $H{$b}->[$B] <=> $H{$a}->[$B] || $prefOrder
+{$b} <=> $prefOrder{$a} } qw/A T G C/ ]->[0];
}
return @cons;
}
Update: more efficient way to set %H would be:
my %H = ( A=>[], T=>[], C=>[], G=>[] );
my @mi_letters = map { [split '', uc $_] } @mi;
foreach my $j ( 0 .. $w-1 ){
$H{ $_->[$j] }->[$j]++ for @mi_letters;
}
# note: to avoid warnings, change the sort further down to include t
+his (the '||0'):
# ($H{$b}->[$B]||0) <=> ($H{$a}->[$B]||0)
| [reply] [d/l] [select] |
Re: Finding Consensus String
by Fletch (Bishop) on Oct 18, 2005 at 13:50 UTC
|
Just a WAG, but given the problem domain I'd be surprised if BioPerl didn't have something available already to do this. You might check its docs first before going further on your own.
| [reply] |
|
|
And if you do go looking for this, is the word "consensus"
your own coinage, or does it have a technical meaning for
this operation?
Because "consensus" usually suggests "unanimity",
while what you're looking for here is "most popular",
like the mathematical mode. "Mode strings", perhaps?
| [reply] |
|
|
Howdy!
Consensus does *not* imply unanimity. It does imply that the group as
a whole agrees with the position, and therefore that no one strenuously
objects. It only takes one person to destroy consensus. I might not
agree with the consensus, but I might not feel strongly enough about
it to block the consensus.
| [reply] |
|
|
Consensus DNA sequences are a technical term. Your suggestions are better, but 'consensus' has already stuck.
| [reply] |
Re: Finding Consensus String
by BrowserUk (Patriarch) on Oct 18, 2005 at 15:18 UTC
|
#! perl -slw
use strict;
sub consensus {
my( $aref ) = @_;
my $result;
my $xor = $aref->[0];
$xor ^= $_ for @{ $aref }[ 1 .. $#$aref ];
for my $o ( 0 .. length( $xor ) -1 ) {
if( substr $xor, $o, 1 eq chr(0) ) {
$result .= substr $aref->[0], $o, 1;
}
else {
my %c; $c{ substr $_, $o, 1 }++ for @{ $aref };
my @sorted = sort{ $c{ $b } <=> $c{ $a } } keys %c;
my $add = join '', grep{ $c{ $_ } == $c{ $sorted[ 0 ] } }
+@sorted;
$add = "[$add]" if length( $add ) > 1;
$result .= $add;
}
}
return $result;
}
my @sets = (
[ qw[AAAAA ATCGA ATAAA] ],
[ qw[AAAAA AACGA ATAAA ATAAA] ],
[ qw[ACCGTA ATCGTA ACTGGA ATCCGA ] ],
);
print "[@$_] : ", consensus( $_ ) for @sets;
__END__
P:\test>500962
[AAAAA ATCGA ATAAA] : ATAAA
[AAAAA AACGA ATAAA ATAAA] : A[AT]AAA
[ACCGTA ATCGTA ACTGGA ATCCGA] : A[TC]CG[TG]A
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
| [reply] [d/l] |
Re: Finding Consensus String
by Moron (Curate) on Oct 21, 2005 at 08:57 UTC
|
In this example, having chosen to sort descending by count, you can take the opportunity to exit the loop as soon as the count drops. I couldn't find any additional tricks that ended up faster other than the fact that apart from that this code was as simply translated from the requirement as I could manage: my @instances = qw (
AAAAA
ATCGA
ATAAA
);
my @instances2 = qw (
AAAAA
AACGA
ATAAA
ATAAA
);
for my $iref ( \@instances, \@instances2 ) {
print Concensus( @$iref ) . "\n";
}
sub Concensus {
my $result = '';
my $length = length( $_[0] );
for (my $i = 0; $i < $length; $i++ ) {
my %count = ();
for my $input ( @_ ) {
$count{ substr( $input, $i, 1 ) }++
}
my $winners = '';
my $wincount = 0;
for my $try ( sort { $count{ $b } <=> $count{ $a } } keys %cou
+nt ) {
if ( $wincount ) {
( $count{ $try } == $wincount ) or last;
}
else {
$wincount = $count{ $try };
}
$winners .= $try;
}
( length( $winners ) > 1 ) and $winners = "[$winners]";
$result .= $winners;
}
return $result;
}
| [reply] [d/l] |