Can't think of an algorithm to go..

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Can't think of an algorithm to go.. by rovf (Priest) on Jun 03, 2009 at 12:07 UTC
if for the letter B of SEQ, the first LBL line says I, the second I, the third M and the fourth I, we will use I in the consensus label line. I don't see a letter B in the SEQ line of your example, but I guess you mean the following (again using your example): The first letter in SEQ is M. All first letters in LBL are I, so you associate M with I. Further back, SEQ contains a W. The corresponding places in the LBL lines (same column) are M, O, M and M, so the majority would be M. We associate W to M. It is not clear how to associate, say, the letter K, because the first K has only I in the LBL lines, while a later one has only M. also, what is the consensus if there is the same number of, say, I and O? Maybe you could specify your task in a more precise way? -- Ronald Fischer <ynnor@mm.st>	[reply] [d/l]
Re: Can't think of an algorithm to go.. by ikegami (Patriarch) on Jun 03, 2009 at 15:12 UTC
`sub consensus { # Picks an arbitrary result from # among the highest in case of tie. my %counts; $counts{$_}++ for @_; return ( sort { $counts{$b} <=> $counts{$a} } keys(%counts) )[0]; } { chomp( my $seq = <> ); chomp( my @lbls = <> ); s/^SEQ:// for $seq; s/^LBL:// for @lbls; for my $i (0..length($seq)-1) { printf("%s %s\n", substr($seq, $i, 1), consensus( map substr($_, $i, 1), @lbls ), ); } }` [download] Alternate output: `my $consensus = ''; for my $i (0..length($seq)-1) { $consensus .= consensus( map substr($_, $i, 1), @lbls ); } print("SEQ: $seq\n"); print("LBL: $consensus\n");` [download] Contrary to rovf and AZed, it's my understanding that the consensus for the "K" at index 1 isn't affected by the labels for the "K" at index 2. Let me know if that's wrong.	[reply] [d/l] [select]
Re^2: Can't think of an algorithm to go.. by Anonymous Monk on Jun 03, 2009 at 20:59 UTC
Thanks a lot to both of you! Ikegami, although I am still trying to figure out what you code means, it works perfect! Thanks for your trouble!	[reply]
Re^3: Can't think of an algorithm to go.. by ikegami (Patriarch) on Jun 03, 2009 at 21:23 UTC
For each index (`for my $i (0..length($seq)-1)`), it gets the letter at that index in each label (`map substr($_, $i, 1), @lbls`). To calculate the consensus, the number of instances of each letter is counted (`$counts{$_}++ for @_;`). Then the letters (`keys(%counts)`) are sorted by descending number of instances (`sort { $counts{$b} <=> $counts{$a} }`), and one of those with the highest count is returned (`( )[0]`).	[reply] [d/l] [select]
Re: Can't think of an algorithm to go.. by AZed (Monk) on Jun 03, 2009 at 13:27 UTC
I'd create an object with the following arrays as attributes: `@seq, @lbl1, @lbl2, @lbl3, @lbl4, @con`. Each element of each array is one letter. At that point, it's fairly simple to write a method to iterate through the desired length and dump the total count of each possible LBL letter into a temporary hash, find the largest value of those, and write that letter out into the same position for @con.	[reply] [d/l]
Re: Can't think of an algorithm to go.. by Anonymous Monk on Jun 11, 2009 at 14:02 UTC
Hmmm, sorry to bother you again, but, in the code you provided, could there be a modification when it comes to figuring out the consensus line? From what i can see, if, for example you have 5 I and 4 O, it is I. Can we say that, in order to decide for a specific position, there has to be an aggreement in 70% of the LBL sequences?	[reply]
Re^2: Can't think of an algorithm to go.. by ikegami (Patriarch) on Jun 11, 2009 at 14:42 UTC
`sub consensus { my %counts; $counts{$_}++ for @_; my ($consensus) = sort { $counts{$b} <=> $counts{$a} } keys(%counts); return ( $counts{$consensus} / @_ >= 0.7 ? $consensus : undef ); }` [download] Up to you to decide what to do when consensus returns undef because a consensus wasn't achieved.	[reply] [d/l]
Re: Can't think of an algorithm to go.. by Anonymous Monk on Jun 11, 2009 at 14:56 UTC
Please, see this: `AAAAAAAAAAAAAAA-------------BBBBBBBBBBBBBBBBB-CCCCCCCCCCCCC---CCCCCCCC +CCCCCCC ------AAAAAAAAAAAAAAAAAA----------------BBBBBBBBBBBBBBBBBBCCCCCCCCCCCC +CCCCCCC -------------------AAAAAA---------BBBBBBBBBBBBBBBBBBBBBBB------CCCCCCC +CCCCCCC -------------------AAAAA----------------------------BBBBBBBBBBBBBDDDDD +DDDDDDD` [download] I add: `sub consensus { $rest='-'; my %counts; $counts{$_}++ for @_; my ($consensus) = sort { $counts{$b} <=> $counts{$a} } keys(%counts); return ( $counts{$consensus} / @_ >= 0.7 ? $consensus : $rest ); }` [download] but it prints whatever I have in the last line of the file, for example: `-------------------AAAAA----------------------------BBBBBBBBBBBBBDDDDD +DDDDDDD` [download] What am I missing?	[reply] [d/l] [select]
Re^2: Can't think of an algorithm to go.. by ikegami (Patriarch) on Jun 11, 2009 at 21:40 UTC
It prints `-------------------AAAAA----------------BBBBB-------BBBBB------CCCCCCC +CCCCCCC` [download] for me. What am I missing? Your code. You didn't post it.	[reply] [d/l]
Re: Can't think of an algorithm to go.. by Anonymous Monk on Jun 13, 2009 at 12:44 UTC
Ikegami, thanks again... Here is what I have added to what you wrote beforehand: `sub consensus { $rest='-'; my %counts; $counts{$_}++ for @_; my ($consensus) = sort { $counts{$b} <=> $counts{$a} } keys(%counts); return ( $counts{$consensus} / @_ >= 0.7 ? $consensus : $rest ); } while(<>) { $consensus = ''; @total_seqs=(); $seq=$_; push @total_seqs,$seq; for $i (0..length($seq)-1) { $consensus.= consensus( map substr($_, $i, 1), @total_seqs ) +; } }print("CON:$consensus\n");` [download]	[reply] [d/l]
Re^2: Can't think of an algorithm to go.. by Anonymous Monk on Jun 14, 2009 at 09:32 UTC
I think I found it... I didn't need to have `@total_seqs=();` inside my loop... now I have the same results as you... Thanks again for everything!	[reply] [d/l]