Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks!
I have a file with entries as follows:
>ttt SEQ:MKKTAIAIAVALAGFATVAQAAPKDNTWYTGAKLGWSQYHDTGFINNNGPTHENQLGAGAFGGYQ + LBL:IIIIIIIIIIIIIIIIIIIIIIIIIIIMMMMMMMMMMMOOOOOOOOOOOOOOOMMMMMMMMMMMI + LBL:IIIIIIIIIIIIIIIIIIIIIIIIIIIMMMMMMMMOOOOOOOOOOOOOOOOOOOOOOMMMMMMMM + LBL:IIIIIIIIIIIIIIIIIIIIIIIIIIIIIMMMMMMMMMMMMMOOOOOOOOOOOMMMMMMMMMMMI + LBL:IIIIIIIIIIIIIIIIIIIIIIIIIIIMMMMMMMMMOOOOOOOOOOOOOOOOOOOOMMMMMMMMI

What I must do is store all lines beginning with LBL (easy part) and then, create a consensus label line foreach letter in SEQ (M,K,K,T,A,I etc).The consensus line will have I,M,O, like all LBL lines, and, in order to descide which letter to use, we will assign a label letter to a SEQ letter if the majority of LBL lines agree.
For example, if for the letter B of SEQ, the first LBL line says I, the second I, the third M and the fourth I, we will use I in the consensus label line.
I thought using hashes, but I have a problem how to store each labelm, please advice...

Replies are listed 'Best First'.
Re: Can't think of an algorithm to go..
by rovf (Priest) on Jun 03, 2009 at 12:07 UTC
    if for the letter B of SEQ, the first LBL line says I, the second I, the third M and the fourth I, we will use I in the consensus label line.

    I don't see a letter B in the SEQ line of your example, but I guess you mean the following (again using your example):

    • The first letter in SEQ is M. All first letters in LBL are I, so you associate M with I.
    • Further back, SEQ contains a W. The corresponding places in the LBL lines (same column) are M, O, M and M, so the majority would be M. We associate W to M.
    It is not clear how to associate, say, the letter K, because the first K has only I in the LBL lines, while a later one has only M. also, what is the consensus if there is the same number of, say, I and O? Maybe you could specify your task in a more precise way?

    -- 
    Ronald Fischer <ynnor@mm.st>
Re: Can't think of an algorithm to go..
by ikegami (Patriarch) on Jun 03, 2009 at 15:12 UTC
    sub consensus { # Picks an arbitrary result from # among the highest in case of tie. my %counts; $counts{$_}++ for @_; return ( sort { $counts{$b} <=> $counts{$a} } keys(%counts) )[0]; } { chomp( my $seq = <> ); chomp( my @lbls = <> ); s/^SEQ:// for $seq; s/^LBL:// for @lbls; for my $i (0..length($seq)-1) { printf("%s %s\n", substr($seq, $i, 1), consensus( map substr($_, $i, 1), @lbls ), ); } }

    Alternate output:

    my $consensus = ''; for my $i (0..length($seq)-1) { $consensus .= consensus( map substr($_, $i, 1), @lbls ); } print("SEQ: $seq\n"); print("LBL: $consensus\n");

    Contrary to rovf and AZed, it's my understanding that the consensus for the "K" at index 1 isn't affected by the labels for the "K" at index 2. Let me know if that's wrong.

      Thanks a lot to both of you!
      Ikegami, although I am still trying to figure out what you code means, it works perfect!
      Thanks for your trouble!

        For each index (for my $i (0..length($seq)-1)), it gets the letter at that index in each label (map substr($_, $i, 1), @lbls).

        To calculate the consensus, the number of instances of each letter is counted ($counts{$_}++ for @_;). Then the letters (keys(%counts)) are sorted by descending number of instances (sort { $counts{$b} <=> $counts{$a} }), and one of those with the highest count is returned (( )[0]).

Re: Can't think of an algorithm to go..
by AZed (Monk) on Jun 03, 2009 at 13:27 UTC

    I'd create an object with the following arrays as attributes: @seq, @lbl1, @lbl2, @lbl3, @lbl4, @con. Each element of each array is one letter.

    At that point, it's fairly simple to write a method to iterate through the desired length and dump the total count of each possible LBL letter into a temporary hash, find the largest value of those, and write that letter out into the same position for @con.

Re: Can't think of an algorithm to go..
by Anonymous Monk on Jun 11, 2009 at 14:02 UTC
    Hmmm, sorry to bother you again, but, in the code you provided, could there be a modification when it comes to figuring out the consensus line? From what i can see, if, for example you have 5 I and 4 O, it is I. Can we say that, in order to decide for a specific position, there has to be an aggreement in 70% of the LBL sequences?
      sub consensus { my %counts; $counts{$_}++ for @_; my ($consensus) = sort { $counts{$b} <=> $counts{$a} } keys(%counts); return ( $counts{$consensus} / @_ >= 0.7 ? $consensus : undef ); }

      Up to you to decide what to do when consensus returns undef because a consensus wasn't achieved.

Re: Can't think of an algorithm to go..
by Anonymous Monk on Jun 11, 2009 at 14:56 UTC
    Please, see this:
    AAAAAAAAAAAAAAA-------------BBBBBBBBBBBBBBBBB-CCCCCCCCCCCCC---CCCCCCCC +CCCCCCC ------AAAAAAAAAAAAAAAAAA----------------BBBBBBBBBBBBBBBBBBCCCCCCCCCCCC +CCCCCCC -------------------AAAAAA---------BBBBBBBBBBBBBBBBBBBBBBB------CCCCCCC +CCCCCCC -------------------AAAAA----------------------------BBBBBBBBBBBBBDDDDD +DDDDDDD
    I add:
    sub consensus { $rest='-'; my %counts; $counts{$_}++ for @_; my ($consensus) = sort { $counts{$b} <=> $counts{$a} } keys(%counts); return ( $counts{$consensus} / @_ >= 0.7 ? $consensus : $rest ); }

    but it prints whatever I have in the last line of the file, for example:
    -------------------AAAAA----------------------------BBBBBBBBBBBBBDDDDD +DDDDDDD
    What am I missing?
      It prints
      -------------------AAAAA----------------BBBBB-------BBBBB------CCCCCCC +CCCCCCC
      for me.

      What am I missing?

      Your code. You didn't post it.

Re: Can't think of an algorithm to go..
by Anonymous Monk on Jun 13, 2009 at 12:44 UTC
    Ikegami, thanks again... Here is what I have added to what you wrote beforehand:
    sub consensus { $rest='-'; my %counts; $counts{$_}++ for @_; my ($consensus) = sort { $counts{$b} <=> $counts{$a} } keys(%counts); return ( $counts{$consensus} / @_ >= 0.7 ? $consensus : $rest ); } while(<>) { $consensus = ''; @total_seqs=(); $seq=$_; push @total_seqs,$seq; for $i (0..length($seq)-1) { $consensus.= consensus( map substr($_, $i, 1), @total_seqs ) +; } }print("CON:$consensus\n");
      I think I found it...
      I didn't need to have @total_seqs=(); inside my loop...
      now I have the same results as you...
      Thanks again for everything!