gudluck has asked for the wisdom of the Perl Monks concerning the following question:

I have a bunch of 10 or more very closely related chromosome (DNA) sequences (from different strains) aligned to a "single" reference chromosome. How can I generate multiple sequence alignment along with "reference" of all these sequences without affecting individual alignments to the reference.

What I need is to do is to "add dashes" at relative "insert" positions in other sequences, so that I get a global alignment with respect to reference. "Note. Here I am NOT looking for sequence similarities".

For example: If from position 100 to 125 in sequence No. 1 has an insert. but not in other 9 sequences. In the final expanded alignment I will insert 26 dashes (-) (from 100-125) in the sequences 2 to 10 and in the reference.

And in another situation like above if seq1 has insert at 100-125 and if seq5 has insert from 110-115 and seq8 has insert from 112-119. then I need to insert dashes like above, except in these positions (in seq5: 110-115 and in seq8: 112-119)

Please give me suggestions of logic to script to address this problem.
Input: Original reference: CGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCC Pairwise alignments: Ref1: CGACAAT--GCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCC Seq1: CGACAATAAGCACGACAGAGGAAGCAGAACAGATA-----ATTGCCTCTCATTTTC-CTCCC Ref1: CGACAATGCACGACAGAGGAAGC--AGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCC Seq2: CGACAAT-CACGACAGAGGAAGCTTAGAACAGATATTTAG---GCCTCTCATTTTCTCTCCC Ref1: CGACAATGCACGACAGAGGAAG----CAGAACAGATATTTAGATTGCCTCTCA----TTTTCTC +TCCC Seq3: CGACAATGCACGACAGAGGAAGTTTTCAGAACAGATATTTAGATTGCCTCTCAAAAATTTTCTC +TCCC Output: Final Multiple sequence alignment: Ref1: CGACAAT--GCACGACAGAGGAAG----C--AGAACAGATATTTAGATTGCCTCTCA----TTT +TCTCTCCC Seq1: CGACAATAAGCACGACAGAGGAAG----C--AGAACAGATA-----ATTGCCTCTCA----TTT +TC-CTCCC Seq2: CGACAAT---CACGACAGAGGAAG----CTTAGAACAGATATTTAG---GCCTCTCA----TTT +TCTCTCCC Seq3: CGACAAT--GCACGACAGAGGAAGTTTTC--AGAACAGATATTTAGATTGCCTCTCAAAAATTT +TCTCTCCC
  • Comment on how to construct multiple sequence alignment from group of pairwise alignments
  • Download Code

Replies are listed 'Best First'.
Re: how to construct multiple sequence alignment from group of pairwise alignments
by Corion (Patriarch) on Oct 19, 2010 at 17:26 UTC

    Have you looked at http://www.bioperl.org? This is not a script writing service and we don't really know much about DNA either. So, if you have a specific Perl problem, show your code, part of your input data, the output you get and the output you want.

    If you want advice on the general approach, I think you will have to change your problem description to require less biologic background.

Re: how to construct multiple sequence alignment from group of pairwise alignments
by snape (Pilgrim) on Oct 20, 2010 at 00:36 UTC

    There are couple of ways to do it. If you are interested in using the Clustalw (EBI) then you have to install Bioperl module for your perl to give the multiple sequences as a set and get the output through it. The module within Bioperl package for using it is here: Clustalw. Have a look at the synopsis as it very clearly defined

    The other way is by writing a Needleman Wunsch Program as a function in your perl program, providing the sequences one by one and making it compare with the ref seq sequence. Your Needleman Wunsch function is as follows:

    # Needleman-Wunsch Algorithm sub NeedlemanWunsch { my ($seq1, $seq2) = @ARGV; # scoring scheme my $MATCH = 1; # +1 for letters that match my $MISMATCH = -1; # -1 for letters that mismatch my $GAP = -1; # -1 for any gap # initialization my @matrix; $matrix[0][0]{score} = 0; $matrix[0][0]{pointer} = "none"; for(my $j = 1; $j <= length($seq1); $j++) { $matrix[0][$j]{score} = $GAP * $j; $matrix[0][$j]{pointer} = "left"; } for (my $i = 1; $i <= length($seq2); $i++) { $matrix[$i][0]{score} = $GAP * $i; $matrix[$i][0]{pointer} = "up"; } # fill for(my $i = 1; $i <= length($seq2); $i++) { for(my $j = 1; $j <= length($seq1); $j++) { my ($diagonal_score, $left_score, $up_score); # calculate match score my $letter1 = substr($seq1, $j-1, 1); my $letter2 = substr($seq2, $i-1, 1); + if ($letter1 eq $letter2) { $diagonal_score = $matrix[$i-1][$j-1]{score} + $MATCH; } else { $diagonal_score = $matrix[$i-1][$j-1]{score} + $MISMATCH; } # calculate gap scores $up_score = $matrix[$i-1][$j]{score} + $GAP; $left_score = $matrix[$i][$j-1]{score} + $GAP; # choose best score if ($diagonal_score >= $up_score) { if ($diagonal_score >= $left_score) { $matrix[$i][$j]{score} = $diagonal_score; $matrix[$i][$j]{pointer} = "diagonal"; } else { $matrix[$i][$j]{score} = $left_score; $matrix[$i][$j]{pointer} = "left"; } } else { if ($up_score >= $left_score) { $matrix[$i][$j]{score} = $up_score; $matrix[$i][$j]{pointer} = "up"; } else { $matrix[$i][$j]{score} = $left_score; $matrix[$i][$j]{pointer} = "left"; } } } } # trace-back my $align1 = ""; my $align2 = ""; # start at last cell of matrix my $j = length($seq1); my $i = length($seq2); while (1) { last if $matrix[$i][$j]{pointer} eq "none"; # ends at first cell o +f matrix if ($matrix[$i][$j]{pointer} eq "diagonal") { $align1 .= substr($seq1, $j-1, 1); $align2 .= substr($seq2, $i-1, 1); $i--; $j--; } elsif ($matrix[$i][$j]{pointer} eq "left") { $align1 .= substr($seq1, $j-1, 1); $align2 .= "-"; $j--; } elsif ($matrix[$i][$j]{pointer} eq "up") { $align1 .= "-"; $align2 .= substr($seq2, $i-1, 1); $i--; } } $align1 = reverse $align1; $align2 = reverse $align2; print "$align1\n"; print "$align2\n"; }

    I would prefer that you should use the Clustalw rather than using a needleman wunsch program as it would answer your questions related to the matrices that you want to use. I hope it works for you. All the best

      Thank you snape. I am aware of Needleman Wunsch Algorithm and clustalw /Muscle or other alignment programs. All these look into sequence similarities etc in generating MSA, Which I am not interested. I need to specifically align only using insert/deletions (indels) in respective sequences.
Re: how to construct multiple sequence alignment from group of pairwise alignments
by BrowserUk (Patriarch) on Oct 19, 2010 at 17:57 UTC

    It's not clear to me what you are actually asking for help with here?

    Do you already know where the insertion points are and just need help with adding the dashes in the right places?

    Or are you looking for an algorithm that allows you to find the insertion points?

    Also, little data and/or a worked example goes a long. If you have a set of sequences--shorted to say 200/300 chars; don't post megabytes sequences) that demonstrate what you need to do, it is usually much clearer to non-bio(*) monks.

    (*)Good'ol fashioned soap was good enough for my mum...{drone}... :)


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Sorry for more biological description. I have updated with very short input and output information. There are many multiple sequence alignments programs, but they all consider sequence similarity (based on physico-chemical). I mainly interested in inserting "dashes".

        Are you saying you have (are given) the pairwise alignments (as strings with dashes) and all you need to produce is the "Final Multiple sequence alignment"?

        Also, it there a typo?

        ............V Ref1: CGACAAT--GCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCC Seq1: CGACAAAAAGCACGACAGAGGAAGCAGAACAGATA-----ATTGCCTCTCATTTTC-CTCCC ............^

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: how to construct multiple sequence alignment from group of pairwise alignments
by aquarium (Curate) on Oct 19, 2010 at 22:36 UTC
    i think i understand what you're after. a feature of the TK language is tagging in text/area fields. the tags are a way of marking positions (starting and ending) in text, which does not change the character count of the text. so what i have in mind is that you compare the reference sequence to each comparative sequence in turn, adding tags as you go. then at the end you do some post-processing to insert/delete/amend/other the text field depending on the tags inserted. or if you want the tags to immediately do something, you can bind them to actions (callbacks.) Anyway, I can't say enough about this really neat feature in TK. Most perls have TK or you can install it and the bindings.
    hope that's what you were looking for.
    the hardest line to type correctly is: stty erase ^H