in reply to Filtering matches of near-perfect-matched DNA sequence pairs

This looks like a really fun problem to work on, but I'm unclear about your listed possibilities. Could you please (please please please) provide several examples of each of the five listed cases and a test set (maybe 20 or 30 items) with expected results?


Re^2: Filtering matches of near-perfect-matched DNA sequence pairs
by onlyIDleft (Scribe) on Mar 13, 2015 at 23:30 UTC

    Sorry, sorry, sorry, Monks! Here are some verbose explanations and examples; I hope these help.

    Before that, a little bit of sequence alignment lingo from the biologist's dictionary:

    substitution: when a letter in one sequence is replaced by a different letter in another sequence, e.g. GGTA is substituted in 2 places to give CCTA, TACGACT is substituted in 1 place to give AACGACT, etc.
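For non-biologists, counting substitutions between two equal-length sequences is just a position-by-position comparison. A minimal sketch in Python, assuming both sequences are already the same length and contain no gaps (the function name `count_substitutions` is made up for illustration):

```python
def count_substitutions(seq_a: str, seq_b: str) -> int:
    """Count positions where two equal-length, ungapped sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length")
    # Compare the sequences column by column and count the mismatches.
    return sum(1 for x, y in zip(seq_a, seq_b) if x != y)


print(count_substitutions("GGTA", "CCTA"))        # 2
print(count_substitutions("TACGACT", "AACGACT"))  # 1
```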

    indel: when one letter in one sequence is replaced by nothing in the second sequence. The 1st sequence is said to have an insertion (IN) and the 2nd sequence a deletion (DEL), hence the term INDEL. E.g. TAGAGGATC and TAGAGATC differ by 1 indel position, so the 2nd sequence, when aligned, would be TAGAG-ATC or TAGA-GATC.

    case c: the mismatch is not due to substitution of one letter for another, but a gap (shown as '-' here) due to a missing letter when comparing the 2 sequences

    AC-TACGTAC ACGTACGTAC

    or

    ACGTACGTAC ACGTACGT-C

    case d: the mismatch is due to substitution of one letter for another, and not an insertion or deletion as shown in the examples above for case c.

    CTTACGTAC CGTACGTAC

    or

    CGTACGTGC CGTACGTCC

    case e: same as case d above, except the matched lengths are 10 letters long, not 9 letters as for case d. The mismatch is not due to an insertion or deletion but to a substitution, again as for case d.

    ACCTACGTAC ACGTACGTAC

    or

    GTACGTACGG GTACGTTCGG

    Some examples of what should pass the filters and what should not are shown below:

    10nt sequences, no indels, no substitutions, perfect matches, passes filter, all OK

    ATGGACGTAC ATGGACGTAC

    9nt sequences, no indels, no substitutions, perfect matches, passes filter, all OK

    CGTACAGTA CGTACAGTA

    10nt sequences, 1 indel position, passes filter, OK

    AC-TACGTAC ACGTACGTAC

    10nt sequences, 2 indel positions in total, 1 indel on one sequence and the 2nd indel on the other sequence, does not pass filter, not OK

    AC-TACGTAC ACGTACG-AC

    10nt sequences, 2 indel positions in total, both indels on the same sequence, does not pass filter, not OK

    AC-TAC-TAC ACGTACGTAC

    The bottom line is that when sequences of 10 letters are aligned to each other, at least 9 letters must be aligned, with a maximum of 1 indel or substitution. At the very best, all 10 letters are perfectly matched, with no indels and no substitutions. The intermediate cases fall in between; some of them are shown in the example alignments above. I hope things are a little easier to understand now, especially for non-biologists. Sorry for the cryptic explanation in my OP!
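The rule can be sketched in Python, assuming the two sequences arrive already aligned, i.e. the same length, with '-' marking a gap. The function name `passes_filter` and the `max_diffs` parameter are made up for illustration. The sketch only counts differing columns and leaves out any length threshold, since the post does not spell out how 9nt pairs with one mismatch should be treated.

```python
def passes_filter(aln_a: str, aln_b: str, max_diffs: int = 1) -> bool:
    """Return True if two aligned sequences differ in at most max_diffs columns.

    A differing column is either a substitution (two different letters)
    or an indel (a '-' in one of the two sequences).
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must be the same length")
    diffs = sum(1 for x, y in zip(aln_a, aln_b) if x != y or x == "-")
    return diffs <= max_diffs


# Pairs taken from the examples in the post.
print(passes_filter("ATGGACGTAC", "ATGGACGTAC"))  # perfect match
print(passes_filter("AC-TACGTAC", "ACGTACGTAC"))  # 1 indel
print(passes_filter("AC-TACGTAC", "ACGTACG-AC"))  # 2 indels, one per sequence
print(passes_filter("AC-TAC-TAC", "ACGTACGTAC"))  # 2 indels, same sequence
```

Under these assumptions the first two pairs pass and the last two do not, which matches the OK / not OK labels given above.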

      Thank you. I still don't understand, though. What is the input? Are the dashes already there?
      لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

        And if the dashes are not there, are both sequences still 10 long? Unlike what you showed...

        Nope, the dashes are there to help the reader see where there is an insertion/deletion (indel) event rather than a substitution of a letter.

        Such a gap caused by indel(s), i.e. the absence of an aligned letter, is commonly signified by the '-' symbol in sequence alignments by biologists.

        You may replace it in your mind with just a blank space if that helps you. Hope that clarifies it.
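If the dashes are not part of the input, one way to read the rule is "the two raw sequences differ by at most one substitution or one indel". A rough sketch under that assumption follows; the function name `within_one_edit` is hypothetical, and this is only one interpretation of the filter, not necessarily how the OP's alignments were produced.

```python
def within_one_edit(seq_a: str, seq_b: str) -> bool:
    """Return True if two ungapped sequences differ by at most one
    substitution or one single-letter indel."""
    if abs(len(seq_a) - len(seq_b)) > 1:
        return False  # would need at least two indels
    if len(seq_a) == len(seq_b):
        # Same length: only substitutions can explain the differences.
        return sum(x != y for x, y in zip(seq_a, seq_b)) <= 1
    longer, shorter = (seq_a, seq_b) if len(seq_a) > len(seq_b) else (seq_b, seq_a)
    # Walk past the common prefix, then drop one letter from the longer
    # sequence; the remaining suffixes must be identical.
    i = 0
    while i < len(shorter) and longer[i] == shorter[i]:
        i += 1
    return longer[i + 1:] == shorter[i:]


print(within_one_edit("TAGAGGATC", "TAGAGATC"))    # the indel example from the post
print(within_one_edit("ACGTACGTAC", "ACTACGTAC"))  # 10nt vs 9nt, one letter missing
print(within_one_edit("GGTA", "CCTA"))             # two substitutions
```

The first two calls return True and the third returns False. Note that a pair with one indel on each sequence has equal raw lengths, so this sketch sees it as several substitutions and rejects it.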