in reply to Re: Filtering matches of near-perfect-matched DNA sequence pairs
in thread Filtering matches of near-perfect-matched DNA sequence pairs

Sorry, sorry, sorry Monks! Here are some more verbose explanations and examples; I hope these help.

Before that, just a little bit of sequence alignment lingo from the biologist's dictionary:

substitution: when a letter in one sequence is replaced by a different letter in another sequence, e.g. GGTA is substituted in 2 places to give CCTA, TACGACT is substituted in 1 place to give AACGACT, etc.
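
To make the definition concrete, here is a minimal sketch in Python that counts substitutions between two sequences of equal length. The function name `count_substitutions` is made up for illustration, and the sketch assumes the two sequences are already the same length with no gaps.

```python
def count_substitutions(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


print(count_substitutions("GGTA", "CCTA"))        # 2
print(count_substitutions("TACGACT", "AACGACT"))  # 1
```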

indel: when one letter in one sequence is replaced by nothing in the second sequence. The 1st sequence is said to have an insertion (IN) and the 2nd sequence a deletion (DEL), hence the term INDEL. E.g. TAGAGGATC and TAGAGATC differ by 1 indel position, so the 2nd sequence when aligned would be TAGAG-ATC or TAGA-GATC.
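
As a rough illustration of the indel example above, the sketch below lists every way the shorter sequence can be written with a single '-' so that it lines up with the longer one. The name `single_gap_alignments` is hypothetical, and the sketch only handles the case where the lengths differ by exactly one letter.

```python
def single_gap_alignments(longer, shorter):
    """Return the gapped forms of `shorter` that align to `longer` with one '-'."""
    if len(longer) != len(shorter) + 1:
        raise ValueError("lengths must differ by exactly one")
    alignments = []
    for i in range(len(longer)):
        # Dropping letter i from the longer sequence must give the shorter one.
        if longer[:i] + longer[i + 1:] == shorter:
            alignments.append(shorter[:i] + "-" + shorter[i:])
    return alignments


print(single_gap_alignments("TAGAGGATC", "TAGAGATC"))
# ['TAGA-GATC', 'TAGAG-ATC']
```

Both placements are reported because the deleted letter sits in a run of two G's, so the gap can go on either side.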

case c: the mismatch is not due to substitution of one letter for another, but a gap (shown as '-' here) due to a missing letter when comparing the 2 sequences

AC-TACGTAC
ACGTACGTAC

or

ACGTACGTAC
ACGTACGT-C

case d: the mismatch is due to substitution of one letter for another, and not an insertion or deletion as shown in the examples above for case c.

CTTACGTAC
CGTACGTAC

or

CGTACGTGC
CGTACGTCC

case e: same as case d above, except the matched lengths are 10 letters long, and not 9 letters as for case d. The mismatch is not due to insertion or deletion, but a substitution, again as for case d.

ACCTACGTAC
ACGTACGTAC

or

GTACGTACGG
GTACGTTCGG

Some examples of what should pass the filters and what should not are shown below

10nt sequences, no indels, no substitutions, perfect matches, passes filter, all OK

ATGGACGTAC
ATGGACGTAC

9nt sequences, no indels, no substitutions, perfect matches, passes filter, all OK

CGTACAGTA
CGTACAGTA

10nt sequences, 1 indel position, passes filter, OK

AC-TACGTAC
ACGTACGTAC

10nt sequences, 2 indel positions in total, 1 indel on one sequence and the 2nd indel on the other sequence, does not pass filter, not OK

AC-TACGTAC
ACGTACG-AC

10nt sequences, 2 indel positions in total, both indels on the same sequence, does not pass filter, not OK

AC-TAC-TAC
ACGTACGTAC

Bottom line is that when sequences of 10 letters are aligned to each other, there should be at the very minimum 9 letters that are aligned, with a maximum of 1 indel or substitution. At the very best, all 10 letters are perfectly matched with no indels and no substitutions. All other cases fall in between, and some of them are shown in the example alignments above. I hope things are a little easier to understand now, especially for non-biologists. Sorry for the cryptic explanation in my OP!
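
One possible reading of the rule above, sketched in Python: a pair passes if the two ungapped sequences are identical or differ by exactly one substitution or one indel. This is only an assumption about the intended filter. The name `within_one_edit` is made up, the inputs are assumed to be raw sequences without '-' characters, and the 9-or-10-letter length requirement is not enforced here.

```python
def within_one_edit(a, b):
    """True if a and b are identical or differ by one substitution or one indel."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False  # would need at least two indels
    if len(a) == len(b):
        # Same length: allow exactly one substituted position.
        return sum(1 for x, y in zip(a, b) if x != y) == 1
    if len(a) < len(b):
        a, b = b, a  # make `a` the longer sequence
    # Skip the common prefix, drop one letter from `a`, and compare the rest.
    i = 0
    while i < len(b) and a[i] == b[i]:
        i += 1
    return a[i + 1:] == b[i:]


pairs = [
    ("ATGGACGTAC", "ATGGACGTAC"),  # perfect match
    ("ACTACGTAC", "ACGTACGTAC"),   # AC-TACGTAC vs ACGTACGTAC, 1 indel
    ("ACCTACGTAC", "ACGTACGTAC"),  # 1 substitution
    ("ACTACGTAC", "ACGTACGAC"),    # 1 indel on each sequence
    ("ACTACTAC", "ACGTACGTAC"),    # 2 indels on the same sequence
]
for a, b in pairs:
    print(a, b, "OK" if within_one_edit(a, b) else "not OK")
```

The first three pairs come out as OK and the last two as not OK, which agrees with the examples listed earlier in this post.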

Re^3: Filtering matches of near-perfect-matched DNA sequence pairs
by choroba (Cardinal) on Mar 13, 2015 at 23:39 UTC
    Thank you. I still don't understand, though. What is the input? Are the dashes already there?
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

      And if the dashes are not there, are both sequences still 10 long? Unlike what you showed...

      Nope, the dashes are there to help the reader see where there is an insertion/deletion (indel) event rather than a substitution of a letter.

      Such a gap caused by indel(s), i.e. the absence of an aligned letter, is commonly signified by the '-' symbol in sequence alignments by biologists.

      You may replace it in your mind with just a blank space if that helps you. Hope that clarifies it.

        Then what actual input caused the strings with the dashes in them? Are all strings with - in them a result instead of an input?

        This just makes it less clear exactly what the inputs to your problem are. You are showing sequences with dashes and sequences with only nine characters in them, in contradiction to your original problem statement (two 10-character strings).

        Please make up a test file with the *real* input sequences in it, and also show the expected output for each input.