Sorry, sorry, sorry, Monks! Here are some more verbose explanations and examples; I hope these help.

Before that, just a little bit of sequence-alignment lingo from the biologist's dictionary:

substitution: when a letter in one sequence is replaced by a different letter in another sequence. E.g. GGTA is substituted in 2 places to give CCTA, TACGACT is substituted in 1 place to give AACGACT, etc.

indel: when one letter in one sequence is replaced by nothing in the second sequence. The 1st sequence is said to have an insertion (IN) and the 2nd sequence a deletion (DEL), hence the term INDEL. E.g. TAGAGGATC and TAGAGATC differ by 1 indel position, so the 2nd sequence, when aligned, would be TAGAG-ATC or TAGA-GATC.
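To make the two terms concrete, here is a minimal sketch in Python (chosen only for illustration; the same logic ports directly to Perl). It assumes the two sequences are already aligned, i.e. they have equal length and use '-' for a gap. The function name `count_differences` is made up for this example.

```python
def count_differences(seq_a, seq_b, gap="-"):
    """Return (substitutions, indels) for two already-aligned sequences.

    Assumes equal-length strings in which `gap` marks a missing letter.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    substitutions = 0
    indels = 0
    for a, b in zip(seq_a, seq_b):
        if a == b:
            continue  # matching column
        if a == gap or b == gap:
            indels += 1  # a letter facing a gap
        else:
            substitutions += 1  # two different letters
    return substitutions, indels


print(count_differences("GGTA", "CCTA"))            # (2, 0)
print(count_differences("TACGACT", "AACGACT"))      # (1, 0)
print(count_differences("TAGAGGATC", "TAGAG-ATC"))  # (0, 1)
```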

case c: the mismatch is not due to substitution of one letter for another, but to a gap (shown as '-' here) caused by a missing letter in one of the 2 sequences being compared.

AC-TACGTAC ACGTACGTAC

or

ACGTACGTAC ACGTACGT-C

case d: the mismatch is due to substitution of one letter for another, and not to an insertion or deletion as shown in the examples above for case c.

CTTACGTAC CGTACGTAC

or

CGTACGTGC CGTACGTCC

case e: same as case d above, except that the matched lengths are 10 letters long, and not 9 letters as for case d. The mismatch is not due to an insertion or deletion, but to a substitution, again as for case d.

ACCTACGTAC ACGTACGTAC

or

GTACGTACGG GTACGTTCGG
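As a rough sketch of how cases c, d and e could be told apart programmatically, the Python snippet below lists each differing column of an aligned pair together with its kind. It assumes the pairs are given already aligned, exactly as written above; `describe_mismatches` is a hypothetical helper name, and positions are counted from 1.

```python
def describe_mismatches(seq_a, seq_b, gap="-"):
    """Return a list of (position, kind) for every differing column.

    Positions are 1-based; kind is "indel" when either side shows a gap,
    otherwise "substitution". Inputs are assumed to be already aligned.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    mismatches = []
    for position, (a, b) in enumerate(zip(seq_a, seq_b), start=1):
        if a == b:
            continue
        kind = "indel" if gap in (a, b) else "substitution"
        mismatches.append((position, kind))
    return mismatches


# case c: a gap, 10 columns
print(describe_mismatches("AC-TACGTAC", "ACGTACGTAC"))  # [(3, 'indel')]
# case d: a substitution, 9 letters
print(describe_mismatches("CTTACGTAC", "CGTACGTAC"))    # [(2, 'substitution')]
# case e: a substitution, 10 letters
print(describe_mismatches("ACCTACGTAC", "ACGTACGTAC"))  # [(3, 'substitution')]
```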

Some examples of what should pass the filter and what should not are shown below:

10nt sequences, no indels, no substitutions, perfect matches, passes filter, all OK

ATGGACGTAC ATGGACGTAC

9nt sequences, no indels, no substitutions, perfect matches, passes filter, all OK

CGTACAGTA CGTACAGTA

10nt sequences, 1 indel position, passes filter, OK

AC-TACGTAC ACGTACGTAC

10nt sequences, 2 indel positions in total, 1 indel on one sequence and the 2nd indel on the other sequence, does not pass filter, not OK

AC-TACGTAC ACGTACG-AC

10nt sequences, 2 indel positions in total, both indels on the same sequence, does not pass filter, not OK

AC-TAC-TAC ACGTACGTAC

Bottom line: when sequences of 10 letters are aligned to each other, at the very minimum 9 letters should be aligned, with a maximum of 1 indel or substitution. At the very best, all 10 letters match perfectly, with no indels and no substitutions. All the intermediate cases fall in between, and some of them are shown in the example alignments above. I hope things are a little easier to understand now, especially for non-biologists. Sorry for the cryptic explanation in my OP!
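The rule can be sketched as a small Python filter. This is only one reading of the rule, under these assumptions: the pair is already aligned with '-' for gaps, and a pair passes when the total number of differing columns (substitutions plus indels) is at most 1. The name `passes_filter` and the `max_differences` parameter are made up for this illustration.

```python
def passes_filter(seq_a, seq_b, max_differences=1):
    """Return True if the aligned pair differs in at most `max_differences` columns.

    Every differing column is either a substitution (two different letters)
    or an indel (a letter facing '-'), so counting unequal columns covers both.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differences <= max_differences


examples = [
    ("ATGGACGTAC", "ATGGACGTAC"),  # perfect 10nt match
    ("CGTACAGTA", "CGTACAGTA"),    # perfect 9nt match
    ("AC-TACGTAC", "ACGTACGTAC"),  # 1 indel
    ("AC-TACGTAC", "ACGTACG-AC"),  # 2 indels, one on each sequence
    ("AC-TAC-TAC", "ACGTACGTAC"),  # 2 indels on the same sequence
]
for seq_a, seq_b in examples:
    print(seq_a, seq_b, "OK" if passes_filter(seq_a, seq_b) else "not OK")
```

The first three pairs print OK and the last two print not OK, matching the examples listed above.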


In reply to Re^2: Filtering matches of near-perfect-matched DNA sequence pairs by onlyIDleft
in thread Filtering matches of near-perfect-matched DNA sequence pairs by onlyIDleft
