Greetings Perl Monks!

I seek your wisdom for a problem of mine: I need to align two short DNA sequences. The starting material is two sequences, each 10 nucleotides long. The sequences should typically consist of As, Ts, Gs, or Cs; where the sequence is ambiguous, Ns can also be expected.

There are a few different ways in which the sequences may align to each other, and I list them below:

a. 9 out of 10 in both align to each other perfectly

b. 10 out of 10 in both align to each other perfectly

c. 9 in one and 10 in the other align to each other - with this imperfect alignment due to an insertion/deletion

d. 9 out of 9 in both align to each other, but imperfectly due to substitution - but I will allow only one such substitution - for biological reasons

e. 10 out of 10 in both align to each other, but imperfectly due to substitution - but I will allow only one such substitution - again for biological reasons

In all of the cases above, neither sequence may contain anything but A/T/G/C. If there are other letters, such as Ns, those cases need to be discarded without even performing the match test.
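To make the discard rule concrete, here is a minimal sketch of the pre-filter I have in mind. The name `is_clean_dna` is hypothetical, and I am assuming the sequences are uppercase, so lowercase letters are treated as non-canonical too.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 only if the sequence is non-empty and
# made up solely of uppercase A, C, G and T; returns 0 otherwise.
sub is_clean_dna {
    my ($seq) = @_;
    return 0 unless defined $seq && length $seq;
    # With the /c flag and an empty replacement list, tr counts the
    # characters that are NOT in ACGT and leaves the string unchanged.
    my $bad = ( $seq =~ tr/ACGT//c );
    return $bad == 0 ? 1 : 0;
}

for my $candidate ( 'ACGTACGTAC', 'ACGTNCGTAC', 'acgtacgtac' ) {
    printf "%s\t%s\n", $candidate, is_clean_dna($candidate) ? 'keep' : 'discard';
}
```

Counting the complement of ACGT avoids having to spell out every other letter, and it also catches digits, gaps and whitespace.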

As a first-pass attempt, I have cobbled together a script, but I know it does not test for all of the cases above. Would you please tell me whether I should use a different approach to test all of the cases listed, or whether I can adapt what I already have?

Thank you exalted ones!

    if((defined $upstream_putative_TSD)&&(defined $downstream_putative_TSD)) {
        # Check if the putative TSDs differ by just 1 mismatch or are perfect matches
        my $max_SNP = 1;
        my $diffCount = () = ( $upstream_putative_TSD ^ $downstream_putative_TSD ) =~ /[^\x00]/g;
        # print $upstream_putative_TSD, "\t", $downstream_putative_TSD, "\t", $diffCount, "\n"; # OK thus far
        # syntax idea from https://www.biostars.org/p/83978/
        if ($diffCount <= $max_SNP) {
            my $upstream_putative_TSD_non_canonical_letter_count = $upstream_putative_TSD =~ tr/BDEFHIJKLMNOPQRSUVWXYZ//;
            # check to see whether upstream putative TSD contains anything but A/T/G/C, if yes, how many
            my $downstream_putative_TSD_non_canonical_letter_count = $downstream_putative_TSD =~ tr/BDEFHIJKLMNOPQRSUVWXYZ//;
            # check to see whether downstream putative TSD contains anything but A/T/G/C, if yes, how many
            if(($upstream_putative_TSD_non_canonical_letter_count==0)&&($downstream_putative_TSD_non_canonical_letter_count==0)) {
                print $_, "\n";
                push @output, $_, "\n";
            }
        }
    }
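The script above only counts mismatches between strings of equal length, so it cannot see the insertion/deletion case. A rough sketch of one way it might be extended follows. It is only an illustration under stated assumptions: `classify_pair` and its return labels are hypothetical names, and the two strings passed in are assumed to be the already-trimmed regions to compare, so their lengths may be equal or differ by one.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 'perfect', 'substitution' or 'indel' for an
# acceptable pair, and the empty string for a pair that should be discarded.
sub classify_pair {
    my ( $s, $t ) = @_;
    return '' unless defined $s && defined $t && length($s) && length($t);

    # Discard up front if either sequence has anything other than A/C/G/T.
    return '' if ( $s =~ tr/ACGT//c ) || ( $t =~ tr/ACGT//c );

    if ( length($s) == length($t) ) {
        # String XOR yields a NUL byte wherever the two strings agree,
        # so counting the non-NUL bytes gives the number of mismatches.
        my $xor  = $s ^ $t;
        my $diff = ( $xor =~ tr/\0//c );
        return 'perfect'      if $diff == 0;
        return 'substitution' if $diff == 1;    # at most one substitution allowed
        return '';
    }

    # Make $s the shorter string; only a length difference of 1 is accepted.
    ( $s, $t ) = ( $t, $s ) if length($s) > length($t);
    return '' unless length($t) - length($s) == 1;

    # Single insertion/deletion: deleting one base from the longer string
    # must reproduce the shorter string exactly.
    for my $i ( 0 .. length($t) - 1 ) {
        my $trimmed = $t;
        substr( $trimmed, $i, 1, '' );
        return 'indel' if $trimmed eq $s;
    }
    return '';
}

for my $pair ( [ 'ACGTACGTAC', 'ACGTACGTAC' ],
               [ 'ACGTACGTAC', 'ACGTTCGTAC' ],
               [ 'ACGTACGTAC', 'ACGACGTAC' ],
               [ 'ACGTACGTAC', 'ACGTNCGTAC' ] ) {
    my $class = classify_pair(@$pair) || 'rejected';
    print join( "\t", @$pair, $class ), "\n";
}
```

For sequences this short, trying every single-base deletion is cheap. A full edit-distance calculation would be the more general tool if more than one gap ever had to be tolerated.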

In reply to Filtering matches of near-perfect-matched DNA sequence pairs by onlyIDleft
