
Greetings Perl Monks!

I seek your wisdom for a problem of mine: I need to align two short DNA sequences. The starting material is two sequences, each 10 nucleotides long. The sequences should normally consist of As, Ts, Gs, and Cs, but Ns can also be expected where the sequence is ambiguous.

There are a few ways in which the two sequences may align to each other, listed below:

a. 9 out of 10 in both align to each other perfectly

b. 10 out of 10 in both align to each other perfectly

c. 9 in one and 10 in other align to each other - with this imperfect alignment due to insertion/deletion

d. 9 out of 9 in both align to each other, but imperfectly due to substitution - I will allow only one such substitution, for biological reasons

e. 10 out of 10 in both align to each other, but imperfectly due to substitution - again, I will allow only one such substitution, for biological reasons

In all of the cases above, neither sequence may contain anything but A/T/G/C. If a sequence contains other letters, such as N, the pair must be discarded without even performing the match test.
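To make the discard rule concrete, here is a minimal sketch of such a pre-filter. The helper name `is_canonical_dna` is hypothetical, and the sketch assumes the sequences are upper-case, so lower-case letters are rejected as well.

```perl
use strict;
use warnings;

# Hypothetical helper: true only if the sequence is non-empty and made up
# solely of the upper-case letters A, C, G and T.
sub is_canonical_dna {
    my ($seq) = @_;
    return 0 unless defined $seq && length $seq;
    my $copy = $seq;
    # tr with /c and an empty replacement list counts the characters that
    # are NOT in the search list, without changing the string.
    my $non_canonical = ( $copy =~ tr/ACGT//c );
    return $non_canonical == 0 ? 1 : 0;
}
```

A pair would then be tested for a match only if `is_canonical_dna` returns true for both members.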

As a first-pass attempt, I have cobbled together a script, but I know it does not test for all of the cases above. Would you please tell me whether I should use a different approach to test all of the cases listed above, or whether I can adapt what I already have?

Thank you exalted ones!

    if ((defined $upstream_putative_TSD) && (defined $downstream_putative_TSD))
    {
        # Check if the putative TSDs differ by just 1 mismatch or are perfect matches
        my $max_SNP = 1;
        my $diffCount = () = ( $upstream_putative_TSD ^ $downstream_putative_TSD ) =~ /[^\x00]/g;
        # print $upstream_putative_TSD, "\t", $downstream_putative_TSD, "\t", $diffCount, "\n"; # OK thus far

        # syntax idea from https://www.biostars.org/p/83978/
        if ($diffCount <= $max_SNP)
        {
            # check to see whether upstream putative TSD contains anything but A/T/G/C, if yes, how many
            my $upstream_putative_TSD_non_canonical_letter_count = $upstream_putative_TSD =~ tr/BDEFHIJKLMNOPQRSUVWXYZ//;
            # check to see whether downstream putative TSD contains anything but A/T/G/C, if yes, how many
            my $downstream_putative_TSD_non_canonical_letter_count = $downstream_putative_TSD =~ tr/BDEFHIJKLMNOPQRSUVWXYZ//;
            if (($upstream_putative_TSD_non_canonical_letter_count == 0) && ($downstream_putative_TSD_non_canonical_letter_count == 0))
            {
                print $_, "\n";
                push @output, $_, "\n";
            }
        }
    }
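For illustration only, here is a rough sketch of how the substitution and insertion/deletion cases might be told apart. All helper names (`hamming`, `one_deletion_apart`, `classify_pair`) are hypothetical. It assumes each sequence has already been trimmed to the portion that should align (9 or 10 bases) and has already passed the A/T/G/C check. It does not attempt case (a), where the 9 matching bases may sit at an offset.

```perl
use strict;
use warnings;

# Hypothetical helper: number of mismatching positions between two
# equal-length strings, using the same string-XOR idea as above.
sub hamming {
    my ($s, $t) = @_;
    die "lengths differ\n" unless length($s) == length($t);
    my $xor = $s ^ $t;              # identical positions become "\0"
    return ( $xor =~ tr/\0//c );    # count the non-null bytes
}

# Hypothetical helper: true if deleting exactly one base from $long
# (which must be one base longer than $short) yields $short.
sub one_deletion_apart {
    my ($long, $short) = @_;
    return 0 unless length($long) == length($short) + 1;
    for my $i (0 .. length($long) - 1) {
        my $copy = $long;
        substr($copy, $i, 1, '');   # remove the base at position $i
        return 1 if $copy eq $short;
    }
    return 0;
}

# Returns 'perfect', 'substitution' or 'indel'; returns undef when the
# pair does not fit any of these categories.
sub classify_pair {
    my ($s, $t, $max_snp) = @_;
    $max_snp = 1 unless defined $max_snp;
    if (length($s) == length($t)) {
        my $d = hamming($s, $t);
        return 'perfect'      if $d == 0;
        return 'substitution' if $d <= $max_snp;
        return;
    }
    my ($long, $short) = length($s) > length($t) ? ($s, $t) : ($t, $s);
    return one_deletion_apart($long, $short) ? 'indel' : ();
}
```

Under these assumptions, a call such as `classify_pair($upstream_putative_TSD, $downstream_putative_TSD)` would return a defined value only for pairs worth keeping. Whether that matches the intended biology for every case listed above is exactly what I am unsure about.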

In reply to Filtering matches of near-perfect-matched DNA sequence pairs by onlyIDleft
