
That's because you didn't mention the "N" in your original post :)

The point of the transliteration is that every (directed) pair of characters yields a different XOR value, so the XOR byte alone tells you which change occurred. Such a mapping can only be worked out for a predefined set of allowed characters.

To also allow "N", you could (for example) use the transliteration tr/ATCGN/J4XD7/. With this, the XOR values for the respective changes would compute as:

XOR val    change
\x0b  =>  A->A  *   ( "A" ^ "J" )
\x19  =>  A->C      ( "A" ^ "X" )
\x05  =>  A->G      ( "A" ^ "D" )
\x76  =>  A->N      ...
\x75  =>  A->T
\x09  =>  C->A
\x1b  =>  C->C  *
\x07  =>  C->G
\x74  =>  C->N
\x77  =>  C->T
\x0d  =>  G->A
\x1f  =>  G->C
\x03  =>  G->G  *
\x70  =>  G->N
\x73  =>  G->T
\x04  =>  N->A
\x16  =>  N->C
\x0a  =>  N->G
\x79  =>  N->N  *
\x7a  =>  N->T
\x1e  =>  T->A
\x0c  =>  T->C
\x10  =>  T->G
\x63  =>  T->N
\x60  =>  T->T  *
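If you want to check a candidate transliteration yourself, something along these lines should regenerate the table and die if any two changes collide on the same XOR value. This is only a sketch: xor_table() is a made-up helper name, and it assumes the default string behaviour of "^" (i.e. no "use feature 'bitwise'", under which you would need "^." instead).

```perl
use strict;
use warnings;

# Alphabet and replacement characters, taken from tr/ATCGN/J4XD7/.
my @from = split //, 'ATCGN';
my @to   = split //, 'J4XD7';
my %map;
@map{@from} = @to;

# xor_table() is a hypothetical helper: it maps each XOR byte to the
# directed change "X->Y" that produces it, and dies on a collision.
sub xor_table {
    my %table;
    for my $x (sort @from) {
        for my $y (sort @from) {
            my $val = $x ^ $map{$y};    # string XOR of two one-byte strings
            die sprintf("collision at \\x%02x", ord $val)
                if exists $table{$val};
            $table{$val} = "$x->$y";
        }
    }
    return \%table;
}

my $table = xor_table();
for my $val (sort { $table->{$a} cmp $table->{$b} } keys %$table) {
    my $change = $table->{$val};
    # Mark the "no-change" entries (same letter on both sides).
    my $mark = substr($change, 0, 1) eq substr($change, -1) ? '  *' : '';
    printf "\\x%02x  =>  %s%s\n", ord($val), $change, $mark;
}
```

Swapping in a different alphabet or replacement set only means changing the two strings at the top; if the script dies, that replacement set is not usable.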

The ones marked with "*" are the "no-changes", which should make up the excluded character set in the final match. I.e., with the modified transliteration above, you should change the while line to

while ($diff =~ /([^\x0b\x1b\x03\x79\x60])/g) {
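For context, here is roughly how that line might sit in a complete tabulation. This is a sketch under assumptions, not the code from earlier in the thread: count_changes() and %change_for are names invented for illustration, both sequences are assumed to have the same length and to contain only A, T, C, G and N, and "^" is assumed to do string XOR (no "use feature 'bitwise'").

```perl
use strict;
use warnings;

# Build the lookup XOR byte => "X->Y" from the same transliteration.
my @bases = split //, 'ATCGN';
my %trans;
@trans{@bases} = split //, 'J4XD7';

my %change_for;
for my $x (@bases) {
    for my $y (@bases) {
        $change_for{ $x ^ $trans{$y} } = "$x->$y";
    }
}

# count_changes() is a hypothetical helper: it returns a hash ref of
# "X->Y" => count for the mismatching positions of two sequences.
sub count_changes {
    my ($seq1, $seq2) = @_;
    die "sequences differ in length" unless length($seq1) == length($seq2);

    (my $t = $seq2) =~ tr/ATCGN/J4XD7/;    # transliterate a copy of $seq2
    my $diff = $seq1 ^ $t;                 # bytewise string XOR

    my %count;
    # Skip the five "no-change" bytes; any other byte is a mismatch,
    # and the byte itself identifies which change it was.
    while ($diff =~ /([^\x0b\x1b\x03\x79\x60])/g) {
        my $change = $change_for{$1} // 'unknown';
        $count{$change}++;
    }
    return \%count;
}

my $counts = count_changes('ACGTNACGT', 'ACTTAANGT');
printf "%s: %d\n", $_, $counts->{$_} for sort keys %$counts;
```

Characters outside the allowed set (lowercase letters, for instance) end up under "unknown" or may collide with a valid value, so filter or uppercase the input first if that can happen.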

Re^7: mismatching characters in dna sequence
by prbndr (Acolyte) on Dec 30, 2011 at 05:54 UTC
    thank you! the reason i didn't mention it was because i filter out anything with an N prior to this tabulation of conversions. i asked more because i wanted to understand the transliteration process. i hope this is relatively quick for around 10 million sequences!