
In some very rare cases I have a conversion A->N, where the N is just another character. The other code snippets catch this type of conversion, whereas your code calls it a G->A. Why is this the case?

Re^6: mismatching characters in dna sequence
by Eliya (Vicar) on Dec 30, 2011 at 05:44 UTC

    That's because you didn't mention the "N" in your original post :)

    The idea of the transliteration is to make the XOR value distinct for every (directed) pair of characters being compared. Such a mapping can only be worked out for a predefined set of allowed characters.

    To also allow "N", you could (for example) use the transliteration tr/ATCGN/J4XD7/. With this, the XOR values for the respective changes would compute as:

    XOR val    change
    \x0b  =>   A->A  *   ( "A" ^ "J" )
    \x19  =>   A->C      ( "A" ^ "X" )
    \x05  =>   A->G      ( "A" ^ "D" )
    \x76  =>   A->N      ...
    \x75  =>   A->T
    \x09  =>   C->A
    \x1b  =>   C->C  *
    \x07  =>   C->G
    \x74  =>   C->N
    \x77  =>   C->T
    \x0d  =>   G->A
    \x1f  =>   G->C
    \x03  =>   G->G  *
    \x70  =>   G->N
    \x73  =>   G->T
    \x04  =>   N->A
    \x16  =>   N->C
    \x0a  =>   N->G
    \x79  =>   N->N  *
    \x7a  =>   N->T
    \x1e  =>   T->A
    \x0c  =>   T->C
    \x10  =>   T->G
    \x63  =>   T->N
    \x60  =>   T->T  *
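As a cross-check, a table like this can be regenerated with a few lines of Perl. This is only a sketch: the names %tr and %change_for are made up for illustration, and it uses ord() arithmetic instead of string XOR so the result does not depend on how Perl classifies the operands. It dies if two directed changes ever end up with the same XOR value, which is how you would notice that a candidate transliteration does not work.

```perl
use strict;
use warnings;

# Allowed characters and their transliterated counterparts,
# i.e. the mapping expressed by tr/ATCGN/J4XD7/.
my @from = split //, 'ATCGN';
my @to   = split //, 'J4XD7';
my %tr;
@tr{@from} = @to;

# XOR each original base with the transliterated form of each target base.
my %change_for;    # XOR value (as a number) => "X->Y"
for my $x (sort @from) {
    for my $y (sort @from) {
        my $xor = ord($x) ^ ord($tr{$y});
        die "collision: $x->$y vs $change_for{$xor}\n"
            if exists $change_for{$xor};
        $change_for{$xor} = "$x->$y";
        # Mark the "no change" pairs with "*", as in the table.
        printf "\\x%02x => %s%s\n", $xor, "$x->$y", ($x eq $y ? ' *' : '');
    }
}
```

If the script runs to completion, all 25 directed pairs have distinct XOR values for this character set.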

    The ones marked with "*" are the "no-changes", which should make up the exclusion character set in the final match. I.e., with the above modified transliteration, you should change that to

    while ($diff =~ /([^\x0b\x1b\x03\x79\x60])/g) {
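For illustration, here is how that loop header might sit inside a complete routine. The earlier code is not repeated in this reply, so everything around the regex is an assumption: count_changes is a hypothetical helper name, $diff is taken to be the first sequence XORed with the transliterated second one, and the changed characters are looked up by position rather than through a table of XOR values.

```perl
use strict;
use warnings;

# Hypothetical helper: tally directed changes between two sequences
# of equal length over the alphabet A, T, C, G, N.
sub count_changes {
    my ($orig, $new) = @_;
    die "sequences differ in length\n" unless length($orig) == length($new);

    # Transliterate a copy of the second sequence, then XOR bytewise.
    (my $coded = $new) =~ tr/ATCGN/J4XD7/;
    my $diff = $orig ^ $coded;

    my %count;
    # Skip the five "no change" bytes; any other byte marks a change.
    while ($diff =~ /([^\x0b\x1b\x03\x79\x60])/g) {
        my $pos  = pos($diff) - 1;           # offset of the matched byte
        my $from = substr($orig, $pos, 1);
        my $to   = substr($new,  $pos, 1);
        $count{"$from->$to"}++;
    }
    return \%count;
}

my $counts = count_changes('ACGTAAGN', 'ACATANGN');
print "$_: $counts->{$_}\n" for sort keys %$counts;
```

With these two made-up sequences, the G at offset 2 becomes an A and the A at offset 5 becomes an N, so each of those changes is counted once and the unchanged positions (including N->N) are skipped.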

      Thank you! The reason I didn't mention it is that I filter out anything with an N before tabulating the conversions. I asked mainly because I wanted to understand the transliteration process. I hope this is reasonably quick for around 10 million sequences!