Re^6: mismatching characters in dna sequence

That's because you didn't mention the "N" in your original post :)

The idea of the transliteration is that every (directed) pair of characters, original vs. transliterated, XORs to a different value, so each XOR byte identifies exactly one change. Such a mapping can only be worked out for a predefined set of allowed characters.

To also allow "N", you could (for example) use the transliteration tr/ATCGN/J4XD7/. With this, the XOR values for the respective changes would compute as:

       XOR val    change

          \x0b => A->A  *    ( "A" ^ "J" )
          \x19 => A->C       ( "A" ^ "X" )
          \x05 => A->G       ( "A" ^ "D" )
          \x76 => A->N       ...
          \x75 => A->T
          \x09 => C->A
          \x1b => C->C  *
          \x07 => C->G
          \x74 => C->N
          \x77 => C->T
          \x0d => G->A
          \x1f => G->C
          \x03 => G->G  *
          \x70 => G->N
          \x73 => G->T
          \x04 => N->A
          \x16 => N->C
          \x0a => N->G
          \x79 => N->N  *
          \x7a => N->T
          \x1e => T->A
          \x0c => T->C
          \x10 => T->G
          \x63 => T->N
          \x60 => T->T  *
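
If you want to check a mapping like this yourself, here is a minimal sketch that rebuilds the table and dies if two changes collide on the same XOR byte. The helper name xor_table() is made up for illustration, and the sketch assumes plain string XOR (no "bitwise" feature enabled) on single uppercase characters.

```perl
use strict;
use warnings;

# Hypothetical helper: build the XOR table for a transliteration and
# verify that every directed (from, to) pair yields a distinct byte.
sub xor_table {
    my ($from, $to) = @_;              # e.g. 'ATCGN', 'J4XD7'
    my @bases = split //, $from;
    my %code;
    @code{@bases} = split //, $to;     # base => its transliterated character

    my %table;                         # XOR byte => "X->Y"
    for my $p (sort @bases) {
        for my $q (sort @bases) {
            my $x = $p ^ $code{$q};    # string XOR of two 1-char strings
            die sprintf("collision on \\x%02x\n", ord $x)
                if exists $table{$x};
            $table{$x} = "$p->$q";
        }
    }
    return \%table;
}

my $table = xor_table('ATCGN', 'J4XD7');

# Print in the same order as the table above; "*" marks the no-changes.
for my $x (sort { $table->{$a} cmp $table->{$b} } keys %$table) {
    my ($from, $to) = split /->/, $table->{$x};
    printf "\\x%02x => %s%s\n", ord $x, $table->{$x},
        $from eq $to ? '  *' : '';
}
```

Running the same helper with a different pair of character lists is a quick way to try out other transliterations before relying on them.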

The ones marked with "*" are the "no-changes", which should make up the negated character class in the final match. I.e., with the modified transliteration above, you should change that to

    while ($diff =~ /([^\x0b\x1b\x03\x79\x60])/g) {
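
To show how that loop could fit together with the modified transliteration, here is a rough, self-contained sketch. The sub name count_changes(), the variable names and the sample sequences are invented for the example. It assumes both strings have equal length, contain only uppercase A, T, C, G, N, and that the second string is the one being transliterated before the XOR.

```perl
use strict;
use warnings;

# Lookup from XOR byte to change label, built with tr/ATCGN/J4XD7/.
my %change;
for my $p (qw(A C G N T)) {
    for my $q (qw(A C G N T)) {
        (my $coded = $q) =~ tr/ATCGN/J4XD7/;
        $change{ $p ^ $coded } = "$p->$q";   # string XOR, one byte
    }
}

# Hypothetical helper: tally the changes between two sequences.
sub count_changes {
    my ($seq1, $seq2) = @_;
    (my $coded = $seq2) =~ tr/ATCGN/J4XD7/;  # transliterate a copy
    my $diff = $seq1 ^ $coded;               # bytewise string XOR

    my %count;
    # Skip the five "no-change" bytes; every other byte is a mismatch.
    while ($diff =~ /([^\x0b\x1b\x03\x79\x60])/g) {
        $count{ $change{$1} }++;
    }
    return \%count;
}

my $count = count_changes('ACGTNACGT', 'ACCTNATGN');
print "$_: $count->{$_}\n" for sort keys %$count;
# prints:
# C->T: 1
# G->C: 1
# T->N: 1
```

Because each XOR byte maps to exactly one change, the loop never has to look back into the original strings; the matched byte alone is enough to label the mismatch.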

Re^7: mismatching characters in dna sequence by prbndr (Acolyte) on Dec 30, 2011 at 05:54 UTC
Thank you! The reason I didn't mention it was because I filter out anything with an N prior to this tabulation of conversions. I asked more because I wanted to understand the transliteration process. I hope this is relatively quick for around 10 million sequences!