in reply to mismatching characters in dna sequence

Personally I'd drop the A T C G approach altogether. As I understand it, with DNA a "word" (a codon) is always three characters long, so "AACTCGATG" is really three words: "AAC-TCG-ATG". There are 64 (i.e. 4^3) possible words, so each three-letter word can be represented by a single byte in a binary string (with two bits left over).

Establish a mapping along the lines of AAA=chr(0), AAC=chr(1), AAG=chr(2), ..., TTT=chr(63). Then convert your "AACTCGATG" strings to binary strings when they are first input - keep them as binary strings everywhere inside your program and just reformat them back to letters when you need to display them.
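
To make the mapping concrete, here is a minimal sketch in Python. It assumes the bases are ordered A, C, G, T, so that AAA maps to 0 and TTT to 63; the names `CODON_TO_BYTE`, `encode` and `decode` are made up for illustration, and the input is assumed to be a multiple of three letters long.

```python
from itertools import product

BASES = "ACGT"  # assumed ordering: AAA=0, AAC=1, AAG=2, ..., TTT=63

# Hypothetical lookup tables: three-letter word -> byte value, and back.
CODON_TO_BYTE = {"".join(c): i for i, c in enumerate(product(BASES, repeat=3))}
BYTE_TO_CODON = {i: c for c, i in CODON_TO_BYTE.items()}

def encode(dna: str) -> bytes:
    """Pack a DNA string into one byte per three-letter word."""
    if len(dna) % 3:
        raise ValueError("length must be a multiple of three")
    return bytes(CODON_TO_BYTE[dna[i:i + 3]] for i in range(0, len(dna), 3))

def decode(packed: bytes) -> str:
    """Turn a packed binary string back into letters for display."""
    return "".join(BYTE_TO_CODON[b] for b in packed)

packed = encode("AACTCGATG")
print(list(packed), decode(packed))
```

With this ordering the first base of each word lands in bits 5-4 of the byte, the second in bits 3-2 and the third in bits 1-0, which matters if you later want to look at individual letters rather than whole words.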

Your strings will be a third of the size of the original input strings, making string comparisons that much faster. (And yes, I'd use XOR to handle the comparisons - it's likely to be the fastest method.) Of course, if you've got millions of strings that need comparing, the initial import, converting from DNA letters into binary strings, will add some overhead to your program. But because the number of possible pairwise comparisons grows quadratically with the number of strings, while the conversion cost grows only linearly, the overhead is likely to be worth it.
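
A rough sketch of the XOR comparison, in Python. It assumes both strings are already packed one word per byte, with each letter in its own two-bit field (first letter in bits 5-4, last in bits 1-0); the function names are hypothetical. Note that a nonzero byte in the XOR result only tells you that the word differs, so counting mismatched letters needs a look at the two-bit fields.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length packed strings in a single big-integer operation."""
    if len(a) != len(b):
        raise ValueError("packed strings must be the same length")
    x = int.from_bytes(a, "big") ^ int.from_bytes(b, "big")
    return x.to_bytes(len(a), "big")

def mismatched_words(a: bytes, b: bytes) -> list:
    """Indexes of the three-letter words that differ."""
    return [i for i, v in enumerate(xor_bytes(a, b)) if v]

def mismatched_letters(a: bytes, b: bytes) -> int:
    """Count differing letters: one per nonzero two-bit field in the XOR."""
    return sum(
        1
        for v in xor_bytes(a, b)
        for shift in (4, 2, 0)
        if (v >> shift) & 0b11
    )

a = bytes([1, 54, 14])   # "AACTCGATG" under an A, C, G, T ordering
b = bytes([1, 54, 15])   # "AACTCGATT"
print(mismatched_words(a, b), mismatched_letters(a, b))
```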

I am assuming here that you're working on real DNA data, and not fake data made up for a programming assignment. If it's real data, then strings will always be a multiple of three characters, which is kinda important for the techniques I outline above to work. If it is fake data, then it's still doable - you just pad the end to a multiple of three characters, and then use those two spare bits on the last byte of the binary string to indicate how many padding letters there are. This does somewhat complicate implementation though.
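
The padding idea could look something like this in Python. This is only a sketch under assumptions of my own: bases ordered A, C, G, T, two bits per letter, "A" used as the filler letter, and the pad count (0, 1 or 2) stored in the top two bits of the last byte. The names `encode_padded` and `decode_padded` are hypothetical.

```python
BASES = "ACGT"                                  # assumed ordering
CODE = {base: i for i, base in enumerate(BASES)}

def encode_padded(dna: str) -> bytes:
    """Pack three letters per byte; record the pad count in the last byte."""
    pad = (-len(dna)) % 3                       # 0, 1 or 2 filler letters
    padded = dna + "A" * pad
    out = bytearray()
    for i in range(0, len(padded), 3):
        a, b, c = (CODE[ch] for ch in padded[i:i + 3])
        out.append(a << 4 | b << 2 | c)
    if out:
        out[-1] |= pad << 6                     # the two spare bits
    return bytes(out)

def decode_padded(packed: bytes) -> str:
    """Unpack to letters and drop the filler recorded in the last byte."""
    if not packed:
        return ""
    pad = packed[-1] >> 6
    letters = "".join(
        BASES[(v >> 4) & 3] + BASES[(v >> 2) & 3] + BASES[v & 3]
        for v in packed
    )
    return letters[:len(letters) - pad]

print(list(encode_padded("AACT")), decode_padded(encode_padded("AACT")))
```

One consequence of this layout: the pad bits take part in any XOR of the last byte, so two strings of different lengths will differ there even if the letters agree, and the filler letters themselves need masking out before counting mismatches. That is part of the extra complication.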

Re^2: mismatching characters in dna sequence
by BrowserUk (Patriarch) on Dec 31, 2011 at 20:58 UTC
    As I understand it, with DNA a "word" is always three characters long.... I am assuming here that you're working on real DNA data, and not fake data made up for a programming assignment.

    The trouble with that idea is you are assuming that all DNA data is correctly transcribed. But often the very purpose of these programs is to detect errors in the transcriptions.

