Personally I'd drop the A T C G approach altogether. As I understand it, with DNA a "word" is always three characters long. So "AACTCGATG" is really three words "AAC-TCG-ATG". There are 64 (i.e. 4^3) possible words, which fits in six bits, so each three-letter word can be represented by a single byte in a binary string (with two bits left over).
Establish a mapping along the lines of AAA=chr(0), AAC=chr(1), AAG=chr(2), AAT=chr(3), ..., TTT=chr(63). Then convert your "AACTCGATG" strings to binary strings when they are first input - keep them as binary strings everywhere inside your program and just reformat them back to letters when you need to display them.
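Here's a rough sketch of that mapping, written in Python just to keep it short. The names (`BASES`, `CODONS`, `CODE_OF`, `pack`, `unpack`) are made up for illustration, and it assumes the letter order A, C, G, T and input whose length is already a multiple of three.

```python
from itertools import product

BASES = "ACGT"  # assumed ordering: A=0, C=1, G=2, T=3

# 64-entry tables: "AAA" -> 0, "AAC" -> 1, ..., "TTT" -> 63
CODONS = ["".join(p) for p in product(BASES, repeat=3)]
CODE_OF = {codon: i for i, codon in enumerate(CODONS)}


def pack(dna):
    """Convert a DNA string into a binary string, one byte per three letters."""
    if len(dna) % 3:
        raise ValueError("length must be a multiple of three")
    return bytes(CODE_OF[dna[i:i + 3]] for i in range(0, len(dna), 3))


def unpack(packed):
    """Turn the binary string back into letters for display."""
    return "".join(CODONS[b] for b in packed)
```

With this, `pack("AACTCGATG")` is three bytes long and `unpack` gives the original letters back.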
Your strings will be a third of the size of the original input strings, making string comparisons that much faster. (And yes, I'd use XOR to handle the comparisons - it's likely to be the fastest method. A non-zero byte in the XOR result marks a three-letter word that differs.) Of course, if you've got millions of strings that need comparing, the initial import, converting from DNA letters into binary strings, will add some overhead to your program, but because the number of possible pairwise comparisons grows quadratically with the number of strings (n strings give n*(n-1)/2 pairs), the overhead is likely to be worth it.
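To make the XOR idea concrete, here's a sketch, again in Python and with made-up names (`pack`, `MISMATCHES`, `count_mismatches`). It assumes each byte holds three letters as 2-bit fields in the order A=0, C=1, G=2, T=3, so that a non-zero 2-bit field in the XOR of two bytes means that letter differs. A 64-entry lookup table then turns each XORed byte into a count of mismatching letters.

```python
BASES = "ACGT"  # assumed ordering: A=0, C=1, G=2, T=3


def pack(dna):
    """One byte per three letters; each letter is a 2-bit field."""
    return bytes(
        BASES.index(dna[i]) * 16 + BASES.index(dna[i + 1]) * 4 + BASES.index(dna[i + 2])
        for i in range(0, len(dna), 3)
    )


# MISMATCHES[x] = number of non-zero 2-bit fields in x, for x in 0..63
MISMATCHES = [sum(1 for shift in (0, 2, 4) if (x >> shift) & 3) for x in range(64)]


def count_mismatches(a, b):
    """Count differing letters between two packed strings of equal length."""
    if len(a) != len(b):
        raise ValueError("packed strings must be the same length")
    return sum(MISMATCHES[x ^ y] for x, y in zip(a, b))


print(count_mismatches(pack("AACTCGATG"), pack("AAGTCGTTC")))  # prints 3
```

This is only meant to show the technique; how fast it is in practice depends on the language and on how the XOR is done over the whole string.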
I am assuming here that you're working on real DNA data, and not fake data made up for a programming assignment. If it's real coding-sequence data, then as far as I know the strings should always be a multiple of three characters, which is kinda important for the techniques I outline above to work. If it is fake data (or anything else that isn't a multiple of three), then it's still doable - you just pad the end to a multiple of three characters, and then use those two spare bits on the last byte of the binary string to indicate how many padding letters (0, 1 or 2) there are. This does somewhat complicate implementation though.
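A sketch of the padding trick, in Python with hypothetical names (`pack_padded`, `unpack_padded`). It assumes the order A=0, C=1, G=2, T=3, pads with "A", and stores the padding count in the top two bits of the last byte; any other padding letter or bit position would work just as well.

```python
BASES = "ACGT"  # assumed ordering: A=0, C=1, G=2, T=3


def pack_padded(dna):
    """Pack three letters per byte; the top two bits of the last byte hold the pad count."""
    pad = -len(dna) % 3            # 0, 1 or 2 letters needed to reach a multiple of three
    padded = dna + "A" * pad       # the padding letter is arbitrary
    out = bytearray(
        BASES.index(padded[i]) * 16 + BASES.index(padded[i + 1]) * 4 + BASES.index(padded[i + 2])
        for i in range(0, len(padded), 3)
    )
    if pad:
        out[-1] |= pad << 6        # use the two spare bits of the last byte
    return bytes(out)


def unpack_padded(packed):
    """Reverse pack_padded, dropping the padding letters again."""
    if not packed:
        return ""
    pad = packed[-1] >> 6
    letters = []
    for b in packed:
        code = b & 0x3F            # ignore the two spare bits
        letters.append(BASES[code >> 4] + BASES[(code >> 2) & 3] + BASES[code & 3])
    dna = "".join(letters)
    return dna[:len(dna) - pad]
```

One complication this shows: before XOR-comparing two packed strings, the two flag bits on the last byte would need masking off (`& 0x3F`), otherwise a difference in padding count looks like a mismatch.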
In reply to Re: mismatching characters in dna sequence
by tobyink
in thread mismatching characters in dna sequence
by prbndr