Personally I'd drop the A T C G approach altogether. As I understand it, with DNA a "word" (a codon) is always three characters long. So "AACTCGATG" is really three words "AAC-TCG-ATG". There are 64 (i.e. 4^3) possible words, so each three-letter word needs only six bits and can be represented by a single byte in a binary string (with two bits left over).

Establish a mapping along the lines of AAA=chr(0), AAC=chr(1), AAG=chr(2), ..., TTT=chr(63). Then convert your "AACTCGATG" strings to binary strings when they are first input - keep them as binary strings everywhere inside your program and just reformat them back to letters when you need to display them.
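A minimal sketch of that mapping in Perl, assuming the letter order A, C, G, T (so the last word is TTT) and uppercase input whose length is a multiple of three. The names `encode_dna` and `decode_dna` are made up for illustration.

```perl
use strict;
use warnings;

# Assumed letter order A, C, G, T: AAA => chr(0), AAC => chr(1), ... TTT => chr(63).
my @LETTERS = qw(A C G T);
my (%CODON_TO_BYTE, @BYTE_TO_CODON);
for my $i (0 .. 63) {
    # Each letter occupies two bits of the six-bit value.
    my $codon = join '', map { $LETTERS[ ($i >> $_) & 3 ] } (4, 2, 0);
    $CODON_TO_BYTE{$codon} = chr $i;
    $BYTE_TO_CODON[$i]     = $codon;
}

# Hypothetical helper: letters in, one byte per three-letter word out.
sub encode_dna {
    my ($dna) = @_;
    die "length must be a multiple of three\n" if length($dna) % 3;
    die "unexpected character in sequence\n"   if $dna =~ /[^ACGT]/;
    return join '', map { $CODON_TO_BYTE{$_} } $dna =~ /(...)/g;
}

# Hypothetical helper: binary string back to letters for display.
sub decode_dna {
    my ($packed) = @_;
    return join '', map { $BYTE_TO_CODON[ ord $_ ] } split //, $packed;
}

my $packed = encode_dna('AACTCGATG');
print length($packed), " bytes: ", unpack('H*', $packed), "\n";   # 3 bytes: 01360e
print decode_dna($packed), "\n";                                  # AACTCGATG
```

Building both lookup tables once up front keeps the per-string work down to a regex split and a hash lookup per word.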

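Your strings will be a third of the size of the original input strings, so each comparison has a third as many bytes to get through. (And yes, I'd use XOR to handle the comparisons - it's likely to be the fastest method. Bear in mind that a non-zero byte in the XOR result marks a mismatching word; with a mapping like the one above, counting mismatching letters means checking which two-bit pairs inside that byte are non-zero.) Of course, if you've got millions of strings that need comparing, the initial import, converting from DNA letters into binary strings, will add some overhead to your program. But each string is converted only once, while the number of pairwise comparisons grows quadratically with the number of strings (n strings give n(n-1)/2 pairs), so the overhead is likely to be worth it.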
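Here is a rough sketch of the XOR comparison on packed strings, assuming both strings have the same length and the two-bits-per-letter layout (A=0, C=1, G=2, T=3). The helper names are invented for the example. It relies on Perl's string XOR, so if you enable the `bitwise` feature (e.g. via `use v5.28`) you would need `^.` instead of `^`.

```perl
use strict;
use warnings;

my %VAL = (A => 0, C => 1, G => 2, T => 3);

# Hypothetical helper: one byte per three letters (AAA => 0 ... TTT => 63).
sub pack_codons {
    my ($dna) = @_;
    return join '', map {
        my ($x, $y, $z) = @VAL{ split //, $_ };
        chr( $x * 16 + $y * 4 + $z );
    } $dna =~ /(...)/g;
}

# For every possible XOR byte, how many of its three 2-bit fields are non-zero,
# i.e. how many letters differ within that word.
my @LETTER_DIFFS = map {
    my $byte = $_;
    scalar( grep { ($byte >> $_) & 3 } 4, 2, 0 );
} 0 .. 63;

# Number of three-letter words that differ: count the non-NUL bytes in the XOR.
sub codon_mismatches {
    my ($p, $q) = @_;
    die "strings differ in length\n" unless length($p) == length($q);
    my $xor = $p ^ $q;
    return ( $xor =~ tr/\0//c );
}

# Number of individual letters that differ, via the lookup table.
sub letter_mismatches {
    my ($p, $q) = @_;
    die "strings differ in length\n" unless length($p) == length($q);
    my $total = 0;
    $total += $LETTER_DIFFS[ ord $_ ] for split //, $p ^ $q;
    return $total;
}

my $p = pack_codons('ACGTTT');
my $q = pack_codons('TCGAAA');
printf "%d words differ, %d letters differ\n",
    codon_mismatches($p, $q), letter_mismatches($p, $q);   # 2 words, 4 letters
```

If you only care whether two strings match at all, the `tr` count on the XOR result is enough and the lookup table can be skipped.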

I am assuming here that you're working on real DNA data, and not fake data made up for a programming assignment. If it's real coding-sequence data, then as I understand it the strings should be a multiple of three characters long, which is kinda important for the techniques I outline above to work. If it is fake data (or the lengths otherwise don't divide by three), then it's still doable - you just pad the end to a multiple of three characters, and then use those two spare bits on the last byte of the binary string to indicate how many padding letters there are. This does somewhat complicate implementation though.
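One way the padding scheme could look, as a sketch only: pad with "A" (an arbitrary choice on my part) and store the pad count, 0 to 2, in the top two bits of the last byte. `encode_padded` and `decode_padded` are hypothetical names.

```perl
use strict;
use warnings;

my @LETTERS = qw(A C G T);
my %VAL;
@VAL{@LETTERS} = 0 .. 3;

# Hypothetical helper: pad to a multiple of three, pack one word per byte,
# and record the pad count in the two spare (top) bits of the last byte.
sub encode_padded {
    my ($dna) = @_;
    my $pad    = (3 - length($dna) % 3) % 3;
    my $padded = $dna . ('A' x $pad);        # 'A' as filler is an assumption
    my @bytes  = map {
        my ($x, $y, $z) = @VAL{ split //, $_ };
        $x * 16 + $y * 4 + $z;
    } $padded =~ /(...)/g;
    $bytes[-1] |= $pad << 6 if @bytes;
    return pack 'C*', @bytes;
}

# Hypothetical helper: read the pad count back, mask it off, and trim the filler.
sub decode_padded {
    my ($packed) = @_;
    return '' unless length $packed;
    my @bytes = unpack 'C*', $packed;
    my $pad   = $bytes[-1] >> 6;
    my $dna   = join '', map {
        my $v = $_ & 63;                     # drop the two flag bits
        join '', @LETTERS[ $v >> 4, ($v >> 2) & 3, $v & 3 ];
    } @bytes;
    return substr($dna, 0, length($dna) - $pad);
}

my $packed = encode_padded('AACTCGAT');      # 8 letters, so one filler letter
print unpack('H*', $packed), "\n";           # 01364c
print decode_padded($packed), "\n";          # AACTCGAT
```

The extra complication shows up in comparisons: the flag bits take part in any XOR, so the last byte would need masking (or separate handling) before counting mismatches.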

In reply to Re: mismatching characters in dna sequence by tobyink
in thread mismatching characters in dna sequence by prbndr