Well, at first glance, you want to encode A C G T. Four possible characters, that's nice: each one fits in two bits. You could therefore stuff 3 characters into 6 bits, which gives 64 distinct values, one per triplet.
AAA => 0
AAC => 1
AAG => 2
AAT => 3
ACA => 4
...
TTT => 63
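The table above is just a three-digit base-4 number, with A, C, G, T standing for the digits 0 to 3. Here is a minimal sketch of that arithmetic (the names %bits and triplet_index are mine, purely for illustration, and are not used in the scripts below):

```perl
use strict;
use warnings;

# Illustrative lookup: each base gets a 2-bit value.
my %bits = ( A => 0, C => 1, G => 2, T => 3 );

# Treat a triplet as a 3-digit base-4 number: 16*x + 4*y + z.
sub triplet_index {
    my ($triplet) = @_;
    my $n = 0;
    $n = $n * 4 + $bits{$_} for split //, $triplet;
    return $n;
}

printf "%s => %d\n", $_, triplet_index($_) for qw/AAA AAC ACA TTT/;
```

The nested loops in the script build the same numbering, just by counting rather than by multiplying.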
Add those numbers to a reasonable offset, and we can coax nice ASCII characters out of them with the help of chr and recover them with ord.
#! /usr/bin/perl -w
use strict;

# build our encoder and decoder
my $offset = 0;
my %encode;
my %decode;
for my $x( qw/A C G T/ ) {
    for my $y( qw/A C G T/ ) {
        for my $z( qw/A C G T/ ) {
            $encode{"$x$y$z"} = $offset;
            $decode{$offset} = "$x$y$z";
            $offset++;
        }
    }
}

# encode the string
my $in = shift || 'AAAACTGACCGTTTT';
my $enc = '';
while( $in =~ /(...)/g ) {
    $enc .= chr( 48 + $encode{$1} );
}
print "encoded: $enc\n";

# and back again
my $dec = '';
for( split //, $enc ) {
    $dec .= $decode{ ord($_) - 48 };
}
print "decoded: $dec\n";
That almost works, as a proof of concept, except that it can produce some nasty encoded characters (at least as far as URLs are concerned) such as ? ; = and \. These are going to do Weird Things. In any event, it encodes the sample string AAAACTGACCGTTTT to 07QKo. You'll need to build up the encode and decode hashes from 0-9, a-z, A-Z and - and _ or some other safe characters (you want 64, remember?)
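To see exactly which characters are the troublemakers, here is a quick sketch that walks the output range of the script above, chr(48) through chr(111), and keeps whatever falls outside a conservative safe set. Treating only letters, digits, - and _ as "safe" is my assumption here; adjust the character class to taste.

```perl
use strict;
use warnings;

# The first version emits chr(48 + n) for n in 0..63.
my @emitted = map { chr($_) } 48 .. 111;

# Assumption: "safe" means letters, digits, '-' and '_' only.
my @unsafe = grep { !/^[A-Za-z0-9_-]$/ } @emitted;

print scalar(@unsafe), " unsafe characters: @unsafe\n";
```

That comes to a dozen characters you would rather not see in a URL, which is reason enough to pick the alphabet by hand instead of relying on an offset.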
Secondly, it has a bad feature: the length of the string must be an exact multiple of three, otherwise the leftover characters are silently dropped. What you would have to do is pad out the string with Cs or Ts or whatever (two extra chars at most will be needed), and add a clip parameter equal to 0, 1 or 2 to tell how many chars to lop off the decoded string.
This may or may not be of help, or use, but it was fun to write :)
update: This was so much fun I couldn't stop thinking about it over dinner, and so here is a new and improved version. It offers URL-friendly encoding and it deals correctly with sequences that are not multiples of three (in the way I describe above). In re-reading the initial question, I see you want to use additional characters as well. In that case this scheme starts to fall apart. But you should be able to encode your markup in another parameter, and reference the parameter that holds this sequence in a templatish sort of way.
#! /usr/bin/perl -w
use strict;

# build our encoder and decoder
my @char = ( 0..9, 'a'..'z', 'A'..'Z', qw/- _/ );
my %encode;
my %decode;
for my $x( qw/A C G T/ ) {
    for my $y( qw/A C G T/ ) {
        for my $z( qw/A C G T/ ) {
            my $ch = shift @char;
            $encode{"$x$y$z"} = $ch;
            $decode{$ch} = "$x$y$z";
        }
    }
}

# encode the string
my $in = shift || 'AAAACTGACCGTTTT';
my $enc = '';
my $chunk;
my $chunk_len;
while( defined( $chunk = substr( $in, 0, 3, '' ))) {
    $chunk_len = length( $chunk );
    last if $chunk_len < 3;
    $enc .= $encode{$chunk};
}

# deal with leftover chunk
if( $chunk_len ) {
    my $pad = 3 - $chunk_len;
    $enc = $pad . $enc . $encode{$chunk . 'A' x $pad};
}
else {
    $enc = "0$enc";
}
print "encoded: $enc\n";

# and back again
my $dec = '';
my $pad = substr( $enc, 0, 1, '' );
$dec .= $decode{$_} for split //, $enc;
$dec = substr( $dec, 0, -$pad ) if $pad;
print "decoded: $dec\n";
As an added bonus, there's no more need to deal with ord and chr (of course, that would have been possible in the initial version too).
All those uses of the magic number 3 should probably be abstracted to a constant, but I'd better stop now.
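For what it's worth, here is a rough sketch of what pulling the 3 out into a constant might look like. The names CHUNK, encode_dna and decode_dna are made up for this example, the lookup table is built by counting in base 4 instead of with nested loops, and the input check is an addition of mine. Treat it as one possible arrangement, not a drop-in replacement for the script above.

```perl
use strict;
use warnings;
use constant CHUNK => 3;    # bases per encoded character

my @bases = qw/A C G T/;
my @char  = ( 0 .. 9, 'a' .. 'z', 'A' .. 'Z', qw/- _/ );
die "alphabet too small for this CHUNK\n" if 4**CHUNK > @char;

# Build both tables by writing each number as CHUNK base-4 digits.
my ( %encode, %decode );
for my $n ( 0 .. 4**CHUNK - 1 ) {
    my ( $key, $m ) = ( '', $n );
    for ( 1 .. CHUNK ) {
        $key = $bases[ $m % 4 ] . $key;
        $m   = int( $m / 4 );
    }
    $encode{$key}        = $char[$n];
    $decode{ $char[$n] } = $key;
}

sub encode_dna {
    my ($in) = @_;
    die "not a DNA string\n" if $in =~ /[^ACGT]/;
    # how many filler bases are needed to reach a multiple of CHUNK
    my $pad = ( CHUNK - length($in) % CHUNK ) % CHUNK;
    $in .= 'A' x $pad;
    my $enc = $pad;         # leading digit tells the decoder how much to clip
    for ( my $i = 0 ; $i < length($in) ; $i += CHUNK ) {
        $enc .= $encode{ substr( $in, $i, CHUNK ) };
    }
    return $enc;
}

sub decode_dna {
    my ($enc) = @_;
    my $pad = substr( $enc, 0, 1, '' );
    my $dec = join '', map { $decode{$_} } split //, $enc;
    $dec = substr( $dec, 0, -$pad ) if $pad;
    return $dec;
}

my $enc = encode_dna('AAAACTGACCGTTTT');
print "encoded: $enc\n";
print "decoded: ", decode_dna($enc), "\n";
```

Note that the single leading pad digit only works while CHUNK is less than 11, and the 64-character alphabet only covers CHUNK values up to 3, so the constant buys readability more than real flexibility.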
print@_{sort keys %_},$/if%_=split//,'= & *a?b:e\f/h^h!j+n,o@o;r$s-t%t#u'