Well, at first glance, you want to encode A C G T. Four possible characters, that's nice: you can encode each one in two bits. You could therefore stuff 3 characters into 6 bits, which gives 64 distinct triplets.

    AAA => 0
    AAC => 1
    AAG => 2
    AAT => 3
    ACA => 4
    ...
    TTT => 63
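The table is just base-4 arithmetic, with each base contributing two bits. Here is a minimal sketch of that mapping; `%value` and `triplet_index` are names made up for illustration and do not appear in the scripts in this post.

```perl
use strict;
use warnings;

# Two bits per base (hypothetical lookup table, same order as the table above).
my %value = ( A => 0, C => 1, G => 2, T => 3 );

# Pack three bases into a 6-bit number in the range 0..63.
sub triplet_index {
    my ($triplet) = @_;
    my $n = 0;
    $n = ( $n << 2 ) | $value{$_} for split //, $triplet;
    return $n;
}

printf "%s => %d\n", $_, triplet_index($_) for qw/AAA AAC ACA ACT TTT/;
```

The scripts in this post build the same numbering with three nested loops and a counter, which is arguably easier to read than the bit shifting.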

Add those numbers to a reasonable offset, and we can coax nice ASCII characters out of them with the help of chr and recover them with ord.

    #! /usr/bin/perl -w

    use strict;

    # build our encoder and decoder
    my $offset = 0;
    my %encode;
    my %decode;
    for my $x( qw/A C G T/ ) {
        for my $y( qw/A C G T/ ) {
            for my $z( qw/A C G T/ ) {
                $encode{"$x$y$z"} = $offset;
                $decode{$offset} = "$x$y$z";
                $offset++;
            }
        }
    }

    # encode the string
    my $in = shift || 'AAAACTGACCGTTTT';
    my $enc = '';
    while( $in =~ /(...)/g ) {
        $enc .= chr( 48 + $encode{$1} );
    }
    print "encoded: $enc\n";

    # and back again
    my $dec = '';
    for( split //, $enc ) {
        $dec .= $decode{ ord($_) - 48 };
    }
    print "decoded: $dec\n";

That almost works as a proof of concept, except that the encoded output can contain some nasty characters (at least as far as URLs are concerned) such as ? ; = and \. These are going to do Weird Things. In any event, it encodes the sample string AAAACTGACCGTTTT to 07QKo. You'll need to build up the encode and decode hashes from 0-9, a-z, A-Z and - and _ or some other safe characters (you want 64, remember?)
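To see exactly which characters an offset of 48 can produce, and to check that the suggested alphabet really has 64 members, something like this will do. The "safe" set is the one proposed above (digits, letters, - and _). Treating everything else as suspect is a simplifying assumption of this sketch, not a statement of what the URL specification allows.

```perl
use strict;
use warnings;

# Characters the first version can emit: chr(48) through chr(48 + 63).
my @emitted = map { chr($_) } 48 .. 111;

# Candidate URL-friendly alphabet: 10 digits + 26 + 26 letters + 2 extras.
my @safe    = ( 0 .. 9, 'a' .. 'z', 'A' .. 'Z', qw/- _/ );
my %is_safe = map { $_ => 1 } @safe;

# Anything emitted that is not in the safe set (assumption: "not safe" == suspect).
my @suspect = grep { !$is_safe{$_} } @emitted;

print "alphabet size: ", scalar @safe, "\n";
print "suspect characters: @suspect\n";
```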

Secondly, it has a misfeature: the length of the string must be an exact multiple of three, and any leftover characters are silently dropped. What you would have to do is pad out the string with Cs or Ts or whatever (at most two extra chars will be needed), and add a clip parameter equal to 0, 1 or 2 to tell how many chars to lop off the decoded string.
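The padding idea can be sketched on its own, before any encoding happens. `pad_sequence` and `clip_sequence` are hypothetical helper names, and padding with A is an arbitrary choice.

```perl
use strict;
use warnings;

# Pad a sequence to a multiple of three. Returns the padded string and
# the number of padding characters added (the "clip" value: 0, 1 or 2).
sub pad_sequence {
    my ($seq) = @_;
    my $clip = ( 3 - length($seq) % 3 ) % 3;
    return ( $seq . 'A' x $clip, $clip );
}

# Undo the padding after decoding.
sub clip_sequence {
    my ( $seq, $clip ) = @_;
    return $clip ? substr( $seq, 0, -$clip ) : $seq;
}

my ( $padded, $clip ) = pad_sequence('ACGTA');
print "padded: $padded, clip: $clip\n";
print "restored: ", clip_sequence( $padded, $clip ), "\n";
```

The clip value has to travel with the encoded string, either as a separate URL parameter or as an extra leading character.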

This may or may not be of help, or use, but it was fun to write :)

Update: This was so much fun I couldn't stop thinking about it over dinner, and so here is a new and improved version. It offers URL-friendly encoding and it deals correctly with sequences whose length is not a multiple of three (in the way I describe above). In re-reading the initial question, I see you want to use additional characters as well. In that case this scheme starts to fall apart. But you should be able to encode your markup in another parameter, and reference the parameter that holds this sequence in a templatish sort of way.

    #! /usr/bin/perl -w

    use strict;

    # build our encoder and decoder
    my @char = ( 0..9, 'a'..'z', 'A'..'Z', qw/- _/ );
    my %encode;
    my %decode;
    for my $x( qw/A C G T/ ) {
        for my $y( qw/A C G T/ ) {
            for my $z( qw/A C G T/ ) {
                my $ch = shift @char;
                $encode{"$x$y$z"} = $ch;
                $decode{$ch} = "$x$y$z";
            }
        }
    }

    # encode the string
    my $in = shift || 'AAAACTGACCGTTTT';
    my $enc = '';
    my $chunk;
    my $chunk_len;
    while( defined( $chunk = substr( $in, 0, 3, '' ))) {
        $chunk_len = length( $chunk );
        last if $chunk_len < 3;
        $enc .= $encode{$chunk};
    }

    # deal with leftover chunk
    if( $chunk_len ) {
        my $pad = 3 - $chunk_len;
        $enc = $pad . $enc . $encode{$chunk . 'A' x $pad};
    }
    else {
        $enc = "0$enc";
    }
    print "encoded: $enc\n";

    # and back again
    my $dec = '';
    my $pad = substr( $enc, 0, 1, '' );
    $dec .= $decode{$_} for split //, $enc;
    $dec = substr( $dec, 0, -$pad ) if $pad;
    print "decoded: $dec\n";

As an added bonus, there's no more need to deal with ord and chr (of course, that would have been possible in the initial version too).

All those uses of the magic number 3 should probably be abstracted to a constant, but I'd better stop now.
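One way that abstraction might look, as a sketch and not a drop-in replacement: the chunk size becomes a constant and the encoder and decoder become subs. `CHUNK`, `@combo`, `encode_seq` and `decode_seq` are names invented here. The sketch assumes the 64-character alphabet stays as it is, so it only works while 4 ** CHUNK is no larger than 64.

```perl
use strict;
use warnings;

use constant CHUNK => 3;    # bases per encoded character

my @char = ( 0 .. 9, 'a' .. 'z', 'A' .. 'Z', qw/- _/ );

# Build every CHUNK-letter combination, in the same order as nested loops would.
my @combo = ('');
for ( 1 .. CHUNK ) {
    @combo = map { my $prefix = $_; map { $prefix . $_ } qw/A C G T/ } @combo;
}
die "alphabet too small\n" if @combo > @char;

my ( %encode, %decode );
@encode{@combo} = @char[ 0 .. $#combo ];
%decode = reverse %encode;

# Returns the pad count (one digit) followed by one character per chunk.
sub encode_seq {
    my ($in) = @_;
    my $pad = ( CHUNK - length($in) % CHUNK ) % CHUNK;
    $in .= 'A' x $pad;
    my @chunks = unpack '(A' . CHUNK . ')*', $in;
    return $pad . join '', map { $encode{$_} } @chunks;
}

# Strips the leading pad digit, decodes, then clips the padding off.
sub decode_seq {
    my ($enc) = @_;
    my $pad = substr( $enc, 0, 1, '' );
    my $dec = join '', map { $decode{$_} } split //, $enc;
    return $pad ? substr( $dec, 0, -$pad ) : $dec;
}

my $enc = encode_seq('AAAACTGACCGTTTT');
print "encoded: $enc\n";
print "decoded: ", decode_seq($enc), "\n";
```

Padding up front instead of special-casing the leftover chunk is a design choice of this sketch; the output format (pad digit first, then the encoded characters) is meant to match the script above.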



In reply to Re: compress data to pass in url? (here is a simple encoder and decoder) by grinder
in thread compress data to pass in url? by glwtta
