Re: A better (problem-specific) perl hash?

I'm sure that someone must have already had this insight before, but since you can encode each of a, c, g and t in two bits, you can pack four of them into a byte. To save time, you just have to create a couple of lookup tables to encode and decode at will.
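To make the arithmetic concrete, here is a minimal sketch of the two-bit idea on its own, without lookup tables. The mapping a=0, c=1, g=2, t=3 and the helper name pack_four are my assumptions for illustration; any fixed assignment of the four bases to two-bit values would do.

```perl
use strict;
use warnings;

# Assumed mapping: a=00, c=01, g=10, t=11.
my %bits = (a => 0, c => 1, g => 2, t => 3);

# Hypothetical helper: pack four bases into one integer in 0..255,
# with the first base in the two most significant bits.
sub pack_four {
    my ($chunk) = @_;
    my $value = 0;
    $value = ($value << 2) | $bits{$_} for split //, lc $chunk;
    return $value;
}

# Show the bit pattern and decimal value for a few chunks.
printf "%s => %08b (%d)\n", $_, pack_four($_), pack_four($_)
    for qw(aaaa acgt tttt);
```

This prints 0 for 'aaaa', 27 for 'acgt' and 255 for 'tttt', so every possible four-base chunk lands on exactly one byte value.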

For instance, with the following code, the sequence 'ctaacgccctagcgta' encodes to four bytes whose ASCII representation is, surprise, 'perl'. Going back the other way, the encoded bytes whose ASCII representation is 'japh' come out as 'cgggcgacctaacgga'.

For this to work, you need sequences whose lengths are exact multiples of four. Ensuring that this is the case is left as an exercise for the reader.

#! /usr/bin/perl -w

use strict;

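my %lookup;    # four-base chunk => single byte
my %invert;    # single byte     => four-base chunk

my @base = qw(a c g t);

# Enumerate all 256 four-base chunks in base-4 order (a=0, c=1, g=2, t=3),
# so that chunk number $count is stored as chr($count).
my $count = 0;
for my $x1 (@base) {
for my $x2 (@base) {
for my $x3 (@base) {
for my $x4 (@base) {
    my $key   = chr($count++);
    my $chunk = "$x1$x2$x3$x4";
    $invert{$key} = $chunk;
    $lookup{$chunk} = $key;
} } } }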

for my $seq (@ARGV) {
    print "squeeze   => ", squeeze($seq), "\n";
    print "unsqueeze => ", unsqueeze($seq), "\n";
}

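# Pack a sequence into one byte per four bases. Trailing bases that do not
# fill a complete group of four are silently dropped, and a group containing
# anything other than a, c, g or t has no entry in %lookup.
sub squeeze {
    my $seq = shift;
    my $out = '';
    $out .= $lookup{lc $1} while ($seq =~ /(....)/g);
    return $out;
}

# Expand each byte back into its four bases.
sub unsqueeze {
    my $seq = shift;
    my $out = '';
    $out .= $invert{$_} for split //, $seq;
    return $out;
}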
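One possible way to handle the multiple-of-four requirement is to pad the sequence and remember its original length. This is only a sketch: the helper names pad_to_four and unpad are made up, and padding with 'a' is an arbitrary choice that only works because the original length is kept alongside the data.

```perl
use strict;
use warnings;

# Hypothetical helper: pad a sequence with 'a' so its length is a multiple
# of four, and return the original length so the padding can be removed
# again after decoding.
sub pad_to_four {
    my ($seq) = @_;
    my $len = length $seq;
    my $pad = (4 - $len % 4) % 4;    # 0..3 extra bases needed
    return ($seq . ('a' x $pad), $len);
}

# Hypothetical helper: cut a decoded sequence back to its original length.
sub unpad {
    my ($seq, $len) = @_;
    return substr $seq, 0, $len;
}

my ($padded, $len) = pad_to_four('ctaacg');
print "$padded ($len)\n";            # ctaacgaa (6)
print unpad($padded, $len), "\n";    # ctaacg
```

The padded string can go through squeeze as is; after unsqueeze, unpad trims it back. The length has to be stored somewhere next to the packed bytes, which costs a little of the space you saved.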

I suspect that this is going to be about as fast and efficient as you can get. Just take care when you transfer the encoded data to a UTF-8 system, since any byte above 0x7F is not valid UTF-8 on its own :)
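If the packed bytes have to pass through something that expects text, one option is to hex-encode them for the trip with pack and unpack. A rough sketch, using four arbitrary bytes as a stand-in for the output of squeeze:

```perl
use strict;
use warnings;

# Stand-in for packed sequence data: four arbitrary bytes, two above 0x7F.
my $packed = pack 'C*', 0x70, 0xE4, 0x1B, 0xFF;

# Hex-encode for a text-only channel (this doubles the size).
my $hex = unpack 'H*', $packed;
print "$hex\n";    # 70e41bff

# Decode on the other side and check the round trip.
my $back = pack 'H*', $hex;
print $back eq $packed ? "round trip ok\n" : "mismatch\n";
```

That gives away half of the space saving, so it only makes sense for transport. When writing the packed bytes straight to a file, calling binmode on the handle first keeps Perl from applying any encoding layer to them.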

• another intruder with the mooring in the heart of the Perl
