I'm sure that someone must have already had this insight before, but since each of a, c, g, t can be encoded in two bits, you can pack four of them into a byte. To save time, you just have to create a couple of lookup tables to encode and decode at will.

For instance, with the following code, the sequence 'ctaacgccctagcgta' encodes to four bytes whose ASCII representation is, surprise, 'perl'. Going back the other way, the encoded bytes whose ASCII representation is 'japh' decode to 'cgggcgacctaacgga'.
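To see where those bytes come from, here is a small sketch of the arithmetic. It assumes the digit values a=0, c=1, g=2, t=3 (the order of the base list in the code) and treats each group of four bases as a four-digit base-4 number. The helper name chunk_to_byte is made up for this illustration and is not part of the posted code.

```perl
use strict;
use warnings;

# Assumed digit values, matching the order qw(a c g t).
my %digit = (a => 0, c => 1, g => 2, t => 3);

# Hypothetical helper: read four bases as a four-digit base-4 number,
# first base most significant.
sub chunk_to_byte {
    my $chunk = lc shift;
    my $value = 0;
    $value = $value * 4 + $digit{$_} for split //, $chunk;
    return $value;
}

for my $chunk (qw(ctaa cgcc ctag cgta)) {
    my $value = chunk_to_byte($chunk);
    printf "%s => %3d => %s\n", $chunk, $value, chr $value;
}
# ctaa => 112 => p
# cgcc => 101 => e
# ctag => 114 => r
# cgta => 108 => l
```

For example, 'ctaa' is 1*64 + 3*16 + 0*4 + 0 = 112, which is the ASCII code for 'p'.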

For this to work, you need sequences whose lengths are exact multiples of four; otherwise squeeze silently drops the trailing one to three bases. Ensuring that the length is right is left as an exercise to the reader.
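One possible way to do that exercise, as a rough sketch only: pad the sequence with 'a' up to the next multiple of four and keep the original length somewhere, so the padding can be cut off again after decoding. The helper names pad_to_four and trim_to_length are hypothetical, and storing the length alongside the data is an assumption about how you would use it.

```perl
use strict;
use warnings;

# Hypothetical helper: pad with 'a' up to a multiple of four and return
# the padded sequence together with the original length.
sub pad_to_four {
    my $seq = shift;
    my $len = length $seq;
    my $pad = (4 - $len % 4) % 4;    # 0..3 extra bases needed
    return ($seq . ('a' x $pad), $len);
}

# Hypothetical helper: drop the padding again after decoding.
sub trim_to_length {
    my ($seq, $len) = @_;
    return substr $seq, 0, $len;
}

my ($padded, $len) = pad_to_four('ctaacg');
print "$padded ($len)\n";                      # ctaacgaa (6)
print trim_to_length($padded, $len), "\n";     # ctaacg
```

The padding base is arbitrary; what matters is that the true length travels with the encoded bytes.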

#! /usr/bin/perl -w

use strict;

my %lookup;    # four-base chunk => single byte
my %invert;    # single byte     => four-base chunk

my @base  = qw(a c g t);
my $count = 0;

# Build both tables: 4**4 = 256 chunks, one per byte value 0..255.
for my $x1 (@base) {
    for my $x2 (@base) {
        for my $x3 (@base) {
            for my $x4 (@base) {
                my $key   = chr($count++);
                my $chunk = "$x1$x2$x3$x4";
                $invert{$key}   = $chunk;
                $lookup{$chunk} = $key;
            }
        }
    }
}

# Demo: run every argument through both directions.
for my $seq (@ARGV) {
    print "squeeze => ",   squeeze($seq),   "\n";
    print "unsqueeze => ", unsqueeze($seq), "\n";
}

# Encode a sequence, four bases per byte.
sub squeeze {
    my $seq = shift;
    my $out = '';
    $out .= $lookup{lc $1} while ($seq =~ /(....)/g);
    return $out;
}

# Decode bytes back to bases, one byte per four-base chunk.
sub unsqueeze {
    my $seq = shift;
    my $out = '';
    $out .= $invert{$_} for split //, $seq;
    return $out;
}

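I suspect that this is going to be about as fast and efficient as you can get. Just take care when you transfer the encoded data to a UTF-8 system: the encoded bytes cover the full range 0 to 255, so anything above 127 will be mangled if it is treated as text :)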
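A quick sketch of what that UTF-8 caveat looks like in practice. The four byte values are arbitrary examples, and the in-memory filehandle only stands in for a real file; the point is simply that encoding the data as UTF-8 text changes it, while a handle in binmode leaves it alone.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Four example packed bytes; two of them are above 0x7F.
my $packed = pack 'C*', 0x70, 0xE4, 0xFF, 0x6C;

# What a UTF-8 text layer would do to the data: each byte above 0x7F
# turns into a two-byte sequence.
my $mangled = encode('UTF-8', $packed);

# Writing through a handle in binary mode keeps the bytes untouched.
# An in-memory handle is used here instead of a real file.
my $buffer = '';
open my $out, '>', \$buffer or die "cannot open in-memory handle: $!";
binmode $out;
print {$out} $packed;
close $out;

printf "packed: %d bytes, UTF-8 encoded: %d bytes, binmode write: %d bytes\n",
    length $packed, length $mangled, length $buffer;
# packed: 4 bytes, UTF-8 encoded: 6 bytes, binmode write: 4 bytes
```

So when the squeezed data goes to a file or a socket, calling binmode on the handle on both the writing and the reading side is probably the safest habit.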

• another intruder with the mooring in the heart of the Perl


In reply to Re: A better (problem-specific) perl hash? by grinder
in thread A better (problem-specific) perl hash? by srdst13
