in reply to Compressing String for Printing

What have you tried so far? Have you looked at the gzip or bzip2 programs? You can compress your file with them and then read from your file by opening a pipe to them:

    my $packer = 'bzip2';
    my $file   = 'data.txt.bz2';
    open my $fh, "$packer -cd $file |"
        or die "Couldn't decompress '$file': $!/$?";

Alternatively, you could encode each of the four characters into two bits, thus storing four characters per byte. I guess this approach won't be more efficient space-wise than the gzip or bzip2 approach, but it retains the ability to do random reading in your file:

    use strict;

    my %charmap = (
        A => '00',
        C => '01',
        G => '10',
        T => '11',
    );

    my $string = 'GATTACA';
    $string =~ s/(.)/$charmap{$1}/ge;
    print "$string\n";

    my $compressed = pack 'b*', $string;
    print "$compressed\n";
    printf "%d bytes\n", length $compressed;

    # now use vec() to get at the single parts of $compressed

    my $decompressed = unpack 'b*', $compressed;
    print "$decompressed\n";
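The vec() remark in the code above could be fleshed out roughly like this. It is only a minimal sketch: the helper names pack_bases and base_at and the numeric codes 0..3 are made up for illustration, and it writes the codes with vec() directly instead of going through pack 'b*'. With pack 'b*' the first character of each pair ends up as the low bit, so vec() with a width of 2 would read '10' back as 1 rather than 2.

```perl
use strict;
use warnings;

# Two-bit code per base (an arbitrary assignment for this sketch).
my @bases = qw(A C G T);
my %code;
@code{@bases} = (0 .. 3);

# Pack a sequence by writing one 2-bit element per base with vec().
sub pack_bases {
    my ($sequence) = @_;
    my $packed = '';
    my $i      = 0;
    for my $base (split //, $sequence) {
        die "Unexpected character '$base'" unless exists $code{$base};
        vec($packed, $i++, 2) = $code{$base};
    }
    return $packed;
}

# Fetch the base at a zero-based position without unpacking the rest.
sub base_at {
    my ($packed, $index) = @_;
    return $bases[ vec($packed, $index, 2) ];
}

my $sequence = 'GATTACA';
my $packed   = pack_bases($sequence);

printf "%d bases in %d bytes\n", length $sequence, length $packed;
print "base 3 is ", base_at($packed, 3), "\n";
```

Note that the packed string does not record how many bases it holds: unused positions in the last byte read back as 'A', so the sequence length would have to be stored separately.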

But have you looked at BioPerl? I'm pretty sure that they have support for that stuff.

Re^2: Compressing String for Printing
by eye (Chaplain) on Dec 26, 2008 at 07:15 UTC
    Corion's suggestion to use a compression program is a good one. If the sequences represented by the strings are from coding regions, it is likely that some sub-sequences (codons) occur with much higher frequency than others (e.g., stop codons). In that case, certain types of compression algorithms can potentially achieve better compression than the 4:1 you'd get with bit packing.