What have you tried so far? Have you looked at the gzip or bzip2 programs? You can compress your file with them and then read from your file by opening a pipe to them:
my $packer = 'bzip2';
my $file = 'data.txt.bz2';
# list-form pipe open: no shell involved, so odd characters in $file are safe
open my $fh, '-|', $packer, '-cd', $file
    or die "Couldn't decompress '$file': $!";
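Once the handle is open you read from it like any other file. If you would rather not depend on an external program, here is a rough sketch of the same idea using IO::Compress::Gzip and IO::Uncompress::Gunzip, which I believe ship with reasonably recent Perls. The sample data is made up, and I compress into a scalar only to keep the example self-contained; with file names in place of the scalar references it should work on real files too.

```perl
use strict;
use warnings;
use IO::Compress::Gzip     qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw($GunzipError);

# Made-up sample data standing in for the contents of the real file.
my $plain = "GATTACA\nCATTAG\n";

# Compress in memory (scalar refs are used here instead of file names).
my $gzipped;
gzip \$plain => \$gzipped
    or die "gzip failed: $GzipError";

# Open a read handle on the compressed data and read it line by line.
my $z = IO::Uncompress::Gunzip->new(\$gzipped)
    or die "gunzip failed: $GunzipError";

my @lines;
while (defined(my $line = $z->getline)) {
    chomp $line;
    push @lines, $line;
}
$z->close;

print "$_\n" for @lines;
```

This trades the pipe for a module, so it is a different technique from the one above, not a drop-in replacement.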
Alternatively, you could encode each of the four characters in two bits, thus storing four characters per byte. I guess this approach won't be more space-efficient than gzip or bzip2, but it retains the ability to read from arbitrary positions in your file:
use strict;
use warnings;
my %charmap = (
    A => '00',
    C => '01',
    G => '10',
    T => '11',
);
my $string = 'GATTACA';
$string =~ s/(.)/$charmap{$1}/ge;
print "$string\n";
my $compressed = pack 'b*', $string;
print "$compressed\n";
printf "%d bytes\n", length $compressed;
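# now use vec() to get at the single parts of $compressed
# unpack gives back the bit string, zero-padded to a whole number of bytes,
# so 7 characters (14 bits) come back as 16 bits
my $decompressed = unpack 'b*', $compressed;
print "$decompressed\n";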
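To make the vec() hint concrete, here is a minimal sketch of random access and of decoding back to letters. Two things are my own additions: the helper base_at() and the lookup table @by_value are hypothetical names, and I keep the original length in $length because pack pads the last byte with zero bits, which would otherwise decode as extra A's. Note that pack 'b*' stores the first bit of each pair as the low bit, so vec() returns 2 for C ('01') and 1 for G ('10').

```perl
use strict;
use warnings;

# Same 2-bit codes as above.
my %charmap = (A => '00', C => '01', G => '10', T => '11');

# Index = value returned by vec(); the order differs from the codes
# because pack 'b*' puts the first bit of each pair in the low position.
my @by_value = ('A', 'G', 'C', 'T');

my $string = 'GATTACA';
my $length = length $string;    # must be stored alongside the packed data
(my $bits = $string) =~ s/(.)/$charmap{$1}/g;
my $compressed = pack 'b*', $bits;

# Random access: the base at a 0-based position (hypothetical helper).
sub base_at {
    my ($data, $pos) = @_;
    return $by_value[ vec($data, $pos, 2) ];
}

# Full decode, stopping at $length so the padding bits are skipped.
my $decoded = join '', map { base_at($compressed, $_) } 0 .. $length - 1;

print base_at($compressed, 3), "\n";    # T
print "$decoded\n";                     # GATTACA
```

On a file you would seek to byte int($pos / 4), read one byte, and call vec() on that byte with index $pos % 4.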
But have you looked at BioPerl? I'm pretty sure that they have support for that stuff.