http://qs1969.pair.com?node_id=603111
Category: Miscellaneous
Author/Contact Info
Description:

This is a straightforward implementation of Huffman coding in pure Perl. The module is not intended to be used in real applications as such; rather, this is an educational module, which can be used to demonstrate how it can be done.

Included in the module documentation are a simple compression and an equally simple decompression program, which demonstrate how the module can be used in compressing text files.

UPDATE: use 5.008_000 instead of 5.000_008, of course. Thanks, geekphilosopher!

UPDATE 2: Removed the use line altogether. tye has good arguments why it's harmful.


package Huffman;

=head1 NAME

Huffman - Simple Huffman coding in pure Perl.

=head1 SYNOPSIS

 use Huffman qw(count_frequencies generate_huff huff dehuff);

 # Get the letter frequencies for the data we are interested in.
 $frequencies = count_frequencies(@some_strings);

 # Create a very bad but functional frequency list for English
 # letters and punctuation characters: ASCII characters from
 # 32 (space) to 126 (~), plus a couple of normal control
 # charaters. Seriously, this is likely the worst possible
 # frequency list.
 $frequencies = count_frequencies(
    "\n", "\t", "\r",
    map { chr } 32 .. 126
    );

 # Generate the Huffman code.
 my $code = generate_huff($frequencies);

 # The code can now be used in encoding.
 my $encoded_data = huff('Hello, world!', $code);

 print dehuff($encoded_data, $code), "\n";
 # Prints: Hello, world!
 
 # Comparison:
 print unpack('b*', 'Hello, world!'), "\n";
 print unpack('b*', huff('Hello, world!', $code)), "\n";


=head1 REQUIRES

Perl 5.8, L<List::Util>, L<Exporter>

=cut

use strict;
use warnings;
use List::Util qw(min);

use Exporter;
use vars qw(@ISA @EXPORT @EXPORT_OK %EXPORT_TAGS);
@ISA = qw(Exporter);

=head1 EXPORTS

Nothing by default. Exportable methods:

=over

=item - count_frequencies

=item - generate_huff

=item - huff

=item - dehuff

=back

Use

 use Huffman qw(:all);

to import everything.

=cut

@EXPORT = ();
@EXPORT_OK = qw(
    count_frequencies
    generate_huff
    huff
    dehuff
    );
%EXPORT_TAGS = ('all' => [ @EXPORT_OK ]);


=head1 DESCRIPTION

C<Huffman> implements Huffman coding in pure Perl. This is 
an educational module (at least the author felt educated after
implementing it), and should not really be used in serious
applications. The implementation is based on
L<http://en.wikipedia.org/wiki/Huffman_coding>; in particular,
C<generate_huff> uses the linear time algorithm. (However, it
sorts the weights first, which means it is actually O(n log n).)

The main inefficiency is that bit operations are not really bit
operations; rather, the module uses ASCII strings of 0s and 1s,
and only packs it into a real bit string after the encoded string
has been constructed. Similarly, while decoding, the module first
unpacks the bit string into an ASCII string, and only then does the
work. This means that both C<huff> and C<dehuff> need eight times
more memory to store the encoded string than is really needed.

The upside is that it was very easy and straightforward to
implement this way.

Creating a simple compression utility is not hard, either.
For example, this would do just fine:

 use strict;
 use warnings;
 use Huffman qw(:all);
 use Data::Dumper;

 for my $filename (@ARGV) {
     my $compressed_name = $filename . '.huff';

     if (-e $compressed_name) {
         warn "Will not overwrite $compressed_name";
         next;
     }

     open (FILE, '<', $filename) or die $!;
     $/ = undef;
     my $data = <FILE>;
     close FILE;

     my $freq = count_frequencies($data);
     my $code = generate_huff($freq);

     my $compressed = huff($data, $code);

     $Data::Dumper::Indent = 0;
     $Data::Dumper::Pair = '=>';
     open (FILE, '>', $compressed_name) or die $!;
     print FILE Dumper([ $code, $compressed ]);
     close FILE;
    
     # That's all, folks.
 }


The decompressor utility can then be written thus:

 use strict;
 use warnings;
 use Huffman qw(:all);

 for my $filename (@ARGV) {
     if (not -e $filename) {
         warn "$filename does not exist";
         next;
     }

     if ($filename !~ /\.huff$/) {
         warn "$filename does not end in .huff; will not process";
         next;
     }

     my $decompressed_name = $filename;
     $decompressed_name =~ s/\.huff$//;

     if (-e $decompressed_name) {
         warn "Will not overwrite $decompressed_name";
         next;
     }
     
     open (FILE, '<', $filename) or die $!;
     $/ = undef; 
     my ($code, $compressed);
     # BEWARE: using eval with arbitrary strings is dangerous!
     # Don't do this at home.
     { our $VAR1; eval <FILE>; ($code, $compressed) = @$VAR1; };
     close FILE;
     
     if ($@) {
         warn $@;
         next;
     }

     my $data = dehuff($compressed, $code);

     open(FILE, '>', $decompressed_name) or die $!;
     print FILE $data;
     close FILE;
 }


Simple, no? Of course, the programs above are inefficient with
large files, because they slurp the whole file into memory. Also,
there is only minimal error checking. A real-world compression and
decompression utility will need I<many> checks to guard against
bogus data, which have been omitted here for simplicity.


=head1 METHODS

=over

=item $frequencies = count_frequencies(@strings);

C<count_weights> counts the number of frequencies for each letter
appearing in @strings. These are returned in a hash where the keys
are the letters (once only, of course) and the values are their
frequencies.  A convenience function, really; the implementation
is simply a foreach (split//) loop.

Example:

 my $frequencies = count_frequencies('aab');
 use Data::Dumper;
 print Dumper($frequencies);
 
 # Output:
 #$VAR1 = {
 #          'a' => 2,
 #          'b' => 1
 #        };

=cut

sub count_frequencies {
    my (@strings) = @_;

    my $frequencies = ();
    for (@strings) {
        for (split//) {
            $frequencies->{$_}++;
        }
    }

    return $frequencies;
}


=item $huffman_code = generate_huff($weights);

C<generate_huff> generates the optimal Huffman code given $weights,
which is a hash whose keys are the alphabet and whose values are
the weights (of each letter in the alphabet). The weights are
supposed to be integers at least 1.

Example:

 my $code = generate_huff({
    'a' => 1,
    'b' => 2
 });
 use Data::Dumper;
 print Dumper($code);

 # Output:
 #$VAR1 = {
 #          '_eos' => '10',
 #          'a' => '11',
 #          'b' => '0'
 #        };


The function returns the Huffman code for the alphabet, including
a special symbol '_eos', which can be used in encoding and decoding
to denote the end of stream.

=cut

# O(n log n) implementation, straightforwardly from
# http://en.wikipedia.org/wiki/Huffman_coding.
sub generate_huff {
    my $weights = shift;

    my @weights;
    push @weights, { letter => '_eos', weight => 1, 0 => undef, 1 => u
+ndef};
    push @weights, map { 
            { letter => $_, weight => $weights->{$_}, 0 => undef, 1 =>
+ undef }
        } sort { $weights->{$a} <=> $weights->{$b} } keys %$weights;
    my @combined = ();

    local *min_dequeue = sub {
        return shift @weights if (not @combined);
        return shift @combined if (not @weights);

        # Using less than or equal here to break ties in favour
        # of the first queue.
        return shift @weights 
            if ($weights[0]->{weight} <= $combined[0]->{weight});
        return shift @combined;
    };

    while (@weights + @combined > 1) {
        my $first = min_dequeue();
        my $second = min_dequeue();
        push @combined, { 
            weight => $first->{weight} + $second->{weight},
            0 => $first,
            1 => $second,
        };
    }

    my $ret = ();
    local *preorder = sub {
        my ($node, $path) = @_;

        # Letters are always in the leaf nodes. (Otherwise an equivale
+nt
        # check would be to see if the node has children.)
        if (defined $node->{letter}) {
            $ret->{$node->{letter}} = $path;
            return;
        }
        preorder($node->{0}, $path . 0) if (defined $node->{0});
        preorder($node->{1}, $path . 1) if (defined $node->{1});
    };

    my $root = shift @combined;
    preorder($root, '');

    return $ret;
}


=item $encoded = huff($string, $huffman_code);

C<huff> encodes $string using the Huffman code in $huffman_code,
and returns the encoded bit string $huffman_code must be a hash,
and each character in $string must exist as a key in the hash;
in other words, each character in $string must have an encoding.

Example:

 my $enc = huff('aaba', { 'a' => '11', 'b' => '0', '_eos' => '10' });
 print unpack('b*', $enc);
 # Output: 1111011100000000
 # 

(unpack() appends extra zeroes at the end in order to
align the output bits to an octet boundary. This is OK, and
this is the reason we have the _eos marker: if the encoded
bit string is written to a file, it will be aligned similarly
to an octet boundary. C<dehuff> will know when to stop
decoding when it encounters _eos.)

The values in $huffman_code must be strings. The hash
must also contain the key '_eos', whose value indicates
the end of stream. C<huff> will append this at the
end of the encoded bit string.

=cut

sub huff {
    my $string = shift;
    my $codes = shift;

    my $ret = '';
    for (split//, $string) {
        $ret .= $codes->{$_};
    }
    $ret .= $codes->{_eos};

    return pack('b*', $ret);
}


=item $decoded = dehuff($string, $huffman_code);

C<dehuff> decodes a Huffman-encoded string $string using
the Huffman code in $huffman_code. The restrictions of
$huffman_code is the same as with C<huff>. $string must
be a bit string.

Example:

 my $code = { 'a' => '11', 'b' => '0', '_eos' => '10'};

 print dehuff(huff('aaba', $code), $code), "\n";
 # Output: aaba

=cut

sub dehuff { 
    my $string = shift;
    my $code = shift;
    my %decode = map { $code->{$_} => $_ } keys %$code;

    my $ret = ''; my $c = '';
    for (split//, unpack('b*', $string)) {
        $c .= $_;
        next unless (exists $decode{$c});
        last if ($decode{$c} eq '_eos');

        $ret .= $decode{$c};
        $c = '';
    }

    return $ret;
}


=back

=head1 SEE ALSO

A great tutorial into Huffman coding can be found at
L<http://www.cs.duke.edu/csed/poop/huff/info/>.

CPAN also has a module called L<Algorithm::Huffman> by Janek
Schleicher, which apparently works very much like this module. (The
author discovered its existance only after writing the module.)

=head1 AUTHOR

Ville R. Koskinen, E<lt>w-ber@iki.fiE<gt>

=head1 COPYRIGHT

I, the author of this work, hereby release it into the public domain. 

In other words, do whatever you wish to do with this piece of code.

=cut


1;