Huffman coding in pure Perl

Category:	Miscellaneous
Author/Contact Info
Description:	This is a straightforward implementation of Huffman coding in pure Perl. The module is not intended to be used in real applications as such; rather, this is an educational module, which can be used to demonstrate how it can be done. Included in the module documentation are a simple compression and an equally simple decompression program, which demonstrate how the module can be used in compressing text files. UPDATE: use 5.008_000 instead of 5.000_008, of course. Thanks, geekphilosopher! UPDATE 2: Removed the use line altogether. tye has good arguments why it's harmful.
package Huffman; =head1 NAME Huffman - Simple Huffman coding in pure Perl. =head1 SYNOPSIS use Huffman qw(count_frequencies generate_huff huff dehuff); # Get the letter frequencies for the data we are interested in. $frequencies = count_frequencies(@some_strings); # Create a very bad but functional frequency list for English # letters and punctuation characters: ASCII characters from # 32 (space) to 126 (~), plus a couple of normal control # charaters. Seriously, this is likely the worst possible # frequency list. $frequencies = count_frequencies( "\n", "\t", "\r", map { chr } 32 .. 126 ); # Generate the Huffman code. my $code = generate_huff($frequencies); # The code can now be used in encoding. my $encoded_data = huff('Hello, world!', $code); print dehuff($encoded_data, $code), "\n"; # Prints: Hello, world! # Comparison: print unpack('b', 'Hello, world!'), "\n"; print unpack('b', huff('Hello, world!', $code)), "\n"; =head1 REQUIRES Perl 5.8, L<List::Util>, L<Exporter> =cut use strict; use warnings; use List::Util qw(min); use Exporter; use vars qw(@ISA @EXPORT @EXPORT_OK %EXPORT_TAGS); @ISA = qw(Exporter); =head1 EXPORTS Nothing by default. Exportable methods: =over =item - count_frequencies =item - generate_huff =item - huff =item - dehuff =back Use use Huffman qw(:all); to import everything. =cut @EXPORT = (); @EXPORT_OK = qw( count_frequencies generate_huff huff dehuff ); %EXPORT_TAGS = ('all' => [ @EXPORT_OK ]); =head1 DESCRIPTION C<Huffman> implements Huffman coding in pure Perl. This is an educational module (at least the author felt educated after implementing it), and should not really be used in serious applications. The implementation is based on L<http://en.wikipedia.org/wiki/Huffman_coding>; in particular, C<generate_huff> uses the linear time algorithm. (However, it sorts the weights first, which means it is actually O(n log n).) The main inefficiency is that bit operations are not really bit operations; rather, the module uses ASCII strings of 0s and 1s, and only packs it into a real bit string after the encoded string has been constructed. Similarly, while decoding, the module first unpacks the bit string into an ASCII string, and only then does the work. This means that both C<huff> and C<dehuff> need eight times more memory to store the encoded string than is really needed. The upside is that it was very easy and straightforward to implement this way. Creating a simple compression utility is not hard, either. For example, this would do just fine: use strict; use warnings; use Huffman qw(:all); use Data::Dumper; for my $filename (@ARGV) { my $compressed_name = $filename . '.huff'; if (-e $compressed_name) { warn "Will not overwrite $compressed_name"; next; } open (FILE, '<', $filename) or die $!; $/ = undef; my $data = <FILE>; close FILE; my $freq = count_frequencies($data); my $code = generate_huff($freq); my $compressed = huff($data, $code); $Data::Dumper::Indent = 0; $Data::Dumper::Pair = '=>'; open (FILE, '>', $compressed_name) or die $!; print FILE Dumper([ $code, $compressed ]); close FILE; # That's all, folks. } The decompressor utility can then be written thus: use strict; use warnings; use Huffman qw(:all); for my $filename (@ARGV) { if (not -e $filename) { warn "$filename does not exist"; next; } if ($filename !~ /\.huff$/) { warn "$filename does not end in .huff; will not process"; next; } my $decompressed_name = $filename; $decompressed_name =~ s/\.huff$//; if (-e $decompressed_name) { warn "Will not overwrite $decompressed_name"; next; } open (FILE, '<', $filename) or die $!; $/ = undef; my ($code, $compressed); # BEWARE: using eval with arbitrary strings is dangerous! # Don't do this at home. { our $VAR1; eval <FILE>; ($code, $compressed) = @$VAR1; }; close FILE; if ($@) { warn $@; next; } my $data = dehuff($compressed, $code); open(FILE, '>', $decompressed_name) or die $!; print FILE $data; close FILE; } Simple, no? Of course, the programs above are inefficient with large files, because they slurp the whole file into memory. Also, there is only minimal error checking. A real-world compression and decompression utility will need I<many> checks to guard against bogus data, which have been omitted here for simplicity. =head1 METHODS =over =item $frequencies = count_frequencies(@strings); C<count_weights> counts the number of frequencies for each letter appearing in @strings. These are returned in a hash where the keys are the letters (once only, of course) and the values are their frequencies. A convenience function, really; the implementation is simply a foreach (split//) loop. Example: my $frequencies = count_frequencies('aab'); use Data::Dumper; print Dumper($frequencies); # Output: #$VAR1 = { # 'a' => 2, # 'b' => 1 # }; =cut sub count_frequencies { my (@strings) = @_; my $frequencies = (); for (@strings) { for (split//) { $frequencies->{$_}++; } } return $frequencies; } =item $huffman_code = generate_huff($weights); C<generate_huff> generates the optimal Huffman code given $weights, which is a hash whose keys are the alphabet and whose values are the weights (of each letter in the alphabet). The weights are supposed to be integers at least 1. Example: my $code = generate_huff({ 'a' => 1, 'b' => 2 }); use Data::Dumper; print Dumper($code); # Output: #$VAR1 = { # '_eos' => '10', # 'a' => '11', # 'b' => '0' # }; The function returns the Huffman code for the alphabet, including a special symbol '_eos', which can be used in encoding and decoding to denote the end of stream. =cut # O(n log n) implementation, straightforwardly from # http://en.wikipedia.org/wiki/Huffman_coding. sub generate_huff { my $weights = shift; my @weights; push @weights, { letter => '_eos', weight => 1, 0 => undef, 1 => u +ndef}; push @weights, map { { letter => $_, weight => $weights->{$_}, 0 => undef, 1 => + undef } } sort { $weights->{$a} <=> $weights->{$b} } keys %$weights; my @combined = (); local min_dequeue = sub { return shift @weights if (not @combined); return shift @combined if (not @weights); # Using less than or equal here to break ties in favour # of the first queue. return shift @weights if ($weights[0]->{weight} <= $combined[0]->{weight}); return shift @combined; }; while (@weights + @combined > 1) { my $first = min_dequeue(); my $second = min_dequeue(); push @combined, { weight => $first->{weight} + $second->{weight}, 0 => $first, 1 => $second, }; } my $ret = (); local preorder = sub { my ($node, $path) = @_; # Letters are always in the leaf nodes. (Otherwise an equivale +nt # check would be to see if the node has children.) if (defined $node->{letter}) { $ret->{$node->{letter}} = $path; return; } preorder($node->{0}, $path . 0) if (defined $node->{0}); preorder($node->{1}, $path . 1) if (defined $node->{1}); }; my $root = shift @combined; preorder($root, ''); return $ret; } =item $encoded = huff($string, $huffman_code); C<huff> encodes $string using the Huffman code in $huffman_code, and returns the encoded bit string $huffman_code must be a hash, and each character in $string must exist as a key in the hash; in other words, each character in $string must have an encoding. Example: my $enc = huff('aaba', { 'a' => '11', 'b' => '0', '_eos' => '10' }); print unpack('b', $enc); # Output: 1111011100000000 # (unpack() appends extra zeroes at the end in order to align the output bits to an octet boundary. This is OK, and this is the reason we have the _eos marker: if the encoded bit string is written to a file, it will be aligned similarly to an octet boundary. C<dehuff> will know when to stop decoding when it encounters _eos.) The values in $huffman_code must be strings. The hash must also contain the key '_eos', whose value indicates the end of stream. C<huff> will append this at the end of the encoded bit string. =cut sub huff { my $string = shift; my $codes = shift; my $ret = ''; for (split//, $string) { $ret .= $codes->{$_}; } $ret .= $codes->{_eos}; return pack('b', $ret); } =item $decoded = dehuff($string, $huffman_code); C<dehuff> decodes a Huffman-encoded string $string using the Huffman code in $huffman_code. The restrictions of $huffman_code is the same as with C<huff>. $string must be a bit string. Example: my $code = { 'a' => '11', 'b' => '0', '_eos' => '10'}; print dehuff(huff('aaba', $code), $code), "\n"; # Output: aaba =cut sub dehuff { my $string = shift; my $code = shift; my %decode = map { $code->{$_} => $_ } keys %$code; my $ret = ''; my $c = ''; for (split//, unpack('b*', $string)) { $c .= $_; next unless (exists $decode{$c}); last if ($decode{$c} eq '_eos'); $ret .= $decode{$c}; $c = ''; } return $ret; } =back =head1 SEE ALSO A great tutorial into Huffman coding can be found at L<http://www.cs.duke.edu/csed/poop/huff/info/>. CPAN also has a module called L<Algorithm::Huffman> by Janek Schleicher, which apparently works very much like this module. (The author discovered its existance only after writing the module.) =head1 AUTHOR Ville R. Koskinen, E<lt>w-ber@iki.fiE<gt> =head1 COPYRIGHT I, the author of this work, hereby release it into the public domain. In other words, do whatever you wish to do with this piece of code. =cut 1;

Back to Code Catacombs