Huffman coding in pure Perl

Category:	Miscellaneous
Author/Contact Info
Description:	This is a straightforward implementation of Huffman coding in pure Perl. The module is not intended to be used in real applications as such; rather, this is an educational module, which can be used to demonstrate how it can be done. Included in the module documentation are a simple compression and an equally simple decompression program, which demonstrate how the module can be used in compressing text files. UPDATE: use 5.008_000 instead of 5.000_008, of course. Thanks, geekphilosopher! UPDATE 2: Removed the use line altogether. tye has good arguments why it's harmful.
package Huffman; =head1 NAME Huffman - Simple Huffman coding in pure Perl. =head1 SYNOPSIS use Huffman qw(count_frequencies generate_huff huff dehuff); # Get the letter frequencies for the data we are interested in. $frequencies = count_frequencies(@some_strings); # Create a very bad but functional frequency list for English # letters and punctuation characters: ASCII characters from # 32 (space) to 126 (~), plus a couple of normal control # charaters. Seriously, this is likely the worst possible # frequency list. $frequencies = count_frequencies( "\n", "\t", "\r", map { chr } 32 .. 126 ); # Generate the Huffman code. my $code = generate_huff($frequencies); # The code can now be used in encoding. my $encoded_data = huff('Hello, world!', $code); print dehuff($encoded_data, $code), "\n"; # Prints: Hello, world! # Comparison: print unpack('b', 'Hello, world!'), "\n"; print unpack('b', huff('Hello, world!', $code)), "\n"; =head1 REQUIRES Perl 5.8, L<List::Util>, L<Exporter> =cut use strict; use warnings; use List::Util qw(min); use Exporter; use vars qw(@ISA @EXPORT @EXPORT_OK %EXPORT_TAGS); @ISA = qw(Exporter); =head1 EXPORTS Nothing by default. Exportable methods: =over =item - count_frequencies =item - generate_huff =item - huff =item - dehuff =back Use use Huffman qw(:all); to import everything. =cut @EXPORT = (); @EXPORT_OK = qw( count_frequencies generate_huff huff dehuff ); %EXPORT_TAGS = ('all' => [ @EXPORT_OK ]); =head1 DESCRIPTION C<Huffman> implements Huffman coding in pure Perl. This is an educational module (at least the author felt educated after implementing it), and should not really be used in serious applications. The implementation is based on L<http://en.wikipedia.org/wiki/Huffman_coding>; in particular, C<generate_huff> uses the linear time algorithm. (However, it sorts the weights first, which means it is actually O(n log n).) The main inefficiency is that bit operations are not really bit operations; rather, the module uses ASCII strings of 0s and 1s, and only packs it into a real bit string after the encoded string has been constructed. Similarly, while decoding, the module first unpacks the bit string into an ASCII string, and only then does the work. This means that both C<huff> and C<dehuff> need eight times more memory to store the encoded string than is really needed. The upside is that it was very easy and straightforward to implement this way. Creating a simple compression utility is not hard, either. For example, this would do just fine: use strict; use warnings; use Huffman qw(:all); use Data::Dumper; for my $filename (@ARGV) { my $compressed_name = $filename . '.huff'; if (-e $compressed_name) { warn "Will not overwrite $compressed_name"; next; } open (FILE, '<', $filename) or die $!; $/ = undef; my $data = <FILE>; close FILE; my $freq = count_frequencies($data); my $code = generate_huff($freq); my $compressed = huff($data, $code); $Data::Dumper::Indent = 0; $Data::Dumper::Pair = '=>'; open (FILE, '>', $compressed_name) or die $!; print FILE Dumper([ $code, $compressed ]); close FILE; # That's all, folks. } The decompressor utility can then be written thus: use strict; use warnings; use Huffman qw(:all); for my $filename (@ARGV) { if (not -e $filename) { warn "$filename does not exist"; next; } if ($filename !~ /\.huff$/) { warn "$filename does not end in .huff; will not process"; next; } my $decompressed_name = $filename; $decompressed_name =~ s/\.huff$//; if (-e $decompressed_name) { warn "Will not overwrite $decompressed_name"; next; } open (FILE, '<', $filename) or die $!; $/ = undef; my ($code, $compressed); # BEWARE: using eval with arbitrary strings is dangerous! # Don't do this at home. { our $VAR1; eval <FILE>; ($code, $compressed) = @$VAR1; }; close FILE; if ($@) { warn $@; next; } my $data = dehuff($compressed, $code); open(FILE, '>', $decompressed_name) or die $!; print FILE $data; close FILE; } Simple, no? Of course, the programs above are inefficient with large files, because they slurp the whole file into memory. Also, there is only minimal error checking. A real-world compression and decompression utility will need I<many> checks to guard against bogus data, which have been omitted here for simplicity. =head1 METHODS =over =item $frequencies = count_frequencies(@strings); C<count_weights> counts the number of frequencies for each letter appearing in @strings. These are returned in a hash where the keys are the letters (once only, of course) and the values are their frequencies. A convenience function, really; the implementation is simply a foreach (split//) loop. Example: my $frequencies = count_frequencies('aab'); use Data::Dumper; print Dumper($frequencies); # Output: #$VAR1 = { # 'a' => 2, # 'b' => 1 # }; =cut sub count_frequencies { my (@strings) = @_; my $frequencies = (); for (@strings) { for (split//) { $frequencies->{$_}++; } } return $frequencies; } =item $huffman_code = generate_huff($weights); C<generate_huff> generates the optimal Huffman code given $weights, which is a hash whose keys are the alphabet and whose values are the weights (of each letter in the alphabet). The weights are supposed to be integers at least 1. Example: my $code = generate_huff({ 'a' => 1, 'b' => 2 }); use Data::Dumper; print Dumper($code); # Output: #$VAR1 = { # '_eos' => '10', # 'a' => '11', # 'b' => '0' # }; The function returns the Huffman code for the alphabet, including a special symbol '_eos', which can be used in encoding and decoding to denote the end of stream. =cut # O(n log n) implementation, straightforwardly from # http://en.wikipedia.org/wiki/Huffman_coding. sub generate_huff { my $weights = shift; my @weights; push @weights, { letter => '_eos', weight => 1, 0 => undef, 1 => u +ndef}; push @weights, map { { letter => $_, weight => $weights->{$_}, 0 => undef, 1 => + undef } } sort { $weights->{$a} <=> $weights->{$b} } keys %$weights; my @combined = (); local min_dequeue = sub { return shift @weights if (not @combined); return shift @combined if (not @weights); # Using less than or equal here to break ties in favour # of the first queue. return shift @weights if ($weights[0]->{weight} <= $combined[0]->{weight}); return shift @combined; }; while (@weights + @combined > 1) { my $first = min_dequeue(); my $second = min_dequeue(); push @combined, { weight => $first->{weight} + $second->{weight}, 0 => $first, 1 => $second, }; } my $ret = (); local preorder = sub { my ($node, $path) = @_; # Letters are always in the leaf nodes. (Otherwise an equivale +nt # check would be to see if the node has children.) if (defined $node->{letter}) { $ret->{$node->{letter}} = $path; return; } preorder($node->{0}, $path . 0) if (defined $node->{0}); preorder($node->{1}, $path . 1) if (defined $node->{1}); }; my $root = shift @combined; preorder($root, ''); return $ret; } =item $encoded = huff($string, $huffman_code); C<huff> encodes $string using the Huffman code in $huffman_code, and returns the encoded bit string $huffman_code must be a hash, and each character in $string must exist as a key in the hash; in other words, each character in $string must have an encoding. Example: my $enc = huff('aaba', { 'a' => '11', 'b' => '0', '_eos' => '10' }); print unpack('b', $enc); # Output: 1111011100000000 # (unpack() appends extra zeroes at the end in order to align the output bits to an octet boundary. This is OK, and this is the reason we have the _eos marker: if the encoded bit string is written to a file, it will be aligned similarly to an octet boundary. C<dehuff> will know when to stop decoding when it encounters _eos.) The values in $huffman_code must be strings. The hash must also contain the key '_eos', whose value indicates the end of stream. C<huff> will append this at the end of the encoded bit string. =cut sub huff { my $string = shift; my $codes = shift; my $ret = ''; for (split//, $string) { $ret .= $codes->{$_}; } $ret .= $codes->{_eos}; return pack('b', $ret); } =item $decoded = dehuff($string, $huffman_code); C<dehuff> decodes a Huffman-encoded string $string using the Huffman code in $huffman_code. The restrictions of $huffman_code is the same as with C<huff>. $string must be a bit string. Example: my $code = { 'a' => '11', 'b' => '0', '_eos' => '10'}; print dehuff(huff('aaba', $code), $code), "\n"; # Output: aaba =cut sub dehuff { my $string = shift; my $code = shift; my %decode = map { $code->{$_} => $_ } keys %$code; my $ret = ''; my $c = ''; for (split//, unpack('b*', $string)) { $c .= $_; next unless (exists $decode{$c}); last if ($decode{$c} eq '_eos'); $ret .= $decode{$c}; $c = ''; } return $ret; } =back =head1 SEE ALSO A great tutorial into Huffman coding can be found at L<http://www.cs.duke.edu/csed/poop/huff/info/>. CPAN also has a module called L<Algorithm::Huffman> by Janek Schleicher, which apparently works very much like this module. (The author discovered its existance only after writing the module.) =head1 AUTHOR Ville R. Koskinen, E<lt>w-ber@iki.fiE<gt> =head1 COPYRIGHT I, the author of this work, hereby release it into the public domain. In other words, do whatever you wish to do with this piece of code. =cut 1;

Comment on Huffman coding in pure Perl Download Code

Replies are listed 'Best First'.
Re: Huffman coding in pure Perl by Anno (Deacon) on Mar 04, 2007 at 19:20 UTC
The main inefficiency is that bit operations are not really bit operations; rather, the module uses ASCII strings of 0s and 1s, and only packs it into a real bit string after the encoded string has been constructed. ... On the encoding side, it is only slightly harder to build the bit vector directly, using `vec()` instead of `pack 'b'`. `sub huff1 { my ( $string, $codes) = @_; my ( $ret, $len) = ( '', 0); # $len counts bits for ( split( //, $string), '_eos' ) { vec( $ret, $len ++, 1) = $_ for split //, $codes->{ $_}; } return $ret; }` [download] The decoding side is harder. Update: No, it isn't. Dehuff is as easily adapted to `vec()`. `sub dehuff1 { my ( $string, $code) = @_; my %decode = reverse %$code; my ( $ret, $c, $where) = ( '', '', 0); while ( 1 ) { $c .= vec( $string, $where ++, 1); next unless exists $decode{ $c}; last if $decode{ $c} eq '_eos'; $ret .= $decode{ $c}; $c = ''; } return $ret; }` [download] Anno	[reply] [d/l] [select]
Re^2: Huffman coding in pure Perl by vrk (Chaplain) on Mar 04, 2007 at 19:46 UTC
I know. I originally had `vec` in mind, but doing the decoding that way felt like too much of a drag, and I don't really need this module. The current implementation is arguably easier to understand, which was what I wanted to do: understand the algorithm. -- print "Just Another Perl Adept\n";	[reply] [d/l]
Re: Huffman coding in pure Perl by geekphilosopher (Friar) on Mar 04, 2007 at 17:26 UTC
The line `use 5.000_008;` [download] says you require Perl 5.0.8. Perhaps you meant `use 5.008_000;` [download] See perldoc -f use for more info.	[reply] [d/l] [select]
Re^2: Huffman coding in pure Perl by merlyn (Sage) on Mar 04, 2007 at 18:55 UTC
"require 5.0.8": Must run on a Perl released in the past twelve years. {grin} -- Randal L. Schwartz, Perl hacker	[reply]
Re^2: Huffman coding in pure Perl by vrk (Chaplain) on Mar 04, 2007 at 17:47 UTC
Yes, thank you. That was a silly typo. Fixed now. -- print "Just Another Perl Adept\n";	[reply]
Re^3: Huffman coding in pure Perl (why?) by tye (Sage) on Mar 04, 2007 at 19:51 UTC
Why is that line even there at all? The meaning of that line is "This is guaranteed to not work in versions prior to 5.008". Why doesn't it work in those older versions? It certainly compiles fine (once that line is removed). So it appears that the purpose of that line may simply be to guarantee that the module won't work for versions prior to 5.008 ( do you dislike old Perl versions? :). You should at least comment the line so people know why you know your module is incompatible with pre-5.008 Perls. Or don't include such lines if you don't actually know that your module "requires 5.008". - tye	[reply]
Re^4: Huffman coding in pure Perl (why?) by vrk (Chaplain) on Mar 04, 2007 at 20:22 UTC
Re^5: Huffman coding in pure Perl (why?) by tye (Sage) on Mar 04, 2007 at 21:34 UTC
Some notes below your chosen depth have not been shown here
Re: Huffman coding in pure Perl by gone2015 (Deacon) on Nov 14, 2008 at 19:08 UTC
I have learned something today. Thank you. All the Huffman coders I've seen (to date) have used a priority queue while constructing the tree -- I, and many others, favour a heap, given that only partial sorting is required. Your code uses two queues, one for the original character (leaf) nodes and another for the internal (combined) nodes. During the tree construction, the `min_dequeue` subroutine has to make choice between the two queues. I stared at this for a while, and couldn't help feeling it was unnecessarily complicated. Then I noticed that new combined nodes are just pushed onto the `@combined` list, without any further sorting. This looked like a bug to me.... However, I have more or less convinced myself that this is more subtle than it looks, and the procedure ensures that the `@combined` list remains in order. I imagine that the correctness can be proved by induction. If your code is for pedalogical purposes, it would be good to explain the subtlety and provide the proof (which I'd love to see, BTW !). The preference for taking items from the leaves (`@weights`) list is also subtle -- it minimises tree height. Your code fully sorts leaf weights once, at the cost of the overhead in `min_dequeue`. A priority queue implementation doesn't have to fully sort, but does have to adjust the queue each time two nodes are combined, but then the choice of nodes to combine is trivial. I wonder what the trade off is -- which might be interesting from a pedalogical perspective. Of course, for a Perl implementation without a heap to hand.... Allocating the codes by walking the tree is a straightforward application of the technique. The fact that this produces a unique prefix code never ceases to amaze -- when you look at them, the codes can appear entirely arbitrary. For extra credit (as they say) is the problem of how to represent a given encoding, particularly when sending it in-band with the compressed data -- when you want to minimise the overhead. While not being high-tone-genuine-computer-science-shite, it is of pratical concern ! Your code is relying heavily on hashes, which makes perfect sense in Perl. For Huffman encode, you don't have much choice, you need some sort of table to map from symbol to encoding. For Huffman decode, you can reconstruct the binary tree and walk that, but that's not cheap... so what does one do ? When I last did Huffman, I decoupled the code allocation from the tree walk: when walking the tree I allocated only a code length to each symbol. Having done that, I allocated "huffman ordinals" and code values to the symbols, starting with 0 for lowest numbered symbol at the shortest length. For example: Symbol : Weight : Length : huff ord : Code -------:--------:--------:----------:------ 'l' : 55 : 2 : 0 : 00 'o' : 31 : 2 : 1 : 01 '!' : 13 : 3 : 2 : 100 'H' : 14 : 3 : 3 : 101 ',' : 10 : 4 : 4 : 1100 'd' : 8 : 4 : 5 : 1101 'r' : 12 : 4 : 6 : 1110 '\0' : 1 : 6 : 7 : 111100 ' ' : 1 : 6 : 8 : 111101 'e' : 1 : 6 : 9 : 111110 'w' : 2 : 6 : 10 : 111111 This encoding can then be represented as three arrays: 1. the symbol lengths: [ 2 3 4 6 ] 2. number of symbols at each length: [ 2 2 3 4 ] 3. the symbols by "huffman ordinal" [ 'l' 'o' '!' 'H' ',' 'd' 'r' '\0' ' ' 'e' 'w'] which can be used to drive the decoder, directly -- no other table or index or anything else is required. Anyway, and FWIW, I enclose the guts of a Huffman module which uses a priority queue and allocates codes as above. Any and all comments gratefully received, noting that I'm not pretending it's shippable in it's current state ! Smacking the bits around is obviously not something that Perl excels at, so this is definitely written with a view to translating the encode/decode to, say, 'C'. Before you conclude that this code is actually translated from 'C' to Perl in the first place, let me just say that the first time I tangled with this stuff, it was in Z80 assembler :-). (When men were men, and sheep were nervous.) Finally, if the tree construction does not take steps to minimise the tree height, the following can happen: : : w-ber :: oshalla Symbol : Weight : Length : Code :: Length : Code -------:--------:--------:---------::------------------ ' ' : 8 : 3 : 111 :: 2 : 00 'o' : 5 : 3 : 101 :: 3 : 010 'a' : 2 : 4 : 0001 :: 4 : 0110 'd' : 3 : 4 : 1100 :: 4 : 0111 'e' : 2 : 4 : 0100 :: 4 : 1000 'l' : 3 : 4 : 1101 :: 4 : 1001 'r' : 2 : 4 : 0010 :: 4 : 1010 'w' : 2 : 4 : 0000 :: 4 : 1011 '\0' : 1 : 5 : 01010 :: 5 : 11000 '!' : 1 : 5 : 01111 :: 5 : 11001 ',' : 1 : 5 : 01011 :: 5 : 11010 '?' : 1 : 5 : 10010 :: 5 : 11011 'y' : 2 : 4 : 0011 :: 5 : 11100 'A' : 1 : 5 : 10000 :: 6 : 111010 'H' : 1 : 5 : 10011 :: 6 : 111011 'h' : 1 : 5 : 01101 :: 6 : 111100 'n' : 1 : 5 : 10001 :: 6 : 111101 't' : 1 : 5 : 01110 :: 6 : 111110 'u' : 1 : 5 : 01100 :: 6 : 111111 as I discovered in testing. Both encodings gave exactly the same length encoded strings -- so they are both "optimal", just different. Anyway, thank you for a wonderful time. I really must get home, now. Read more... See the code (10 kB)	[reply] [d/l] [select]
Re^2: Huffman coding in pure Perl by Anonymous Monk on Dec 27, 2012 at 22:58 UTC
Where is the final result of the code where to find it after dhe compile finish the work ?	[reply]
Re: Huffman coding in pure Perl by Anonymous Monk on Nov 14, 2008 at 07:32 UTC
Hi! This variant of count_frequencies works much faster than original: `sub count_frequencies { my (@strings) = @_; my $frequencies = (); for (@strings) { foreach my $i (0..length($_)-1) { $frequencies->{substr($_, $i,1)}++; } } return $frequencies; }` [download]	[reply] [d/l]
Re: Huffman coding in pure Perl by Sixtease (Friar) on Mar 11, 2007 at 23:48 UTC
`use warnings` in modules can cause troubles. Or at least this thread seems to suggest so.	[reply] [d/l]
Re: Huffman coding in pure Perl by ysth (Canon) on Mar 12, 2007 at 23:58 UTC
There's a slightly more compact decoder in Red Perl part I. I'm looking for a job.	[reply]
Re^2: Huffman coding in pure Perl by vrk (Chaplain) on Mar 13, 2007 at 13:53 UTC
Indeed. I am inclined to say that mine is easier to grok, though. Red Perl part I has quite a few sharp edges that can severly hurt your digestive system. -- print "Just Another Perl Adept\n";	[reply]


Don't ask to ask, just ask
	PerlMonks