Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask
 
PerlMonks  

Huffman coding in pure Perl

by vrk (Chaplain)
on Mar 04, 2007 at 11:35 UTC ( [id://603111]=sourcecode: print w/replies, xml ) Need Help??
Category: Miscellaneous
Author/Contact Info
Description:

This is a straightforward implementation of Huffman coding in pure Perl. The module is not intended to be used in real applications as such; rather, this is an educational module, which can be used to demonstrate how it can be done.

Included in the module documentation are a simple compression and an equally simple decompression program, which demonstrate how the module can be used in compressing text files.

UPDATE: use 5.008_000 instead of 5.000_008, of course. Thanks, geekphilosopher!

UPDATE 2: Removed the use line altogether. tye has good arguments why it's harmful.


package Huffman;

=head1 NAME

Huffman - Simple Huffman coding in pure Perl.

=head1 SYNOPSIS

 use Huffman qw(count_frequencies generate_huff huff dehuff);

 # Get the letter frequencies for the data we are interested in.
 $frequencies = count_frequencies(@some_strings);

 # Create a very bad but functional frequency list for English
 # letters and punctuation characters: ASCII characters from
 # 32 (space) to 126 (~), plus a couple of normal control
 # charaters. Seriously, this is likely the worst possible
 # frequency list.
 $frequencies = count_frequencies(
    "\n", "\t", "\r",
    map { chr } 32 .. 126
    );

 # Generate the Huffman code.
 my $code = generate_huff($frequencies);

 # The code can now be used in encoding.
 my $encoded_data = huff('Hello, world!', $code);

 print dehuff($encoded_data, $code), "\n";
 # Prints: Hello, world!
 
 # Comparison:
 print unpack('b*', 'Hello, world!'), "\n";
 print unpack('b*', huff('Hello, world!', $code)), "\n";


=head1 REQUIRES

Perl 5.8, L<List::Util>, L<Exporter>

=cut

use strict;
use warnings;
use List::Util qw(min);

use Exporter;
use vars qw(@ISA @EXPORT @EXPORT_OK %EXPORT_TAGS);
@ISA = qw(Exporter);

=head1 EXPORTS

Nothing by default. Exportable methods:

=over

=item - count_frequencies

=item - generate_huff

=item - huff

=item - dehuff

=back

Use

 use Huffman qw(:all);

to import everything.

=cut

@EXPORT = ();
@EXPORT_OK = qw(
    count_frequencies
    generate_huff
    huff
    dehuff
    );
%EXPORT_TAGS = ('all' => [ @EXPORT_OK ]);


=head1 DESCRIPTION

C<Huffman> implements Huffman coding in pure Perl. This is 
an educational module (at least the author felt educated after
implementing it), and should not really be used in serious
applications. The implementation is based on
L<http://en.wikipedia.org/wiki/Huffman_coding>; in particular,
C<generate_huff> uses the linear time algorithm. (However, it
sorts the weights first, which means it is actually O(n log n).)

The main inefficiency is that bit operations are not really bit
operations; rather, the module uses ASCII strings of 0s and 1s,
and only packs it into a real bit string after the encoded string
has been constructed. Similarly, while decoding, the module first
unpacks the bit string into an ASCII string, and only then does the
work. This means that both C<huff> and C<dehuff> need eight times
more memory to store the encoded string than is really needed.

The upside is that it was very easy and straightforward to
implement this way.

Creating a simple compression utility is not hard, either.
For example, this would do just fine:

 use strict;
 use warnings;
 use Huffman qw(:all);
 use Data::Dumper;

 for my $filename (@ARGV) {
     my $compressed_name = $filename . '.huff';

     if (-e $compressed_name) {
         warn "Will not overwrite $compressed_name";
         next;
     }

     open (FILE, '<', $filename) or die $!;
     $/ = undef;
     my $data = <FILE>;
     close FILE;

     my $freq = count_frequencies($data);
     my $code = generate_huff($freq);

     my $compressed = huff($data, $code);

     $Data::Dumper::Indent = 0;
     $Data::Dumper::Pair = '=>';
     open (FILE, '>', $compressed_name) or die $!;
     print FILE Dumper([ $code, $compressed ]);
     close FILE;
    
     # That's all, folks.
 }


The decompressor utility can then be written thus:

 use strict;
 use warnings;
 use Huffman qw(:all);

 for my $filename (@ARGV) {
     if (not -e $filename) {
         warn "$filename does not exist";
         next;
     }

     if ($filename !~ /\.huff$/) {
         warn "$filename does not end in .huff; will not process";
         next;
     }

     my $decompressed_name = $filename;
     $decompressed_name =~ s/\.huff$//;

     if (-e $decompressed_name) {
         warn "Will not overwrite $decompressed_name";
         next;
     }
     
     open (FILE, '<', $filename) or die $!;
     $/ = undef; 
     my ($code, $compressed);
     # BEWARE: using eval with arbitrary strings is dangerous!
     # Don't do this at home.
     { our $VAR1; eval <FILE>; ($code, $compressed) = @$VAR1; };
     close FILE;
     
     if ($@) {
         warn $@;
         next;
     }

     my $data = dehuff($compressed, $code);

     open(FILE, '>', $decompressed_name) or die $!;
     print FILE $data;
     close FILE;
 }


Simple, no? Of course, the programs above are inefficient with
large files, because they slurp the whole file into memory. Also,
there is only minimal error checking. A real-world compression and
decompression utility will need I<many> checks to guard against
bogus data, which have been omitted here for simplicity.


=head1 METHODS

=over

=item $frequencies = count_frequencies(@strings);

C<count_weights> counts the number of frequencies for each letter
appearing in @strings. These are returned in a hash where the keys
are the letters (once only, of course) and the values are their
frequencies.  A convenience function, really; the implementation
is simply a foreach (split//) loop.

Example:

 my $frequencies = count_frequencies('aab');
 use Data::Dumper;
 print Dumper($frequencies);
 
 # Output:
 #$VAR1 = {
 #          'a' => 2,
 #          'b' => 1
 #        };

=cut

sub count_frequencies {
    my (@strings) = @_;

    my $frequencies = ();
    for (@strings) {
        for (split//) {
            $frequencies->{$_}++;
        }
    }

    return $frequencies;
}


=item $huffman_code = generate_huff($weights);

C<generate_huff> generates the optimal Huffman code given $weights,
which is a hash whose keys are the alphabet and whose values are
the weights (of each letter in the alphabet). The weights are
supposed to be integers at least 1.

Example:

 my $code = generate_huff({
    'a' => 1,
    'b' => 2
 });
 use Data::Dumper;
 print Dumper($code);

 # Output:
 #$VAR1 = {
 #          '_eos' => '10',
 #          'a' => '11',
 #          'b' => '0'
 #        };


The function returns the Huffman code for the alphabet, including
a special symbol '_eos', which can be used in encoding and decoding
to denote the end of stream.

=cut

# O(n log n) implementation, straightforwardly from
# http://en.wikipedia.org/wiki/Huffman_coding.
sub generate_huff {
    my $weights = shift;

    my @weights;
    push @weights, { letter => '_eos', weight => 1, 0 => undef, 1 => u
+ndef};
    push @weights, map { 
            { letter => $_, weight => $weights->{$_}, 0 => undef, 1 =>
+ undef }
        } sort { $weights->{$a} <=> $weights->{$b} } keys %$weights;
    my @combined = ();

    local *min_dequeue = sub {
        return shift @weights if (not @combined);
        return shift @combined if (not @weights);

        # Using less than or equal here to break ties in favour
        # of the first queue.
        return shift @weights 
            if ($weights[0]->{weight} <= $combined[0]->{weight});
        return shift @combined;
    };

    while (@weights + @combined > 1) {
        my $first = min_dequeue();
        my $second = min_dequeue();
        push @combined, { 
            weight => $first->{weight} + $second->{weight},
            0 => $first,
            1 => $second,
        };
    }

    my $ret = ();
    local *preorder = sub {
        my ($node, $path) = @_;

        # Letters are always in the leaf nodes. (Otherwise an equivale
+nt
        # check would be to see if the node has children.)
        if (defined $node->{letter}) {
            $ret->{$node->{letter}} = $path;
            return;
        }
        preorder($node->{0}, $path . 0) if (defined $node->{0});
        preorder($node->{1}, $path . 1) if (defined $node->{1});
    };

    my $root = shift @combined;
    preorder($root, '');

    return $ret;
}


=item $encoded = huff($string, $huffman_code);

C<huff> encodes $string using the Huffman code in $huffman_code,
and returns the encoded bit string $huffman_code must be a hash,
and each character in $string must exist as a key in the hash;
in other words, each character in $string must have an encoding.

Example:

 my $enc = huff('aaba', { 'a' => '11', 'b' => '0', '_eos' => '10' });
 print unpack('b*', $enc);
 # Output: 1111011100000000
 # 

(unpack() appends extra zeroes at the end in order to
align the output bits to an octet boundary. This is OK, and
this is the reason we have the _eos marker: if the encoded
bit string is written to a file, it will be aligned similarly
to an octet boundary. C<dehuff> will know when to stop
decoding when it encounters _eos.)

The values in $huffman_code must be strings. The hash
must also contain the key '_eos', whose value indicates
the end of stream. C<huff> will append this at the
end of the encoded bit string.

=cut

sub huff {
    my $string = shift;
    my $codes = shift;

    my $ret = '';
    for (split//, $string) {
        $ret .= $codes->{$_};
    }
    $ret .= $codes->{_eos};

    return pack('b*', $ret);
}


=item $decoded = dehuff($string, $huffman_code);

C<dehuff> decodes a Huffman-encoded string $string using
the Huffman code in $huffman_code. The restrictions of
$huffman_code is the same as with C<huff>. $string must
be a bit string.

Example:

 my $code = { 'a' => '11', 'b' => '0', '_eos' => '10'};

 print dehuff(huff('aaba', $code), $code), "\n";
 # Output: aaba

=cut

sub dehuff { 
    my $string = shift;
    my $code = shift;
    my %decode = map { $code->{$_} => $_ } keys %$code;

    my $ret = ''; my $c = '';
    for (split//, unpack('b*', $string)) {
        $c .= $_;
        next unless (exists $decode{$c});
        last if ($decode{$c} eq '_eos');

        $ret .= $decode{$c};
        $c = '';
    }

    return $ret;
}


=back

=head1 SEE ALSO

A great tutorial into Huffman coding can be found at
L<http://www.cs.duke.edu/csed/poop/huff/info/>.

CPAN also has a module called L<Algorithm::Huffman> by Janek
Schleicher, which apparently works very much like this module. (The
author discovered its existance only after writing the module.)

=head1 AUTHOR

Ville R. Koskinen, E<lt>w-ber@iki.fiE<gt>

=head1 COPYRIGHT

I, the author of this work, hereby release it into the public domain. 

In other words, do whatever you wish to do with this piece of code.

=cut


1;
Replies are listed 'Best First'.
Re: Huffman coding in pure Perl
by Anno (Deacon) on Mar 04, 2007 at 19:20 UTC
    The main inefficiency is that bit operations are not really bit operations; rather, the module uses ASCII strings of 0s and 1s, and only packs it into a real bit string after the encoded string has been constructed. ...

    On the encoding side, it is only slightly harder to build the bit vector directly, using vec() instead of pack 'b'.

    sub huff1 { my ( $string, $codes) = @_; my ( $ret, $len) = ( '', 0); # $len counts bits for ( split( //, $string), '_eos' ) { vec( $ret, $len ++, 1) = $_ for split //, $codes->{ $_}; } return $ret; }
    The decoding side is harder.

    Update: No, it isn't. Dehuff is as easily adapted to vec().

    sub dehuff1 { my ( $string, $code) = @_; my %decode = reverse %$code; my ( $ret, $c, $where) = ( '', '', 0); while ( 1 ) { $c .= vec( $string, $where ++, 1); next unless exists $decode{ $c}; last if $decode{ $c} eq '_eos'; $ret .= $decode{ $c}; $c = ''; } return $ret; }

    Anno

      I know. I originally had vec in mind, but doing the decoding that way felt like too much of a drag, and I don't really need this module. The current implementation is arguably easier to understand, which was what I wanted to do: understand the algorithm.

      --
      print "Just Another Perl Adept\n";

Re: Huffman coding in pure Perl
by geekphilosopher (Friar) on Mar 04, 2007 at 17:26 UTC

      Yes, thank you. That was a silly typo. Fixed now.

      --
      print "Just Another Perl Adept\n";

        Why is that line even there at all? The meaning of that line is "This is guaranteed to not work in versions prior to 5.008". Why doesn't it work in those older versions? It certainly compiles fine (once that line is removed).

        So it appears that the purpose of that line may simply be to guarantee that the module won't work for versions prior to 5.008 ( do you dislike old Perl versions? :). You should at least comment the line so people know why you know your module is incompatible with pre-5.008 Perls. Or don't include such lines if you don't actually know that your module "requires 5.008".

        - tye        

Re: Huffman coding in pure Perl
by gone2015 (Deacon) on Nov 14, 2008 at 19:08 UTC

    I have learned something today. Thank you.

    All the Huffman coders I've seen (to date) have used a priority queue while constructing the tree -- I, and many others, favour a heap, given that only partial sorting is required.

    Your code uses two queues, one for the original character (leaf) nodes and another for the internal (combined) nodes. During the tree construction, the min_dequeue subroutine has to make choice between the two queues. I stared at this for a while, and couldn't help feeling it was unnecessarily complicated. Then I noticed that new combined nodes are just pushed onto the @combined list, without any further sorting. This looked like a bug to me.... However, I have more or less convinced myself that this is more subtle than it looks, and the procedure ensures that the @combined list remains in order.

    I imagine that the correctness can be proved by induction.

    If your code is for pedalogical purposes, it would be good to explain the subtlety and provide the proof (which I'd love to see, BTW !).

    The preference for taking items from the leaves (@weights) list is also subtle -- it minimises tree height.

    Your code fully sorts leaf weights once, at the cost of the overhead in min_dequeue. A priority queue implementation doesn't have to fully sort, but does have to adjust the queue each time two nodes are combined, but then the choice of nodes to combine is trivial. I wonder what the trade off is -- which might be interesting from a pedalogical perspective. Of course, for a Perl implementation without a heap to hand....

    Allocating the codes by walking the tree is a straightforward application of the technique. The fact that this produces a unique prefix code never ceases to amaze -- when you look at them, the codes can appear entirely arbitrary.

    For extra credit (as they say) is the problem of how to represent a given encoding, particularly when sending it in-band with the compressed data -- when you want to minimise the overhead. While not being high-tone-genuine-computer-science-shite, it is of pratical concern !

    Your code is relying heavily on hashes, which makes perfect sense in Perl. For Huffman encode, you don't have much choice, you need some sort of table to map from symbol to encoding. For Huffman decode, you can reconstruct the binary tree and walk that, but that's not cheap... so what does one do ?

    When I last did Huffman, I decoupled the code allocation from the tree walk: when walking the tree I allocated only a code length to each symbol. Having done that, I allocated "huffman ordinals" and code values to the symbols, starting with 0 for lowest numbered symbol at the shortest length. For example:

      Symbol : Weight : Length : huff ord : Code
      -------:--------:--------:----------:------
        'l'  :   55   :   2    :     0    : 00
        'o'  :   31   :   2    :     1    : 01
        '!'  :   13   :   3    :     2    : 100
        'H'  :   14   :   3    :     3    : 101
        ','  :   10   :   4    :     4    : 1100
        'd'  :    8   :   4    :     5    : 1101
        'r'  :   12   :   4    :     6    : 1110
        '\0' :    1   :   6    :     7    : 111100
        ' '  :    1   :   6    :     8    : 111101
        'e'  :    1   :   6    :     9    : 111110
        'w'  :    2   :   6    :    10    : 111111
    
    This encoding can then be represented as three arrays:
      1. the symbol lengths:               [ 2  3  4  6 ]
      2. number of symbols at each length: [ 2  2  3  4 ]
      3. the symbols by "huffman ordinal"  [ 'l' 'o' '!' 'H'  ',' 'd' 'r' '\0' ' ' 'e' 'w']
    
    which can be used to drive the decoder, directly -- no other table or index or anything else is required.

    Anyway, and FWIW, I enclose the guts of a Huffman module which uses a priority queue and allocates codes as above. Any and all comments gratefully received, noting that I'm not pretending it's shippable in it's current state !

    Smacking the bits around is obviously not something that Perl excels at, so this is definitely written with a view to translating the encode/decode to, say, 'C'.

    Before you conclude that this code is actually translated from 'C' to Perl in the first place, let me just say that the first time I tangled with this stuff, it was in Z80 assembler :-). (When men were men, and sheep were nervous.)

    Finally, if the tree construction does not take steps to minimise the tree height, the following can happen:

             :        :      w-ber       ::     oshalla
      Symbol : Weight : Length :  Code   :: Length :  Code
      -------:--------:--------:---------::------------------
       ' '   :    8   :    3   :  111    ::    2   :  00
       'o'   :    5   :    3   :  101    ::    3   :  010
       'a'   :    2   :    4   :  0001   ::    4   :  0110
       'd'   :    3   :    4   :  1100   ::    4   :  0111
       'e'   :    2   :    4   :  0100   ::    4   :  1000
       'l'   :    3   :    4   :  1101   ::    4   :  1001
       'r'   :    2   :    4   :  0010   ::    4   :  1010
       'w'   :    2   :    4   :  0000   ::    4   :  1011
       '\0'  :    1   :    5   :  01010  ::    5   :  11000
       '!'   :    1   :    5   :  01111  ::    5   :  11001
       ','   :    1   :    5   :  01011  ::    5   :  11010
       '?'   :    1   :    5   :  10010  ::    5   :  11011
       'y'   :    2   :    4   :  0011   ::    5   :  11100
       'A'   :    1   :    5   :  10000  ::    6   :  111010
       'H'   :    1   :    5   :  10011  ::    6   :  111011
       'h'   :    1   :    5   :  01101  ::    6   :  111100
       'n'   :    1   :    5   :  10001  ::    6   :  111101
       't'   :    1   :    5   :  01110  ::    6   :  111110
       'u'   :    1   :    5   :  01100  ::    6   :  111111
    
    as I discovered in testing. Both encodings gave exactly the same length encoded strings -- so they are both "optimal", just different.

    Anyway, thank you for a wonderful time. I really must get home, now.

      Where is the final result of the code where to find it after dhe compile finish the work ?

Re: Huffman coding in pure Perl
by Anonymous Monk on Nov 14, 2008 at 07:32 UTC
    Hi! This variant of count_frequencies works much faster than original:
    sub count_frequencies { my (@strings) = @_; my $frequencies = (); for (@strings) { foreach my $i (0..length($_)-1) { $frequencies->{substr($_, $i,1)}++; } } return $frequencies; }
Re: Huffman coding in pure Perl
by Sixtease (Friar) on Mar 11, 2007 at 23:48 UTC
    use warnings in modules can cause troubles. Or at least this thread seems to suggest so.
Re: Huffman coding in pure Perl
by ysth (Canon) on Mar 12, 2007 at 23:58 UTC

      Indeed. I am inclined to say that mine is easier to grok, though. Red Perl part I has quite a few sharp edges that can severly hurt your digestive system.

      --
      print "Just Another Perl Adept\n";

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: sourcecode [id://603111]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others taking refuge in the Monastery: (3)
As of 2024-04-19 22:26 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found