Re: Huffman coding in pure Perl

I have learned something today. Thank you.

All the Huffman coders I've seen (to date) have used a priority queue while constructing the tree -- I, and many others, favour a heap, given that only partial sorting is required.

Your code uses two queues, one for the original character (leaf) nodes and another for the internal (combined) nodes. During the tree construction, the min_dequeue subroutine has to make choice between the two queues. I stared at this for a while, and couldn't help feeling it was unnecessarily complicated. Then I noticed that new combined nodes are just pushed onto the @combined list, without any further sorting. This looked like a bug to me.... However, I have more or less convinced myself that this is more subtle than it looks, and the procedure ensures that the @combined list remains in order.

I imagine that the correctness can be proved by induction.

If your code is for pedalogical purposes, it would be good to explain the subtlety and provide the proof (which I'd love to see, BTW !).

The preference for taking items from the leaves (@weights) list is also subtle -- it minimises tree height.

Your code fully sorts leaf weights once, at the cost of the overhead in min_dequeue. A priority queue implementation doesn't have to fully sort, but does have to adjust the queue each time two nodes are combined, but then the choice of nodes to combine is trivial. I wonder what the trade off is -- which might be interesting from a pedalogical perspective. Of course, for a Perl implementation without a heap to hand....

Allocating the codes by walking the tree is a straightforward application of the technique. The fact that this produces a unique prefix code never ceases to amaze -- when you look at them, the codes can appear entirely arbitrary.

For extra credit (as they say) is the problem of how to represent a given encoding, particularly when sending it in-band with the compressed data -- when you want to minimise the overhead. While not being high-tone-genuine-computer-science-shite, it is of pratical concern !

Your code is relying heavily on hashes, which makes perfect sense in Perl. For Huffman encode, you don't have much choice, you need some sort of table to map from symbol to encoding. For Huffman decode, you can reconstruct the binary tree and walk that, but that's not cheap... so what does one do ?

When I last did Huffman, I decoupled the code allocation from the tree walk: when walking the tree I allocated only a code length to each symbol. Having done that, I allocated "huffman ordinals" and code values to the symbols, starting with 0 for lowest numbered symbol at the shortest length. For example:

  Symbol : Weight : Length : huff ord : Code
  -------:--------:--------:----------:------
    'l'  :   55   :   2    :     0    : 00
    'o'  :   31   :   2    :     1    : 01
    '!'  :   13   :   3    :     2    : 100
    'H'  :   14   :   3    :     3    : 101
    ','  :   10   :   4    :     4    : 1100
    'd'  :    8   :   4    :     5    : 1101
    'r'  :   12   :   4    :     6    : 1110
    '\0' :    1   :   6    :     7    : 111100
    ' '  :    1   :   6    :     8    : 111101
    'e'  :    1   :   6    :     9    : 111110
    'w'  :    2   :   6    :    10    : 111111

This encoding can then be represented as three arrays:

  1. the symbol lengths:               [ 2  3  4  6 ]
  2. number of symbols at each length: [ 2  2  3  4 ]
  3. the symbols by "huffman ordinal"  [ 'l' 'o' '!' 'H'  ',' 'd' 'r' '\0' ' ' 'e' 'w']

which can be used to drive the decoder, directly -- no other table or index or anything else is required.

Anyway, and FWIW, I enclose the guts of a Huffman module which uses a priority queue and allocates codes as above. Any and all comments gratefully received, noting that I'm not pretending it's shippable in it's current state !

Smacking the bits around is obviously not something that Perl excels at, so this is definitely written with a view to translating the encode/decode to, say, 'C'.

Before you conclude that this code is actually translated from 'C' to Perl in the first place, let me just say that the first time I tangled with this stuff, it was in Z80 assembler :-). (When men were men, and sheep were nervous.)

Finally, if the tree construction does not take steps to minimise the tree height, the following can happen:

         :        :      w-ber       ::     oshalla
  Symbol : Weight : Length :  Code   :: Length :  Code
  -------:--------:--------:---------::------------------
   ' '   :    8   :    3   :  111    ::    2   :  00
   'o'   :    5   :    3   :  101    ::    3   :  010
   'a'   :    2   :    4   :  0001   ::    4   :  0110
   'd'   :    3   :    4   :  1100   ::    4   :  0111
   'e'   :    2   :    4   :  0100   ::    4   :  1000
   'l'   :    3   :    4   :  1101   ::    4   :  1001
   'r'   :    2   :    4   :  0010   ::    4   :  1010
   'w'   :    2   :    4   :  0000   ::    4   :  1011
   '\0'  :    1   :    5   :  01010  ::    5   :  11000
   '!'   :    1   :    5   :  01111  ::    5   :  11001
   ','   :    1   :    5   :  01011  ::    5   :  11010
   '?'   :    1   :    5   :  10010  ::    5   :  11011
   'y'   :    2   :    4   :  0011   ::    5   :  11100
   'A'   :    1   :    5   :  10000  ::    6   :  111010
   'H'   :    1   :    5   :  10011  ::    6   :  111011
   'h'   :    1   :    5   :  01101  ::    6   :  111100
   'n'   :    1   :    5   :  10001  ::    6   :  111101
   't'   :    1   :    5   :  01110  ::    6   :  111110
   'u'   :    1   :    5   :  01100  ::    6   :  111111

as I discovered in testing. Both encodings gave exactly the same length encoded strings -- so they are both "optimal", just different.

Anyway, thank you for a wonderful time. I really must get home, now.

package Huffman_GMCH ;

use strict ;
use warnings ;

#=====================================================================
+====================
# huffman -- take array of symbol counts and return an encoding
#
# The array of symbol counts is assumed to be indexed by symbol ordina
+l.  Any symbol
# with a zero count will have no representation in the resulting code.
#
# The encoding is represented as an array of arrays: [ \@lengths, \@co
+unts, \@symbols ]
#
#   where: @lengths is a list of symbol lengths used, in ascending ord
+er
#
#          @counts  is a list of the number of symbols encoded at the 
+corresponding length
#
#          @symbols is a list of the symbol values, sorted first by en
+coded length, 
#                                                     and second by sy
+mbol ordinal.
#
# It's not strictly necessary to sort by symbol ordinal, but this look
+s tidy, and (more
# importantly) it can be used to advantage when encoding the encoding.
#
# Requires: $r_counts   -- ref array of symbol counts
#
# Returns:  $r_code     -- [ \@lengths, \@counts, \@symbols ] -- see a
+bove

sub huffman {
  my ($r_counts) = @_ ;

  # Fill the @tree list with leaf nodes, ignoring symbols with zero co
+unts

  my @tree = () ;

  for (my $s = 0 ; $s <= $#{$r_counts} ; $s++) {
    if ($r_counts->[$s]) { push @tree, [$r_counts->[$s], $s] } ;
  } ;

  # Build the tree but successively combining the two lowest weight no
+des.
  #
  # We sort the node list each time around the loop.  This is a poor m
+an's priority
  # queue.  With luck the sort is fast where the list is mostly ordere
+d.
  #
  # NB: this sorts into ascending order of weigth (hence $b->[0] <=> $
+a->[0]).
  #
  # NB: where weights are equal, we treat an internal node as slightly
+ heavier, which
  #     has the effect of minimising tree height.  The extension to th
+e comparison will
  #     leave two internal nodes in the current order.  Two leaf nodes
+ will be sorted
  #     so that the *smaller* ordinal has the *greater* weight (hence 
+the reversed
  #     order in the final '<=>').

  while ($#tree) {
    @tree = sort { ($b->[0] <=> $a->[0])  || (  ref($b->[1]) ? +1
                                              : ref($a->[1]) ? -1 
                                              : $a->[1] <=> $b->[1] )
                 } @tree ;

    my $r = pop @tree ;
    my $l = pop @tree ;

    push @tree, [$l->[0] + $r->[0], $l, $r] ;
  } ;

  # Establish the encoded length of the symbols

  my @symbol_lengths = () ;

  _huff_walk(pop @tree, 0, \@symbol_lengths) ;

  # Now generate and return the encoding

  my @code = () ;
  for my $s (0..$#symbol_lengths) {
    if (my $l = $symbol_lengths[$s]) { push @{$code[$l]}, $s ; } ;
  } ;

  my @lengths = () ;
  my @counts  = () ;
  my @symbols = () ;

  for my $l (1..$#code) {
    if (my $ss = $code[$l]) {
      push @lengths, $l ;
      push @counts,  scalar(@{$ss}) ;
      push @symbols, @{$ss} ;
    } ;
  } ;

  return [\@lengths, \@counts, \@symbols] ;
} ;

sub _huff_walk {
  my ($node, $length, $r_lengths) = @_ ;

  my ($p, $l, $r) = @$node ;

  if (ref($l)) {
    $length++ ;
    _huff_walk($l, $length, $r_lengths) ;
    _huff_walk($r, $length, $r_lengths) ;
  }
  else {
    $r_lengths->[$l] = $length ;
  } ;

  return ;
} ;

#=====================================================================
+====================
# huffman_encode_map -- construct encode map from given encoding
#
# The encode map is an array indexed by symbol ordinal giving its Huff
+man encoding.
#
# Symbols with no encoding will map to undef (of course).
#
# The huffman encoding is represented as an integer, with the LS bit b
+eing the first bit
# to transmit.  Above the MS bit is a "guard bit".  This particular re
+presentation allows
# an output loop of the form:
#
#   while ($v != 1) { vec($m, $p++, 1) = $v & 1 ; $v =>> 1 ; } ; 
#
# Requires: $code       -- [ \@lengths, \@counts, \@symbols ] -- as re
+turned by huffman()
#
# Returns:  $r_emap     -- ref:Array as described above

sub huffman_encode_map {
  my ($code) = @_ ;

  my ($r_lengths, $r_counts, $r_symbols) = @{$code} ;

  my @emap = () ;

  my $hs = 0 ;                  # Huffman symbol value
  my $hl = 0 ;                  # Current huffman symbol length
  my $hc = 0 ;                  # Count of symbols remaining at curren
+t length

  my $li = 0 ;                  # Index of next symbol length to use

  foreach my $s (@{$r_symbols}) {
    if ($hc == 0) {             # If exhausted current symbol length, 
+advance to next
      my $l = $hl ;
      $hl = $r_lengths->[$li] ;
      $hc = $r_counts->[$li] ;
      $hs <<= ($hl - $l) ;      # Increased symbol length
      $li++ ;                   # Index of next length to use
    } ;

    my $e = 1 ;                 # Start map entry with "guard bit"
    my $v = $hs ;               # Copy of of huffman symbol, to be pro
+cessed LS bit first
    for (1..$hl) {
      $e <<= 1 ;
      $e += $v & 1 ;
      $v >>= 1 ;
    } ;

    $emap[$s] = $e ;            # Set map entry

    $hs++ ;                     # Next huffman symbol
    $hc-- ;                     # Count down symbols ar current length
  } ;

  return \@emap ;
} ;

#=====================================================================
+====================
# huffman_enc -- encode given string into given bit-vector according t
+o given encoding map
#
# Bit-Vector is updated, in place.
#
# Requires: $r_emap     -- encoding map as generated by huffman_encode
+_map()
#           $vec        -- vector to encode to   *** updated in place 
+***
#           $p          -- next bit of vector
#           $string     -- string to encode
#
# Returns:  new next bit of vector

sub huffman_enc {
  my ($r_emap, undef, $p, $string) = @_ ;

  my $v ;
  foreach my $ch (split //, $string) {
    if (defined($v = $r_emap->[ord($ch)])) {
      while ($v != 1) { vec($_[1], $p++, 1) = $v & 1 ; $v >>= 1 ; } ;
    }
    else {
      die "No encoding for character '$ch'" ;
    } ;
  } ;

  return $p ;
} ;

#=====================================================================
+====================
# huffman_dec -- decode one symbol from given bit-vector, using given 
+encoding
#
# The bit pointer is updated, in place.
#
# Requires: $code       -- huffman encoding as returned huffman()
#           $vec        -- vector to decode
#           $p          -- next bit of vector    *** updated in place 
+***
#
# Returns:  next symbol ordinal

sub huffman_dec {
  my ($code, $vec, $p) = @_ ;

  my ($r_lengths, $r_counts, $r_symbols) = @{$code} ;

  my $hs = 0 ;                  # Huffman symbol
  my $hl = 0 ;                  # Current huffman symbol length
  my $hc = 0 ;                  # Count of symbols at current length

  my $li = 0 ;                  # Next length index

  my $v  = 0 ;                  # What we've read so far

  do {
    $hs += $hc ;                # Step to first symbol at next length

    my $l = $hl ;
    $hl = $r_lengths->[$li] ;   # Next length

    $v -= $hc ;                 # Strip prefix for current length
    while ($l < $hl) {          # Fetch bits to next length
      $v <<= 1 ;
      $v += vec($vec, $p++, 1) ;
      $l++ ;
    } ;

    $hc = $r_counts->[$li] ;    # Count of symbols at new length

    $li++ ;                     # Index of next length to use
  } until ($v < $hc) ;

  $_[2] = $p ;                  # Update pointer -- in place

  return $r_symbols->[$hs+$v] ;
} ;

1 ;
[download]

Comment on Re: Huffman coding in pure Perl Select or Download Code


Perl: the Markov chain saw
	PerlMonks