Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl: the Markov chain saw
 
PerlMonks  

Re: Huffman coding in pure Perl

by gone2015 (Deacon)
on Nov 14, 2008 at 19:08 UTC ( [id://723710]=note: print w/replies, xml ) Need Help??


in reply to Huffman coding in pure Perl

I have learned something today. Thank you.

All the Huffman coders I've seen (to date) have used a priority queue while constructing the tree -- I, and many others, favour a heap, given that only partial sorting is required.

Your code uses two queues, one for the original character (leaf) nodes and another for the internal (combined) nodes. During the tree construction, the min_dequeue subroutine has to make choice between the two queues. I stared at this for a while, and couldn't help feeling it was unnecessarily complicated. Then I noticed that new combined nodes are just pushed onto the @combined list, without any further sorting. This looked like a bug to me.... However, I have more or less convinced myself that this is more subtle than it looks, and the procedure ensures that the @combined list remains in order.

I imagine that the correctness can be proved by induction.

If your code is for pedalogical purposes, it would be good to explain the subtlety and provide the proof (which I'd love to see, BTW !).

The preference for taking items from the leaves (@weights) list is also subtle -- it minimises tree height.

Your code fully sorts leaf weights once, at the cost of the overhead in min_dequeue. A priority queue implementation doesn't have to fully sort, but does have to adjust the queue each time two nodes are combined, but then the choice of nodes to combine is trivial. I wonder what the trade off is -- which might be interesting from a pedalogical perspective. Of course, for a Perl implementation without a heap to hand....

Allocating the codes by walking the tree is a straightforward application of the technique. The fact that this produces a unique prefix code never ceases to amaze -- when you look at them, the codes can appear entirely arbitrary.

For extra credit (as they say) is the problem of how to represent a given encoding, particularly when sending it in-band with the compressed data -- when you want to minimise the overhead. While not being high-tone-genuine-computer-science-shite, it is of pratical concern !

Your code is relying heavily on hashes, which makes perfect sense in Perl. For Huffman encode, you don't have much choice, you need some sort of table to map from symbol to encoding. For Huffman decode, you can reconstruct the binary tree and walk that, but that's not cheap... so what does one do ?

When I last did Huffman, I decoupled the code allocation from the tree walk: when walking the tree I allocated only a code length to each symbol. Having done that, I allocated "huffman ordinals" and code values to the symbols, starting with 0 for lowest numbered symbol at the shortest length. For example:

  Symbol : Weight : Length : huff ord : Code
  -------:--------:--------:----------:------
    'l'  :   55   :   2    :     0    : 00
    'o'  :   31   :   2    :     1    : 01
    '!'  :   13   :   3    :     2    : 100
    'H'  :   14   :   3    :     3    : 101
    ','  :   10   :   4    :     4    : 1100
    'd'  :    8   :   4    :     5    : 1101
    'r'  :   12   :   4    :     6    : 1110
    '\0' :    1   :   6    :     7    : 111100
    ' '  :    1   :   6    :     8    : 111101
    'e'  :    1   :   6    :     9    : 111110
    'w'  :    2   :   6    :    10    : 111111
This encoding can then be represented as three arrays:
  1. the symbol lengths:               [ 2  3  4  6 ]
  2. number of symbols at each length: [ 2  2  3  4 ]
  3. the symbols by "huffman ordinal"  [ 'l' 'o' '!' 'H'  ',' 'd' 'r' '\0' ' ' 'e' 'w']
which can be used to drive the decoder, directly -- no other table or index or anything else is required.

Anyway, and FWIW, I enclose the guts of a Huffman module which uses a priority queue and allocates codes as above. Any and all comments gratefully received, noting that I'm not pretending it's shippable in it's current state !

Smacking the bits around is obviously not something that Perl excels at, so this is definitely written with a view to translating the encode/decode to, say, 'C'.

Before you conclude that this code is actually translated from 'C' to Perl in the first place, let me just say that the first time I tangled with this stuff, it was in Z80 assembler :-). (When men were men, and sheep were nervous.)

Finally, if the tree construction does not take steps to minimise the tree height, the following can happen:

         :        :      w-ber       ::     oshalla
  Symbol : Weight : Length :  Code   :: Length :  Code
  -------:--------:--------:---------::------------------
   ' '   :    8   :    3   :  111    ::    2   :  00
   'o'   :    5   :    3   :  101    ::    3   :  010
   'a'   :    2   :    4   :  0001   ::    4   :  0110
   'd'   :    3   :    4   :  1100   ::    4   :  0111
   'e'   :    2   :    4   :  0100   ::    4   :  1000
   'l'   :    3   :    4   :  1101   ::    4   :  1001
   'r'   :    2   :    4   :  0010   ::    4   :  1010
   'w'   :    2   :    4   :  0000   ::    4   :  1011
   '\0'  :    1   :    5   :  01010  ::    5   :  11000
   '!'   :    1   :    5   :  01111  ::    5   :  11001
   ','   :    1   :    5   :  01011  ::    5   :  11010
   '?'   :    1   :    5   :  10010  ::    5   :  11011
   'y'   :    2   :    4   :  0011   ::    5   :  11100
   'A'   :    1   :    5   :  10000  ::    6   :  111010
   'H'   :    1   :    5   :  10011  ::    6   :  111011
   'h'   :    1   :    5   :  01101  ::    6   :  111100
   'n'   :    1   :    5   :  10001  ::    6   :  111101
   't'   :    1   :    5   :  01110  ::    6   :  111110
   'u'   :    1   :    5   :  01100  ::    6   :  111111
as I discovered in testing. Both encodings gave exactly the same length encoded strings -- so they are both "optimal", just different.

Anyway, thank you for a wonderful time. I really must get home, now.

package Huffman_GMCH ; use strict ; use warnings ; #===================================================================== +==================== # huffman -- take array of symbol counts and return an encoding # # The array of symbol counts is assumed to be indexed by symbol ordina +l. Any symbol # with a zero count will have no representation in the resulting code. # # The encoding is represented as an array of arrays: [ \@lengths, \@co +unts, \@symbols ] # # where: @lengths is a list of symbol lengths used, in ascending ord +er # # @counts is a list of the number of symbols encoded at the +corresponding length # # @symbols is a list of the symbol values, sorted first by en +coded length, # and second by sy +mbol ordinal. # # It's not strictly necessary to sort by symbol ordinal, but this look +s tidy, and (more # importantly) it can be used to advantage when encoding the encoding. # # Requires: $r_counts -- ref array of symbol counts # # Returns: $r_code -- [ \@lengths, \@counts, \@symbols ] -- see a +bove sub huffman { my ($r_counts) = @_ ; # Fill the @tree list with leaf nodes, ignoring symbols with zero co +unts my @tree = () ; for (my $s = 0 ; $s <= $#{$r_counts} ; $s++) { if ($r_counts->[$s]) { push @tree, [$r_counts->[$s], $s] } ; } ; # Build the tree but successively combining the two lowest weight no +des. # # We sort the node list each time around the loop. This is a poor m +an's priority # queue. With luck the sort is fast where the list is mostly ordere +d. # # NB: this sorts into ascending order of weigth (hence $b->[0] <=> $ +a->[0]). # # NB: where weights are equal, we treat an internal node as slightly + heavier, which # has the effect of minimising tree height. The extension to th +e comparison will # leave two internal nodes in the current order. Two leaf nodes + will be sorted # so that the *smaller* ordinal has the *greater* weight (hence +the reversed # order in the final '<=>'). while ($#tree) { @tree = sort { ($b->[0] <=> $a->[0]) || ( ref($b->[1]) ? +1 : ref($a->[1]) ? -1 : $a->[1] <=> $b->[1] ) } @tree ; my $r = pop @tree ; my $l = pop @tree ; push @tree, [$l->[0] + $r->[0], $l, $r] ; } ; # Establish the encoded length of the symbols my @symbol_lengths = () ; _huff_walk(pop @tree, 0, \@symbol_lengths) ; # Now generate and return the encoding my @code = () ; for my $s (0..$#symbol_lengths) { if (my $l = $symbol_lengths[$s]) { push @{$code[$l]}, $s ; } ; } ; my @lengths = () ; my @counts = () ; my @symbols = () ; for my $l (1..$#code) { if (my $ss = $code[$l]) { push @lengths, $l ; push @counts, scalar(@{$ss}) ; push @symbols, @{$ss} ; } ; } ; return [\@lengths, \@counts, \@symbols] ; } ; sub _huff_walk { my ($node, $length, $r_lengths) = @_ ; my ($p, $l, $r) = @$node ; if (ref($l)) { $length++ ; _huff_walk($l, $length, $r_lengths) ; _huff_walk($r, $length, $r_lengths) ; } else { $r_lengths->[$l] = $length ; } ; return ; } ; #===================================================================== +==================== # huffman_encode_map -- construct encode map from given encoding # # The encode map is an array indexed by symbol ordinal giving its Huff +man encoding. # # Symbols with no encoding will map to undef (of course). # # The huffman encoding is represented as an integer, with the LS bit b +eing the first bit # to transmit. Above the MS bit is a "guard bit". This particular re +presentation allows # an output loop of the form: # # while ($v != 1) { vec($m, $p++, 1) = $v & 1 ; $v =>> 1 ; } ; # # Requires: $code -- [ \@lengths, \@counts, \@symbols ] -- as re +turned by huffman() # # Returns: $r_emap -- ref:Array as described above sub huffman_encode_map { my ($code) = @_ ; my ($r_lengths, $r_counts, $r_symbols) = @{$code} ; my @emap = () ; my $hs = 0 ; # Huffman symbol value my $hl = 0 ; # Current huffman symbol length my $hc = 0 ; # Count of symbols remaining at curren +t length my $li = 0 ; # Index of next symbol length to use foreach my $s (@{$r_symbols}) { if ($hc == 0) { # If exhausted current symbol length, +advance to next my $l = $hl ; $hl = $r_lengths->[$li] ; $hc = $r_counts->[$li] ; $hs <<= ($hl - $l) ; # Increased symbol length $li++ ; # Index of next length to use } ; my $e = 1 ; # Start map entry with "guard bit" my $v = $hs ; # Copy of of huffman symbol, to be pro +cessed LS bit first for (1..$hl) { $e <<= 1 ; $e += $v & 1 ; $v >>= 1 ; } ; $emap[$s] = $e ; # Set map entry $hs++ ; # Next huffman symbol $hc-- ; # Count down symbols ar current length } ; return \@emap ; } ; #===================================================================== +==================== # huffman_enc -- encode given string into given bit-vector according t +o given encoding map # # Bit-Vector is updated, in place. # # Requires: $r_emap -- encoding map as generated by huffman_encode +_map() # $vec -- vector to encode to *** updated in place +*** # $p -- next bit of vector # $string -- string to encode # # Returns: new next bit of vector sub huffman_enc { my ($r_emap, undef, $p, $string) = @_ ; my $v ; foreach my $ch (split //, $string) { if (defined($v = $r_emap->[ord($ch)])) { while ($v != 1) { vec($_[1], $p++, 1) = $v & 1 ; $v >>= 1 ; } ; } else { die "No encoding for character '$ch'" ; } ; } ; return $p ; } ; #===================================================================== +==================== # huffman_dec -- decode one symbol from given bit-vector, using given +encoding # # The bit pointer is updated, in place. # # Requires: $code -- huffman encoding as returned huffman() # $vec -- vector to decode # $p -- next bit of vector *** updated in place +*** # # Returns: next symbol ordinal sub huffman_dec { my ($code, $vec, $p) = @_ ; my ($r_lengths, $r_counts, $r_symbols) = @{$code} ; my $hs = 0 ; # Huffman symbol my $hl = 0 ; # Current huffman symbol length my $hc = 0 ; # Count of symbols at current length my $li = 0 ; # Next length index my $v = 0 ; # What we've read so far do { $hs += $hc ; # Step to first symbol at next length my $l = $hl ; $hl = $r_lengths->[$li] ; # Next length $v -= $hc ; # Strip prefix for current length while ($l < $hl) { # Fetch bits to next length $v <<= 1 ; $v += vec($vec, $p++, 1) ; $l++ ; } ; $hc = $r_counts->[$li] ; # Count of symbols at new length $li++ ; # Index of next length to use } until ($v < $hc) ; $_[2] = $p ; # Update pointer -- in place return $r_symbols->[$hs+$v] ; } ; 1 ;

Replies are listed 'Best First'.
Re^2: Huffman coding in pure Perl
by Anonymous Monk on Dec 27, 2012 at 22:58 UTC

    Where is the final result of the code where to find it after dhe compile finish the work ?

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://723710]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others exploiting the Monastery: (5)
As of 2024-04-16 09:14 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found