http://qs321.pair.com?node_id=59720


in reply to Huffman encoder

This obfuscation is submitted in the name of the Paris.pm obfuscation channel.

This program performs Huffman compression on its STDIN or on the files given on the command line. One drawback is that I was too lazy to write the code to output the compressed data of different files to different output files...

HOW DOES IT WORK?

First, $/ and $" are undefined, so that the first use of the diamond operator slurps the whole file, and so that interpolating arrays (and array slices) in double-quoted strings doesn't insert unwanted spaces. map is used instead of for because I like returning lists in a void context. Furthermore, it saves a few ()'s.

Stéphane Payrard (of OPC3 fame) taught me that @0 is a perfectly valid array. That's a good enough reason to use it. The same goes for %0.

Thanks to B::Deparse, I learned a few tricks, like writing $a->[1] as $$a[1]... (Take this as proof that one can learn something through obfuscation.) Naturally, I doubt that B::Deparse helped you much, since on this program it spits out:

Can't call method "sibling" on an undefined value at /usr/lib/perl5/5.6.0/i386-linux/B/Deparse.pm line 257.
CHECK failed--call queue aborted.

I must admit I have no idea how that is possible. But hey, that's one more hurdle on your way to understanding this program.

So, %0 holds the count for each character (the character is the key, the count is the value). From it, we create an array containing one array per character, holding: one subtree of the Huffman tree (at first, just a leaf, i.e. the character itself), its weight, and a string listing all the leaves of that subtree.
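The original is obfuscated Perl, so here is a hedged Python sketch of that step (the function name make_leaves is mine, not the program's):

```python
from collections import Counter

def make_leaves(data):
    """Count each character (the role of %0), then build one
    (subtree, weight, leaf-string) triple per distinct character
    (the role of @0).  A subtree starts life as a bare character."""
    counts = Counter(data)
    # sorted() just makes the order deterministic for this sketch
    return [(ch, n, ch) for ch, n in sorted(counts.items())]

leaves = make_leaves("A" * 27 + "B" * 15 + "C" * 3 + "DEFG")
```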

THE HUFFMAN TREE

The Huffman algorithm pairs the subtrees by weight: as long as more than one subtree remains, it merges the two lightest into one. I chose to put the lightest branches on the left (bit 0).

            49           For a file containing: 27 A
            /\                                  15 B
          22  A                                  3 C
          /\                                     1 D
         7  B                                    1 E
        /\                                       1 F
       C  4                                      1 G
         /\
        2  2             the resulting tree is (with the weight of each
       /\  /\            sub-tree noted at its root) drawn on the left.
      D  EF  G
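That merge loop can be sketched in Python as follows (the function name is mine; ties between equal weights are broken by insertion order, which happens to reproduce the tree above for this example):

```python
def build_tree(leaves):
    """Repeatedly merge the two lightest subtrees; the lighter one
    goes on the left, so it will get bit 0 in the code."""
    nodes = list(leaves)  # (subtree, weight, leaf-string) triples
    while len(nodes) > 1:
        nodes.sort(key=lambda t: t[1])               # lightest first (stable)
        (lt, lw, ls), (rt, rw, rs) = nodes.pop(0), nodes.pop(0)
        nodes.append(((lt, rt), lw + rw, ls + rs))   # merged subtree
    return nodes[0]

counts = [("A", 27), ("B", 15), ("C", 3),
          ("D", 1), ("E", 1), ("F", 1), ("G", 1)]
tree, weight, leaf_string = build_tree([(c, w, c) for c, w in counts])
# weight is 49, and tree matches the diagram above
```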

This tree is encoded at the beginning of the file in the following manner:

            01           Two bits are used per node (in the previous example,
            /\           there are 6 nodes and 7 leaves), one for each branch.
          01  A
          /\             0 means non-terminal node and 1 means terminal node
        10  B            (i.e. there is a leaf hanging there).
        /\
       C 00              Our tree looks like the one on the left.
         /\
       11  11            The recursive procedure that builds the tree
       /\  /\            represents the data in the following manner:
      D  EF  G           [left-leaf][sub-tree][right-leaf]

Applying this procedure (named _) recursively -- emit the node's two bits, then each child in order, the character itself for a leaf and the recursion for a subtree -- we obtain: 010110C0011DE11FGBA (every character being replaced with its bit representation).
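Under the same assumptions, a Python sketch of that serialization (serialize is my name for the procedure the post calls _; the nested tuple below is the example tree, with tuples for nodes and strings for leaves):

```python
def serialize(tree):
    """For each node: two bits (0 = a subtree hangs on that branch,
    1 = a leaf does), then each child in order -- the character
    itself for a leaf, the recursion for a subtree."""
    left, right = tree
    out = ("0" if isinstance(left, tuple) else "1") \
        + ("0" if isinstance(right, tuple) else "1")
    for child in (left, right):
        out += serialize(child) if isinstance(child, tuple) else child
    return out

example = ((("C", (("D", "E"), ("F", "G"))), "B"), "A")
header = serialize(example)   # "010110C0011DE11FGBA"
```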

SO IT'S JUST A REGEX?

While we build the tree (lines 2 and 3), we also create a hash %t whose keys are the characters of the file to be compressed and whose values are the bit strings (the path through the tree). This path is constructed as we create the tree itself. $; and $/ are used to generate the needed 0's and 1's... $/ equals 2 (taking things modulo it helps), and $; is first undefined and is an even number each time we need it again in our map.
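Stripped of the $; and $/ trickery, the code table amounts to a plain recursive walk; here is a Python sketch of the equivalent of %t (left branch = 0, right branch = 1, as chosen above):

```python
def code_table(tree, prefix=""):
    """Collect the bit path to every leaf of the tree
    (tuples are nodes, strings are leaves)."""
    if not isinstance(tree, tuple):
        return {tree: prefix}
    left, right = tree
    table = code_table(left, prefix + "0")
    table.update(code_table(right, prefix + "1"))
    return table

example = ((("C", (("D", "E"), ("F", "G"))), "B"), "A")
table = code_table(example)
# the heaviest character gets the shortest code: table["A"] is "1"
```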

Finally, we perform a substitution on the data, replacing each character with the bit string representing its path through the tree. The compressed file is the concatenation of the length of the data (since pack pads the output with null bits, the decompressor needs this information), the dictionary (which is self-delimiting: no need to store its size), and the data itself.
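Putting the pieces together, a hedged Python sketch of that final assembly (the exact framing here, a decimal length followed by a newline, is my assumption; the obfuscated original differs in detail, and the table and tree header below are the ones worked out above):

```python
def compress(data, table, tree_header):
    """Length + dictionary + packed data, as described above."""
    # The substitution step: each character becomes its bit path.
    bits = "".join(table[ch] for ch in data)
    bits += "0" * (-len(bits) % 8)        # pack "B*" pads with null bits
    packed = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return str(len(data)).encode() + b"\n" + tree_header.encode() + packed

table = {"A": "1", "B": "01", "C": "000",
         "D": "00100", "E": "00101", "F": "00110", "G": "00111"}
data = "A" * 27 + "B" * 15 + "C" * 3 + "DEFG"
out = compress(data, table, "010110C0011DE11FGBA")
# 49 characters shrink to 11 bytes of payload (plus the header)
```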

FINAL NOTE

This program was designed to use as few distinct characters as possible, and so that some characters appear very often. It was too difficult for me to write it so that the probability of each character would be a negative power of 2 (1/2**$k if you like), which is when Huffman encoding performs at its best. My goal was to write a compressor / decompressor all in one obfuscation. Alas, Huffman doesn't perform well enough (it's that lousy dictionary's fault!). OK, next time I'll try with Lempel-Ziv...

Post-Scriptum:

crazyinsomniac asked me to add this link to a site with explanations and a Java animation.