perlmeditation
demerphq
<p>
LZW is one of those algorithms that for some reason I was always afraid of.
Huffman encoding made sense, but LZW always seemed mysterious and arcane, something
that I would never be able to wrap my poor math skills around. Well, recently I had
a look at it and it turns out to be quite approachable indeed. In fact, once you set
aside some implementation-specific details it becomes nearly simple. :-)
</p>
<i>Long</i><readmore>
<p>
LZW is interesting because it's an adaptive technique: it doesn't require
two passes over the data, and it doesn't need to transmit a decoding table as part
of the compressed data. Needing to transmit the table is what makes static Huffman
encoding relatively useless for small files, so avoiding it is nice. The encoding table
is deterministically created during compression in such a way that during decompression
the exact same table is constructed.
</p>
<h3>So how does it work?</h3>
<p>
LZW works by encoding variable-length input strings as variable-length (in terms of bits)
numeric codes. Single chars are encoded as their ordinal value, with codes being assigned
to new substrings as they are encountered. The first char is emitted as simply that: an 8-bit
value. From that point on all codes are emitted in longer-than-8-bit form (starting with
9 bits until the table grows to 512 entries, then 10 bits, and so on). It's important to keep
this point clear in your head: the code is the same, but it can be written out at different lengths.
</p>
<p>
In more detail, we try to find the longest already-coded substring (starting at the
beginning) of the input stream. As soon as we find a substring that has not been coded,
we emit the last good code value we found and at the same time add the unknown substring
to the table (giving it a code value as well). This leaves us with the single character
that caused us to miss. Since our table is initialized with the normal charset, we always
have a code for the remaining letter, and if we can't find a longer substring then that is
what we will emit. This behaviour means that we always emit enough known codes to reproduce
the new codes we are adding.
</p>
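The loop just described can be sketched in a few lines. This is a minimal illustration in Python rather than Perl (my actual implementation is the Perl code linked at the end); it works with plain integer codes and skips both the literal first byte and the variable-width bit output, which the trace below includes:

```python
# Minimal sketch of the LZW compression loop: extend the current
# substring while it is known, emit its code and register the new
# substring the moment it is not.

def lzw_compress(data: bytes):
    # Table starts with all single bytes; code 256 is reserved
    # (compress uses it as the "clear table" code), so new
    # entries begin at 257.
    table = {bytes([i]): i for i in range(256)}
    next_code = 257
    out = []
    current = b""
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            # Still a known substring; keep extending it.
            current = candidate
        else:
            # Emit the longest known prefix, register the new string,
            # and restart from the char that caused the miss.
            out.append(table[current])
            table[candidate] = next_code
            next_code += 1
            current = bytes([byte])
    if current:
        out.append(table[current])
    return out

# "AAABBB" produces [65, 257, 66, 259], i.e. "A", "AA", "B", "BB" --
# the same sequence as the 'AAABBB' trace further down.
```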
<p>
An example will probably help. Let's take the following string:
<ul>
"ABAABAAAAABBBBBBBBAAAAAAABBBBBBAAAAAAAABBBBBBBBBB".
"AAAAAAAAAAAAAABBBBBBBBBBBBBAAAAAAAAAAA"
</ul>
When we run it through the compression we see output like this:
<code>
O > char : "A"
READ:
"BAABAAAAABBBBBBBBAAAAAAABBBBBBAAAAAAAABBBBBBBBBBAAAAAAAAAAAAAABBBBBBBBBBBBBAAAAA".
"AAAAAA\r\n"
C>> 257 : "AB"
O > 66 : 001000010 : "B" | C>> 258 : "BA"
O > 65 : 001000001 : "A" | C>> 259 : "AA"
O > 257 : 100000001 : "AB" | C>> 260 : "ABA"
O > 259 : 100000011 : "AA" | C>> 261 : "AAA"
O > 261 : 100000101 : "AAA" | C>> 262 : "AAAB"
O > 66 : 001000010 : "B" | C>> 263 : "BB"
O > 263 : 100000111 : "BB" | C>> 264 : "BBB"
O > 264 : 100001000 : "BBB" | C>> 265 : "BBBB"
O > 263 : 100000111 : "BB" | C>> 266 : "BBA"
O > 261 : 100000101 : "AAA" | C>> 267 : "AAAA"
O > 267 : 100001011 : "AAAA" | C>> 268 : "AAAAB"
O > 265 : 100001001 : "BBBB" | C>> 269 : "BBBBB"
O > 266 : 100001010 : "BBA" | C>> 270 : "BBAA"
O > 267 : 100001011 : "AAAA" | C>> 271 : "AAAAA"
O > 262 : 100000110 : "AAAB" | C>> 272 : "AAABB"
O > 269 : 100001101 : "BBBBB" | C>> 273 : "BBBBBB"
O > 265 : 100001001 : "BBBB" | C>> 274 : "BBBBA"
O > 271 : 100001111 : "AAAAA" | C>> 275 : "AAAAAA"
O > 275 : 100010011 : "AAAAAA" | C>> 276 : "AAAAAAA"
O > 272 : 100010000 : "AAABB" | C>> 277 : "AAABBB"
O > 273 : 100010001 : "BBBBBB" | C>> 278 : "BBBBBBB"
O > 269 : 100001101 : "BBBBB" | C>> 279 : "BBBBBA"
O > 276 : 100010100 : "AAAAAAA" | C>> 280 : "AAAAAAAA"
O > 267 : 100001011 : "AAAA" | C>> 281 : "AAAA\r"
O > 13 : 000001101 : "\r" | C>> 282 : "\r\n"
</code>
<p>
The first line indicates that the first char was output literally; we don't need nine bits
for it. (In fact we really could output the first two bytes verbatim. <code>compress</code>
doesn't, however, and the above is meant to be compliant with that.) The second through fourth
lines show what's left of the input, and the fifth line shows that the string "AB" was added to
the code table with id #257. The sixth line shows that code 66 was emitted, and that the string
"BA" was added to the code table with id #258. That is followed by code 65 being emitted and the
string "AA" being added to the symbol table...
</p>
<p>
This sequence demonstrates why no explicit table of encodings has to be transmitted.
The first data emitted was 'A', followed by code 66 (for 'B'), followed by code 65. And the new
entries to the table are 'AB' and 'BA' (we'll overlook 'AA' here for a moment). Each new entry is
constructable from data we have already received. It turns out, however, that there is a
special case where we get confused on decompression: when the pattern matches KwKwK (where K is a
code, and thus represents a string, and w represents a character), the compression routine will emit
a code before the decompression will have defined it. Luckily we still have sufficient information
to reconstruct the code, so a special-case handler can take care of it. The simple input string
'AAABBB' illustrates this:
</p>
<code>
O > char : "A"
READ:"AABBB\r\n"
C>> 257 : "AA"
O > 257 : 100000001 : "AA" | C>> 258 : "AAB"
O > 66 : 001000010 : "B" | C>> 259 : "BB"
O > 259 : 100000011 : "BB" | C>> 260 : "BB\r"
O > 13 : 000001101 : "\r" | C>> 261 : "\r\n"
O > 10 : 000001010 : "\n" |
</code>
<p>
We first read in and output the char 'A'; we then create the code 'AA' and then output
it. That will trip up the decompression. However, we can use the old code value and the last
char read to reconstruct things, as the following shows:
</p>
<code>
First char: A
Read: 257 undef
Initializing current string for 257 ("A") from "A" and ""
E<< Created: 257 : "AA"
Read: 66 "B"
E<< Created: 258 : "AAB"
Read: 259 undef
Initializing current string for 259 ("BB") from "B" and "B"
E<< Created: 259 : "BB"
Read: 13 "\r"
E<< Created: 260 : "BB\r"
Read: 10 "\n"
E<< Created: 261 : "\r\n"
</code>
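The expand side, including the KwKwK fix from the trace above, can be sketched like so (again in Python for brevity; the hypothetical names here are not the routines from my Perl code):

```python
def lzw_expand(codes):
    # Rebuild the same table the compressor built. The one subtlety
    # is a code used before we have defined it (the KwKwK case):
    # its string must be the previous string plus its own first char.
    table = {i: bytes([i]) for i in range(256)}
    next_code = 257
    out = b""
    prev = None
    for code in codes:
        if code in table:
            entry = table[code]
        else:
            # KwKwK special case: reconstruct from the previous string
            # and its first character, exactly as in the trace above.
            entry = prev + prev[:1]
        out += entry
        if prev is not None:
            # Define the code the compressor defined one step ago.
            table[next_code] = prev + entry[:1]
            next_code += 1
        prev = entry
    return out

# Feeding back the 'AAABBB' codes [65, 257, 66, 259] recovers "AAABBB",
# even though 257 and 259 are read before they are defined.
```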
<h3>Dynamic encoding</h3>
<p>
One tricky bit of the compression is the bit twiddling needed to handle the variable-length
output codes. In C this is a nasty mess of binary operators. In Perl we can do it
a lot more easily (but obviously less efficiently) by keeping our bit pattern in
string format and then using <code>pack</code> to convert it to bytes at a later time.
This means that we don't have to worry about bit offsets: we can simply concatenate the
stringified form of the code to a buffer and let <code>pack</code> sort out the byte boundaries
as it needs to. Likewise for reading the data in, we read in a chunk of data, convert it
to a bit string, and then cut chunks off the front as we go, converting the appropriate
length back to a number for processing purposes.
</p>
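The same bit-string trick, sketched in Python (in Perl the conversion to and from bytes is <code>pack "B*"</code> / <code>unpack "B*"</code>; here <code>format()</code> plays that role, and a fixed 9-bit width stands in for the growing width):

```python
def emit_codes(codes, width=9):
    # Concatenate each code as a string of '0'/'1' characters and
    # let the conversion to bytes worry about byte boundaries --
    # no manual bit-offset arithmetic needed.
    bits = "".join(format(code, f"0{width}b") for code in codes)
    # Pad the tail out to a full byte.
    bits += "0" * (-len(bits) % 8)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

def read_codes(data, count, width=9):
    # The reverse direction: turn the bytes back into one long bit
    # string and chop fixed-width codes off the front.
    bits = "".join(format(b, "08b") for b in data)
    return [int(bits[i * width:(i + 1) * width], 2) for i in range(count)]

# Four 9-bit codes occupy 36 bits, padded to 5 bytes, and round-trip
# cleanly through the string representation.
```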
<h3>Table Overflow</h3>
<p>
Even though the underlying algorithm is dynamic, there does come a point where it becomes
infeasible to keep increasing the table size. LZW proper doesn't really say what to do in this
case. <code>compress</code>, however, will cease growing the table and simply output codes
it knows until the compression ratio starts to degrade, whereupon it flushes the table and
begins anew. This is where the special code #256 comes in. When this code is read by the
decompressor it will wipe its string and code tables, so the compressor outputs it before it
wipes the table. (It has to do it before, otherwise the wrong number of bits would be read on
the next <code>input_code()</code>.)
</p>
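The flush itself is simple; the sketch below (Python, hypothetical names, and it deliberately ignores the ratio-degradation test that real <code>compress</code> performs before deciding to flush) just shows the ordering that matters, i.e. emit #256 first, wipe second:

```python
CLEAR = 256  # compress's reserved "wipe the table" code

def flush_table(table, out):
    # Emit CLEAR *before* wiping, so the decompressor sees the code
    # at the current bit width and resets its tables in lockstep.
    out.append(CLEAR)
    table.clear()
    table.update({bytes([i]): i for i in range(256)})
    return 257  # new entries restart just past the special codes
```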
<p>
If compatibility with <code>compress</code> were not an issue then other options would be
available. I was thinking that switching to dynamic Huffman encoding of the table when it fills
up would be an interesting twist. Unfortunately I've not been able to get Vitter's algorithm
running in Perl (yet), and I haven't found a good overview of algorithm FGK to try that instead
either.
</p>
<h3>Implementation</h3>
<p>
My implementation is fairly straightforward. The key routines are <code>compress_fh</code>,
<code>output_code</code>, <code>expand_fh</code>, and <code>input_code</code>. The two
"_code" routines handle the input and output buffers, as well as ensuring
the correct number of bits per code are output or input. The other two routines handle
the table maintenance and the rest.
</p>
<p>
The other routines are fairly straightforward, with the possible exception of
<code>_init()</code>. <code>_init()</code> is called when the table is wiped, as well as
when compression or decompression begins (by <code>init()</code>).
</p>
<p>
One implementation detail that I've not bothered with is a way to deal with a serious deficiency
of the LZW algorithm: it can potentially require a lot of memory. This is dealt with by observing
that every code (above 256) is represented by another code and a character. So in the string tables
(a hash in our case, $self->{strs}), we need not store full strings for each code, but only the
previous code plus the new char. This brings the size of the string tables down considerably, but
represents more added complexity than I felt was necessary for learning the algorithm. If this were
added it would mean minor changes to the compression, but considerable changes to the decompression,
requiring a walk through the code/char combinations and then reversing the resulting list for every
output string.
</p>
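To make the walk-and-reverse concrete, here is a sketch (Python, with a hypothetical pair table; my Perl code does not do this) of recovering a string from such code/char pairs:

```python
def code_to_string(table, code):
    # table maps code -> (prefix_code, byte); single chars map to
    # (None, byte). Walk the prefix chain backwards collecting chars,
    # then reverse -- the walk described in the paragraph above.
    parts = []
    while code is not None:
        code, byte = table[code]
        parts.append(byte)
    return bytes(reversed(parts))

# Hypothetical pair table in which "AAB" is stored as
# 257 = (65, "A") and 258 = (257, "B"):
demo_table = {65: (None, 65), 66: (None, 66),
              257: (65, 65), 258: (257, 66)}

# code_to_string(demo_table, 258) walks 258 -> 257 -> 65,
# collecting "B", "A", "A", and reverses to "AAB".
```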
<h3>Implementation Bugs</h3>
<p>
The way I am handling the output bit stream differs from <code>compress</code>. I've not
been able to work out what to do to make my current code compatible with <code>compress</code>
in this respect. From what I can tell the code below behaves pretty much the same as
<code>compress</code>, producing the same file sizes etc., but for some reason a different bit
code is being output. If anybody has any ideas on that, let me know.
</p>
<h3>Other Implementations</h3>
<p>
One of our [hatmoward|own] has placed [cpan://Compress::LZW] online. His variation is somewhat
closer to the "pure" LZW implementation, in that it doesn't do dynamic code sizes (the algorithm
can output specific sizes, but they are fixed at start time). I haven't benchmarked or done
any further investigation of his code (mostly because I had written most of mine before I even
saw his).
</p>
<p>
Thanks to [diotalevi] for his comments [id://267422|here], which led me to investigating
<code>compress</code> in detail and implementing most of its behaviour (if only I could get the
code output right...). And of course thanks to [hatmoward] for publishing his version as well.
:-) [castaway] was very kind in helping me with source files, and [Zaxo] pointed me at a working
copy of <code>compress</code> (<code>ncompress</code> from Debian), which helped a lot. Thanks.
</p>
</readmore>
<p>
See also [cpan://Compress::LZW], [id://266984], [id://267408], [http://dogma.net/markn/articles/lzw/lzw.htm|Dr Dobbs on LZW],
[http://www.cs.duke.edu/~jsv/Papers/Vit89.algojournalACMversion.pdf|Vitter's Dynamic Huffman Algorithm 673],
[http://www.ics.uci.edu/~dan/pubs/DC-Sec1.html|Data Compression] and of course the accompanying source code which will be in [LZW Demystified (the code)].
</p>
<br />
---
<br />
demerphq<br />
<br />
<sub><[Elian]> And I do take a kind of perverse pleasure in having an OO assembly language...
</sub>
<br />