The Elegance of Deflate August 21st, 2016
A little while back I found need of a PNG loader for a small project of mine. Being a complete tool I of course decided to write my own -- after all, why save yourself effort when there are still wheels waiting to be reinvented? You can check the source to the inflater code here if you so like - it's quite cleanly written. In fact I wrote it specifically to be easy to read, rather than to be the fastest implementation.
I didn't know much about Deflate at the time. I knew it was based on the LZ family of algorithms (LZ77/L7SS etc). I'd heard anecdotally that it was "just" LZSS except they applied Huffman coding on the match vectors. Well, it turns out that's kinda true. Kinda. But as I read deeper into the specification, it stuck me that something really quite clever was going on here, something I hadn't seen anyone explicitly call out before. So I figured why not try and explain the essence of Deflate here. I'm not going to cover the whole workings in exacting detail; if you want that go read the spec.
Deflate was invented by Phil Katz back in 1993, and forms the basis of the ZIP file format, the zlib library, gzip, PNGs, and probably a whole bunch of other stuff. At the time it was pretty cutting-edge. The main competition back then was usually LZW, or maybe LZSS. In many people's eyes compression was still synonymous with run-length encoding, so when Deflate came along it definitely turned some heads. Now this was over 20 years ago, and so later codecs like LZMA (which I may one day write about) can regularly beat it by a healthy margin. Still, it ain't dead yet.
So the basic principle described here was also used by the LHA archiver, which predates PKZIP 2.0 by a little. Perhaps I should arguably be titling this article "The Elegance of LHA"? Well I'm not going to, so there. History is written by the victors, and the guys with the better acronyms.
There's two basic approaches to compression. OK no, that's not true, but let's go with it anyway and press onwards. To be honest if you know anything about compression already you may well be glossing over much of this, but here it is regardless.
- Entropy based compression
- Dictionary based compression
Entropy based compression is a very old idea. Computers usually store letters as 8-bit; one byte for each letter. "What a waste," someone thought. Some letters like E and T are very common, while letters like Q and Z rarely come up at all. It makes sense that you'd want to use less bits for the letters that are more common.
This idea even predates computers. Go look at Morse code -- the two most common letters in English (E and T) are just a single dot and dash.
Huffman coding is a very old and simple way to apply this. You start with an "alphabet" - the set of all the symbols you want to encode. In English at school we're taught that our alphabet is 26 letters. In computers however we need a little more. We need to store upper-case letters, lower-case letters, numbers, punctuation, we need a symbol for a 'space', and a few other little things. Once you have your full alphabet you need to assign probabilities. We know some letters (e/t/a/i/etc) are more common than others (q/x/y/z/etc), but we also have to take the rest of this expanded alphabet into account. Numbers are rare in English prose, and so are capitals. This paragraph has five-hundred and sixty-one lower-case letters in, but only ten capitals.
Once you have your probabilities all sorted out, you build a Huffman tree. This assigns a unique code to each letter, except in a way that's "self-terminating". i.e. you don't need to store any additional information to say where each letter starts and stops. If you look at the example of Morse code, it doesn't work like that. While Morse has dots and dashes, it also has gaps, and the gaps are needed to split the letters up. Huffman coding needs no gaps.
So that's entropy encoding -- you decide upon an alphabet, you use less bits to store the common things and more bits for the rare things. You can compress things like English language with it, and while it works fine it leaves a lot of room for improvement. The are better ways than Huffman coding of course, like arithmetic coding, but Huffman's the easiest to get your head around. It's also what Deflate uses, so we'll leave it there.
LZ77 forms the basis of the dictionary-based algorithms. The idea of dictionary compression is simply that certain things come up more than once in a document. For instance, if you search this page for the word "compression" it'll appear several times. So rather than store it again each time, let's try and refer to the previous usage.
LZ77 worked very simply. You store the match vector (a pair consisting of length and distance), and following the match you store the letter (the "literal") which ended the match.
So for example, if I want to store the phrase
sense and sensibility, we see that the
sens part is used twice. So we might store the second usage as
go back 10, copy 4, letter i, because we need to copy the 4 letters of "sens" and then stick a new i on the end.
Seems fine, right? Nope, LZ77 sucked. Although it could save space by referring to previous phrases, that was all it could do. The pattern was fixed - it stored a match, followed by a letter. But what if there was no match? To use the same example, here's how LZ77 would actually store that whole quote:
Input: sense and sensibility
Output: [back 0, copy 0]s[back 0, copy 0]e[back 0, copy 0]n[back 0, copy 0]s[back 0, copy 0]e[back 0, copy 0] [back 0, copy 0]a[back 0, copy 0]n[back 0, copy 0]d ...
What a mess. You get the idea though. It has to store a match, even if there isn't one. And storing that empty match information takes space. A typical vector might be 16 bits (12 bits for the distance back, 4 bits for the length). So something that was previously 8 bits per letter could now end up as 24 bits per letter! This is why no-one uses raw LZ77.
So they invented LZSS, which is what many of the early archivers used. LZSS stores a single bit to say whether the next thing coming up is a match vector, or a literal symbol. It's such a painfully obvious idea I'm at a loss as to why it wasn't thought of sooner. So now our extra marker bit allows us to flag whether a match vector is empty, and simply skip all that wasted overhead.
Anyway, that's dictionary compression. It works very well, when it's able to get lots of good matches going. It's that extra bit, though. That pesky "literal" bit we use to indicate whether we got a match. What happens if there aren't any matches? In that case, we've gone from originally storing 8 bits per letter to storing 9 bits! If no matches occur, LZSS will store 1 bit to indicate a literal letter, then 8 more bits to store that letter!
LZ77, LZSS and others in the family all suffer from the same problem, the overhead of having to switch modes -- when they find something that doesn't fit into the class of dictionary-based matches, they need to write out some kind of marker to switch over to the literal mode. This all takes up valuable space in the stream. (if you're not careful, sometimes you'll waste so much overhead doing this kind of switching that you won't actually compress the file at all, you'll enlarge it!)
So we have two compression algorithms. LZSS is reliant on finding previous data to match against, and Huffman coding is reliant on some letters being more common than others. Can we do better than picking one of those two? Can we weave them together?
Yes we can.
If we want to munge these two algorithms together, it's not a great leap to imagine how we might do it. We start from LZSS right? The matching part seems OK, it's just the literals that are bogging us down as they're completely uncompressed. So how about this -- we could just Huffman encode them? We store a bit to say if it's a match or a literal, and if it's a match then we write a 16-bit match vector, and if it's a literal then we write a variable-length Huffman-encoded symbol.
That'd work just fine I suppose, but it's not how Deflate does it. Deflate goes further than that. You're still paying the cost of that marker bit to switch modes. In order to understand Deflate we need to expand our idea of what an alphabet is.
You see, when we hear the word "alphabet" we think back to what we learned as children -- A,B,C, etc. But the moment we start getting involved in compression we know that isn't the case. We have already had to expand our alphabet to include upper-case, and numbers and such. Even spaces, which aren't even visible, we have to encode those too. Deflate just takes this concept one step further.
Deflate is based around the idea of the unified alphabet. If an alphabet is just a set of choices, why not bring all the choices together under one umbrella? Deflate's alphabet consists of 286 symbols. The first 256 are the ASCII codes for each letter, including all the ASCII control codes and other such. The remaining 30 symbols are used to represent lengths. That's right, we're storing the actual match length here. Here's the actual table used:
|0-255||Regular input byte|
|256||End of data block|
|257||Match of length 3 (distance follows after)|
|258||Match of length 4 (distance follows after)|
|259||Match of length 5 (distance follows after)|
|260||Match of length 6 (distance follows after)|
|261||Match of length 7 (distance follows after)|
|262||Match of length 8 (distance follows after)|
|263||Match of length 9 (distance follows after)|
There's a bit more to it than just that, which I've omitted here for clarity, but hopefully this should explain the principle. We have 286 symbols now, and each one will be assigned its own probability, and therefore its own Huffman code. We know that some letters are still going to be more common (E, T, etc). But what about the matches? How common are they?
Remember how we had the frequency diagram of English text earlier? Well here's how it looks with our expanded alphabet:
You can see that the lower-case 'e' is popular, as are some of the other lower-case ones like 'a/i/t/etc'. Capitals are almost non-existent, and there are no number digits at all in this case. Yet something sticks out like a sore thumb -- the dictionary matches which dominate the probability domain. It's also interesting that the probabilities of the lower-case letters have changed, due to the dictionary matches evening it out.
As we compress the data one symbol at a time, we face a choice at each stage of compression. If we can find a previous match then we can just write that match vector out, and if we don't then we write the literal symbol. But the special beauty of Deflate that sets it apart from its predecessors, is that whatever option we pick, we're only ever writing out a single Huffman-coded symbol.
If we want to store two matches in a row, or two literals in a row, we can do that. One alphabet to store two different ideas. No bit markers needed to designate a switch between modes. We're storing literal letters compressed according to their frequency, we're also storing match lengths according to their frequency, but more than that we're doing both of those things under one unified scheme.
It's not one algorithm but two, that dance and weave together in harmony. You can pick dictionary matching, or entropy coding, and you can select between them on a per-byte basis with no overhead. That's what I find so clever about it - not that it can do one thing or the other, but that it can choose either, and yet represent them using the same language. No extra markers, nothing to say "oh, now we're switching to entropy mode", nothing to get in the way.
It's an RLE encoder, it's a dictionary matcher, it's a frequency table, and it's all of these things with no barriers in-between, no modes to switch nor separate passes to run.
Deflate remains an excellent example of how two algorithms can come together and play off one another, rather than fighting against themselves.
I gotta admit, I'm impressed.
Written by Richard Mitton,
software engineer and travelling wizard.
Follow me on twitter: http://twitter.com/grumpygiant