EE274: Data Compression, course notes

Welcome! This e-book serves as lecture notes for the Stanford EE course, EE274: Data Compression. This set of lecture notes is WIP, so please file an issue at https://github.com/stanfordDataCompressionClass/notes/issues if you find any typo/mistake.

Lossless data compression

The first part of the lecture notes pertains to lossless data compression.

  1. Introduction
  2. Prefix Free Codes
  3. Kraft Inequality
  4. Entropy and Neg-log likelihood thumb rule
  5. Huffman coding
  6. Asymptotic Equipartition Property
  7. Arithmetic coding
  8. Asymmetric Numeral Systems

Introduction to lossless coding

Data compression is one area of EECS that we unknowingly make use of thousands of times a day, yet not many engineers are aware of how things work under the hood. Every text, image and video is compressed before it gets transmitted and shared across devices, using compression schemes/formats such as GZIP, JPEG, PNG, H264 and so on. A lot of these compression schemes are somewhat domain-specific: for example, we typically use GZIP for text files, while we have JPEG for images. While there is significant domain expertise involved in designing these schemes, they share lots of commonalities and fundamental similarities.

Our focus in the first half of the course is going to be on understanding the common techniques that are part of all of these compression schemes, irrespective of the domain. The goal is also to build intuition for these fundamental ideas, so that we can appropriately tweak them when eventually designing real-life and domain-specific compression techniques. With that in mind, we are going to start with very simple, hypothetical but relevant scenarios, and try to tackle those. As we will see, understanding these simpler cases will help us understand the fundamental details of compression.

Let's start with a simple example:

Let's say we are given a large random text file consisting of the 4 letters A, B, C, D. To be more precise, each letter of the text file is generated randomly by picking a letter from the set $\{A, B, C, D\}$ independently and with uniform probability, i.e.

$$P(A) = P(B) = P(C) = P(D) = \frac{1}{4}$$

An example of such a file is given below. Let's say we have such a file containing 1,000,000 letters.

ACABDADCBDDCABCDADCBDCADCBDACDACDBAAAAACDACDA...

Now if one examines this file, then the size of the file is going to be 1,000,000 bytes, as can be seen below.

➜ ls -l sample_ABCD_uniform.txt 
-rw-rw-r-- 1 kedar kedar 1000000 Jun 23 19:27 sample_ABCD_uniform.txt

This is quite obvious, as we are using 1 byte (or 8 bits/symbol) to represent each letter. In fact, if we look at the bits in the file, we see that each symbol has a unique codeword, given by the ASCII table:

| Symbol | ASCII code |
|--------|------------|
| A      | 1000001    |
| B      | 1000010    |
| C      | 1000011    |
| D      | 1000100    |

Every time we write text to a file, the programming language encodes the text using the ASCII encoding table, which maps characters and punctuation to binary data. Also, when we read the file back, the binary data is read 8 bits at a time and decoded back into characters.

Now, the question we want to answer is "Can we reduce the size of the file?"

To be more precise, we will restrict ourselves to Symbol codes, which are encoding schemes where we have a unique binary string (called a codeword) for every symbol.

Quiz-1: Fixed Bitwidth Encoding

For the example above, where our file contains random letters from the set $\{A, B, C, D\}$, the ASCII code gives us a compression of 8 bits/symbol. Can we design a symbol code which achieves better compression and is lossless, i.e. a lossless compression scheme?

Fixed Bitwidth encoding

If we look at the ASCII code table, we see that all the codes start with 1000, which is kind of redundant. It is useful if we have other letters in the file, but in this case we only have 4 letters. Thus, we don't need all 8 bits! In fact, we can just use 2 bits/symbol to represent them all as follows:

| Symbol | Codeword |
|--------|----------|
| A      | 00       |
| B      | 01       |
| C      | 10       |
| D      | 11       |

Note that this is lossless compression, as during decoding as well we can just read 2 bits at a time and map them to A,B,C,D. Thus, this simple idea has given us a compression of 4x!

In general, if we have $R$ different symbols in our text file, then using this simple idea we can represent/encode the data using $\lceil \log_2 R \rceil$ bits per symbol.
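To make this concrete, here is a small illustrative snippet (our own, not part of the original notes) showing fixed-bitwidth encoding and decoding for this 4-letter alphabet; the helper names (`encode_table`, `encode`, `decode`) are ours.

```python
import math

alphabet = ["A", "B", "C", "D"]
bitwidth = math.ceil(math.log2(len(alphabet)))  # 2 bits/symbol for 4 symbols

# fixed-bitwidth codewords: the index of each symbol written using `bitwidth` bits
encode_table = {s: format(i, f"0{bitwidth}b") for i, s in enumerate(alphabet)}
decode_table = {code: s for s, code in encode_table.items()}

def encode(text):
    return "".join(encode_table[s] for s in text)

def decode(bits):
    # read exactly `bitwidth` bits at a time and map them back to symbols
    return "".join(decode_table[bits[i:i + bitwidth]] for i in range(0, len(bits), bitwidth))

assert decode(encode("ACABDADCB")) == "ACABDADCB"
```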

Going beyond Fixed Bitwidth encoding

Now, this 2 bits/symbol compression is all well and good, but it works only if some letters are entirely absent from our text file. For example, if in our file the letters C and D were absent and only the letters A and B were present, then we could go from 2 bits/symbol to 1 bit/symbol by assigning codes as follows:

| Symbol | Codeword |
|--------|----------|
| A      | 0        |
| B      | 1        |

But, what if C and D were present, just in a smaller proportion? Is it possible to achieve better compression? Here is a more precise question:

Quiz-2: Beyond Fixed Bitwidth encoding

Let's say that our random text file contains the letters A, B, C, D, such that each letter is sampled independently from a skewed (non-uniform) probability distribution in which A and B occur much more frequently than C and D. Can we then design a symbol code which achieves lossless compression better than 2 bits/symbol?

Clearly, if we stick with fixed bitwidth encoding, then irrespective of the distribution of letters in our file, we are always going to get compression of 2 bits/symbol for the 4 symbols, and so of course we can't achieve anything better than 2 bits/symbol. But intuitively we know that most of the time we are only going to encounter symbol A or B, and hence we should be able to compress our file closer to 1 bit/symbol. To improve compression, the idea here is to go for variable-length codes, where the code length of each symbol can be different.

But designing variable-length codes which are also lossless is a bit tricky. In the fixed-bitwidth scenario, the decoding was simple: we always looked at 2 bits at a time, and just performed a reverse mapping using the code table we had. In the case of variable-length codes, we don't have that luxury, as we don't really know how many bits the next codeword consists of.

Thus, variable-length codes need to have some special property which makes them both lossless and convenient to decode.

Instead of just mentioning the property, we are going to try to discover and reverse-engineer it! With that in mind, I am going to present the solution to Quiz-2, but what you have to do is figure out the decoding procedure! So here is one possible code:

| Symbol | Codeword |
|--------|----------|
| A      | 0        |
| B      | 10       |
| C      | 110      |
| D      | 111      |

Some questions for you to answer:

Quiz-3: Variable length symbol code

For the variable-length code above, answer the following questions:

  1. Are we doing better than 2 bits/symbol in terms of the average compression?
  2. Is the code really lossless?
  3. How can we perform the decoding?

Even before we try to answer these questions, let's try to first eyeball the designed code and see if we can understand something! We see that the code for A is now just 1 bit instead of 2. At the same time, the code lengths for C and D are 3 instead of 2. So, in a way, we are reducing the code length for symbols which are more frequent, while increasing the code length for symbols which occur less frequently, which intuitively makes sense if we want to reduce the average code length of our file!

Compression ratio: Let's try to answer the first question: "Are we doing better than 2 bits/symbol in terms of the average compression?" To do this, we are going to compute $L$, the average code length of the code:

$$L = \sum_{s \in \{A,B,C,D\}} P(s) \cdot l(s) = 1.53 \text{ bits/symbol}$$

So, although in the worst case this increases the symbol code length from 2 to 3, we have been able to reduce the average code length from 2 to 1.53. Although this is not as good as the 1 bit/symbol we saw when C and D were completely absent, it is still an improvement over 2 bits/symbol. Great!

Lossless Decoding: Now that we have answered the first question, let's move to the second one: "Is the code really lossless?"

Okay, let's get back to the code we have:

| Symbol | Codeword |
|--------|----------|
| A      | 0        |
| B      | 10       |
| C      | 110      |
| D      | 111      |

Let's take an example string and see how the encoding/decoding proceeds:

input_string = ABCADBBA

The encoding proceeds in the same way as earlier: for every symbol, we look at the lookup table and write the corresponding bits to the file. So, the encoding for our sample string looks like:

input_string = ABCADBBA
encoding = concat(0, 10, 110, 0, 111, 10, 10, 0)
= 010110011110100

Now, let's think about how we can decode this bit sequence: 010110011110100. The issue with using a variable-length code, unlike before where we used 2 bits for each symbol, is that we don't know beforehand where each codeword ends. For example, we don't know how to decode symbol no. 7 directly. But let's see how we can decode the input sequentially, one symbol at a time.

  1. The first bit of 010110011110100 is 0. We see that only one codeword (corresponding to A) starts with 0, so we can be confident that the first letter was an A! At this point we have consumed 1 bit, and so to decode the second symbol we have to start reading from the second bit.

  2. Let's now focus our attention on the remaining bits: 10110011110100. Now, the first bit is a 1. Unfortunately B, C and D all have codewords which start with a 1, so let's read another bit: 10. Okay, we have a match with the symbol B! We can also be confident that the second symbol is indeed B, as no other codeword begins with 10. Thus, our decoded string until now is A,B.

  3. We can continue in the same way and decode the remaining symbols one at a time, following the same procedure, to recover the input string: ABCADBBA.

Below is pseudocode to decode a single symbol using the matching scheme described above.

codewords_table = {A: 0, B: 10, C: 110, D: 111}


def decode_symbol(encoded_bitstring, codewords_table):
    temp_bits = ""

    # read until we find a match with a codeword in the codewords_table
    while not find_match(temp_bits, codewords_table):
        temp_bits += encoded_bitstring.read_next_bit()

    decoded_symbol = find_match(temp_bits, codewords_table)
    return decoded_symbol
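For concreteness, here is a small runnable version of the same idea (our own sketch, with the bitstream represented simply as a Python string of '0'/'1' characters rather than a bit-reader object):

```python
codewords_table = {"A": "0", "B": "10", "C": "110", "D": "111"}
inverse_table = {code: symbol for symbol, code in codewords_table.items()}

def encode(text):
    # encoding: concatenate the codeword of each symbol
    return "".join(codewords_table[s] for s in text)

def decode(encoded_bits):
    decoded, temp_bits = [], ""
    for bit in encoded_bits:
        temp_bits += bit
        # a match is found exactly when temp_bits equals one of the codewords
        if temp_bits in inverse_table:
            decoded.append(inverse_table[temp_bits])
            temp_bits = ""
    return "".join(decoded)

assert encode("ABCADBBA") == "010110011110100"
assert decode("010110011110100") == "ABCADBBA"
```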

Note that one reason why our decoding procedure works is that no codeword is an initial segment of another codeword, i.e. no codeword is a prefix of another. Thus, if the codeword for B is 10, then no other codeword can begin with 10. This kind of symbol code is known as a prefix-free code (or simply a prefix code). As we just saw, we can confidently say that if a code is *prefix-free*, then it indeed leads to lossless compression and a convenient decoding procedure.

Note that prefix-free codes aren't the only way of designing symbol codes which are lossless. For example, consider the following code

| Symbol | Codeword |
|--------|----------|
| A      | 0        |
| B      | 01       |
| C      | 011      |
| D      | 111      |

Notice that the codeword for B starts with 0, which is in fact the codeword for A (and similarly, the codeword for C starts with 01, the codeword for B). Thus, this code is not prefix-free and our decoding procedure described above will fail. For example, if we encode a C -> 011..., our decoding procedure will always output an A, as it would find a match at 0 itself. Note that although our simple decoding scheme above fails, lossless decoding is still possible. The decoding scheme might be more complicated though, as we have to read "future bits" to determine whether, say, an A or a B was encoded here.

As a fun exercise, can you try to figure out how the decoding works for this code?

Quiz-4: Non-prefix free code example

Can you think of a decoding procedure which works for this code? (Hint: It sometimes helps to think in the reverse direction)

| Symbol | Codeword |
|--------|----------|
| A      | 0        |
| B      | 01       |
| C      | 011      |
| D      | 111      |

To summarize: the key idea to learn from the design of our code is that

Key-Idea 1: Prefix-free codes

If no codeword in a symbol code is a prefix of any other codeword, then the symbol code is a prefix-free code.

Prefix-free codes by design are lossless compression schemes which allow for a convenient decoding scheme.

Another name for prefix-free codes is instantaneous codes as the decoding procedure is in a sense "instantaneous", i.e. we don't have to read bits written by future symbols to decode the current symbols.

Next Lecture

In this lecture, we discussed the design of a simple prefix-free code which improves upon the compression of a non-uniform source from 2 bits/symbol to 1.53 bits/symbol by using variable length codewords. We also briefly discussed how the decoding might proceed.

In the next lecture we will take this discussion further and try to answer the following questions:

  1. Implementing the decoding: We will see how to transform our simple decoding procedure into an efficient algorithm, which we can implement in practice

  2. Designing prefix-free codes: We will discuss how to design prefix-free codes to obtain better compression.

Prefix(-Free) Codes

As a recap, in the previous lecture, we discussed a simple example of a prefix-free code. We also discussed a simple procedure for decoding data encoded using the prefix-free code. In this lecture we will be thinking about how to actually go about implementing the decoding, and also how to design prefix-free code itself.

Let's start by thinking about the decoding. As a recap, here is our prefix-free code and the decoding procedure:

| Symbol | Codeword |
|--------|----------|
| A      | 0        |
| B      | 10       |
| C      | 110      |
| D      | 111      |
### Prefix-free decoding procedure
codewords_table = {A: 0, B: 10, C: 110, D: 111}

def decode_symbol(encoded_bitstring, codewords_table):
    temp_bits = ""

    # read until we find a match with a codeword in the codewords_table
    while not find_match(temp_bits, codewords_table):
        temp_bits += encoded_bitstring.read_next_bit()

    decoded_symbol = find_match(temp_bits, codewords_table)
    return decoded_symbol

The procedure is as follows:

  • We start from the encoded bitstream, and try to find a match with one of the codewords. We stop when we get our first match.
  • Note that the decoding procedure above is "correct" because we have a prefix-free code, and so we can stop searching when find_match obtains its first match.

Let's think about how we can implement the find_match function.

Idea-1: Hash-tables: One simple way to implement the find_match function in the decoding is via a hash table. We can just create a hash table using the codewords_table. We can then just query the hash table with the temp_bits, until we get a match.

Can we think of a simple data-structure which will make this procedure faster?

Prefix-free Tree

Another approach towards prefix-free code decoding is to construct a prefix-free tree and use this tree for decoding. Let's get back to our prefix-free code example to see what I mean by that:

Given the codewords table, we can represent it as a binary tree as follows:

graph TD
  *(Root) -->|0| A:::endnode
  *(Root) -->|1| n1(.)
  n1 -->|10| B:::endnode
  n1 -->|11| n2(.)
  n2 -->|110| C:::endnode
  n2 -->|111| D:::endnode
  1. The idea behind the prefix-free tree construction is simple. For each codeword, we add a node at depth len(codeword) from root node, taking the right path or the left path from previous node in the tree depending on whether the corresponding bit in the codeword is 1 or 0. E.g. in our codewords table B -> 10. In this case we add a node to the binary tree at depth 2 = len(10), corresponding to the path 10 -> right, left from the root node. Similarly, for C -> 110, we add a node at depth 3 corresponding to the path 110 -> right, right, left.
  2. Notice that for prefix-free codes, the codewords correspond to the leaf nodes. This can be shown by contradiction: if there was another node n2 corresponding to codeword c2 sprouting out of node n1 (with codeword c1), then, based on the construction of the prefix-free tree defined in the previous step, c1 would be a prefix of c2. This is a contradiction, because it violates the prefix-free property of our code. In fact, the property that the codewords correspond to the leaf nodes of the prefix-free tree is another way to define prefix-free codes! (A small code sketch of this tree construction follows right after this list.)
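Here is the promised sketch of the construction (our own minimal version; the decoding pseudo-code below accesses `prefix_tree.root`, whereas this helper simply returns the root node directly):

```python
class TreeNode:
    def __init__(self):
        self.left = None     # child along bit 0
        self.right = None    # child along bit 1
        self.symbol = None   # set only for leaf nodes

    @property
    def is_leaf_node(self):
        return self.symbol is not None

def build_prefix_free_tree(codewords_table):
    root = TreeNode()
    for symbol, codeword in codewords_table.items():
        node = root
        for bit in codeword:
            # create the child along this bit if it doesn't exist yet, then descend
            if bit == "0":
                node.left = node.left or TreeNode()
                node = node.left
            else:
                node.right = node.right or TreeNode()
                node = node.right
        node.symbol = symbol  # the node at depth len(codeword) becomes the leaf for this symbol
    return root

root = build_prefix_free_tree({"A": "0", "B": "10", "C": "110", "D": "111"})
```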

Okay, now that we have understood how to create the prefix-free tree data structure for any prefix-free code, can we use this tree structure to improve the decoding algorithm? Take a look!

## Prefix-free code decoding using a prefix-free tree

prefix_tree = ...  # construct the tree here using codewords

def decode_symbol(encoded_bitarray, prefix_tree):
    # start from the root node and read bits until you read a leaf node
    node = prefix_tree.root
    while not node.is_leaf_node:
        bit = encoded_bitarray.read_next_bit()
        node = node.right if bit else node.left
    # return the symbol corresponding to the node, once we reached a leaf node
    return node.symbol

Some observations:

  1. This decoding scheme is similar in logic to the previous one, but quite a bit more efficient, as we are not querying the hash table multiple times.
  2. The key idea is that for prefix-free codes the codewords correspond to leaf nodes. Hence, we just walk down the tree using the encoded bits until we reach one of the leaf nodes.
  3. Note that we need to create the prefix-free tree during the pre-processing step of our decoding as a one-time cost, which typically gets amortized if the data we are encoding is quite big.

As we will see later the prefix-free tree is not just useful for efficient decoding, but is a great way to think and visualize prefix-free codes. We will in fact use the structure when we learn about how to design a good prefix-free code.

How to design good prefix-free codes?

Okay, now that we have convinced ourselves that prefix-free codes are indeed lossless and that they have an efficient decoding scheme, let's revisit our code and think again about why the scheme leads to better compression.

To recap: we wanted to design a code for the skewed non-uniform distribution introduced in Quiz-2.

We started with the following Fixed Bitwidth code:

| Symbol | Codeword |
|--------|----------|
| A      | 00       |
| B      | 01       |
| C      | 10       |
| D      | 11       |

Then we were able to construct a variable-length code, which is indeed lossless and improves the compression from 2 bits/symbol to 1.53 bits/symbol on average.

| Symbol | Codeword |
|--------|----------|
| A      | 0        |
| B      | 10       |
| C      | 110      |
| D      | 111      |

How did this improvement come about?

  1. The Fixed Bitwidth code assigns equal importance to all the symbols A, B, C, D, as the code-lengths are the same (2 bits/symbol). This seems all good in case the symbols are equiprobable, i.e. if the probabilities are $P(A) = P(B) = P(C) = P(D) = \frac{1}{4}$.

  2. But in the case of the skewed probability distribution, clearly A and B are more important, as they occur much more frequently. So, we should try to assign shorter codewords to A and B, and we can afford to assign longer codewords to C and D, as they occur much less frequently.

Thus we have a simple and somewhat obvious thumb rule for a code:

Key-Idea 2

Given a distribution, it is better (in terms of compression) to assign shorter codewords to symbols with higher probability.

Even though we want to assign shorter codes to symbols with higher probability, it is not clear what the proportionality should be.

For example, the prefix-free code we designed works well for the skewed distribution from Quiz-2 and gives us an average codelength of 1.53 bits/symbol. But the same code doesn't work well for less skewed distributions: for some distributions over $\{A, B, C, D\}$ the average codelength of this code is 2.1 bits/symbol (which is even higher than the Fixed Bitwidth code!).

This problem, (and much more) was definitively analyzed in the brilliant work by Claude Shannon in his 1948 paper A Mathematical theory of Communication. Shannon's work laid the theoretical foundations of not just the field of data compression, but more so of the area of error-free data communication. (The paper remains surprisingly accessible and short for its legendary nature. Do take a look!)

We will definitely come back in future lectures to understand the work in more detail, but here is another thumb rule from Shannon's work for the optimal code-lengths: a symbol with probability $p$ should be assigned a codeword of length roughly

$$l \approx \log_2 \frac{1}{p}$$

Note that this is still a thumb rule, until we prove it and show convincingly in which scenarios it holds true. But let's try to verify this rule in a couple of simple cases:

  1. Uniform Distribution: Let's consider the case of $P(A) = P(B) = P(C) = P(D) = \frac{1}{4}$. We know that, as we have 4 equiprobable symbols, we can encode the data using 2 bits/symbol. This matches the thumb rule above: $\log_2 \frac{1}{1/4} = 2$. In general, we see that if the probability is $\frac{1}{k}$ for all the $k$ symbols of a distribution, then the optimal codelength is going to be $\log_2 k$, according to the thumb rule. Now, in cases where $k$ is not a power of $2$, and if we have a single unique codeword per symbol, then the best we can achieve is $\lceil \log_2 k \rceil$.

  2. Dyadic distribution: Based on the thumb rule, we need $\log_2 \frac{1}{p(s)}$ to be an integer, which is only possible if $p(s)$ is a negative power of $2$ for all symbols. Such distributions are known as dyadic distributions.

    Dyadic distribution: A distribution $p$ is called dyadic if, for every symbol $s$, $p(s) = 2^{-k_s}$ for some positive integer $k_s$.

    Based on the thumb rule, it is unambiguously clear that for a dyadic distribution the symbol $s$ should have a codeword of length $\log_2 \frac{1}{p(s)}$. For example, for the following dyadic distribution the code-lengths are going to be:

    | Symbol | Prob | Optimal codelength |
    |--------|------|--------------------|
    | A      | 1/2  | 1                  |
    | B      | 1/4  | 2                  |
    | C      | 1/8  | 3                  |
    | D      | 1/8  | 3                  |
  3. General Distributions: For generic distributions, $\log_2 \frac{1}{p(s)}$ might not be an integer, and so codelengths exactly equal to it might not be achievable in general. But a good goal to aim for is $l(s) = \left\lceil \log_2 \frac{1}{p(s)} \right\rceil$. Is this possible? (A short numerical check of these codelengths is sketched right after this list.) Let's look at this problem in the next section.
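Here is the short numerical check mentioned above (our own snippet): computing $\lceil \log_2 \frac{1}{p} \rceil$ for the dyadic example and for a general distribution.

```python
import math

def shannon_codelengths(prob_dict):
    # thumb rule: l(s) = ceil(log2(1/p(s)))
    return {s: math.ceil(math.log2(1 / p)) for s, p in prob_dict.items()}

# dyadic distribution: the lengths are exactly log2(1/p)
print(shannon_codelengths({"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}))
# {'A': 1, 'B': 2, 'C': 3, 'D': 3}

# a general (non-dyadic) distribution, used in the next section
print(shannon_codelengths({"A": 0.55, "B": 0.25, "C": 0.1, "D": 0.1}))
# {'A': 1, 'B': 2, 'C': 4, 'D': 4}
```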

Designing prefix-free codes

Okay! So, now that we know that the code-lengths to shoot for are $l(s) = \left\lceil \log_2 \frac{1}{p(s)} \right\rceil$, let's try to think about how to construct such a code.

Let's take a simple example (see below) and see if we can try to come up with a prefix-free code with the prescribed lengths.

| Symbol | Prob | Optimal codelength |
|--------|------|--------------------|
| A      | 0.55 | 1                  |
| B      | 0.25 | 2                  |
| C      | 0.1  | 4                  |
| D      | 0.1  | 4                  |

We know that all prefix-free codes have a corresponding prefix-free tree. So, essentially we want to come up with a binary tree with the leaf nodes at a distance equal to the code-lengths from the root node. For example, in the above example, we want to construct a binary tree with leaf nodes at distance 1, 2, 4, 4 from the root node. Note that there can be some additional leaf nodes to the binary tree which are not assigned to any codeword.

  1. Let's start with nodes at distance 1. We know a binary tree has $2^1 = 2$ nodes at distance 1 from the root node. These correspond to codewords 0 (left node) and 1 (right node). Let's make the left node a leaf node corresponding to symbol A (with codelength 1 as needed).
graph TD
     *(Root) -->|0| A:::endnode
     *(Root) -->|1| n1(.)
  2. Now we have the right node (corresponding to codeword 1), which we can split further into two nodes corresponding to 10 and 11. As we need to assign a codeword of length 2, let's assign the node corresponding to 10 to symbol B, and make it a leaf node.
graph TD
     *(Root) -->|0| A:::endnode
     *(Root) -->|1| n1(.)
     n1 -->|10| B:::endnode
     n1 -->|11| n2(.)
  3. We are now again left with the right node 11, which we can split further into two nodes 110, 111 at distance 3 from the root node. Now, looking at the table of code-lengths, we do not have any codeword to assign at length 3, so let's split node 110 further into nodes corresponding to 1100 and 1101. We can now assign these two nodes to symbols C and D respectively.
graph TD
     *(Root) -->|0| A:::endnode
     *(Root) -->|1| n1(.)
     n1 -->|10| B:::endnode
     n1 -->|11| n2(.)
     n2 --> |110| n3(.)
     n2 --> |111| n4(.)
     n3 --> |1100| C:::endnode
     n3 --> |1101| D:::endnode

Thus, our final codewords are:

| Symbol | Prob | Optimal codelength | Codeword |
|--------|------|--------------------|----------|
| A      | 0.55 | 1                  | 0        |
| B      | 0.25 | 2                  | 10       |
| C      | 0.1  | 4                  | 1100     |
| D      | 0.1  | 4                  | 1101     |

Notice that the leaf node corresponding to 111 was not linked to any symbol. That is absolutely ok! It just implies that we could have squeezed in another symbol here with length 3, and points to the slight inefficiency of using the approximation $l(s) = \left\lceil \log_2 \frac{1}{p(s)} \right\rceil$ instead of the exact $\log_2 \frac{1}{p(s)}$.

The procedure we followed seems general enough that we can follow it for any distribution:

  1. Sort the distribution in descending order and compute codelengths $l_i = \left\lceil \log_2 \frac{1}{p_i} \right\rceil$.
  2. Assign to each symbol, in that order, the first lexicographically available codeword of the stipulated length, i.e. the smallest codeword of that length such that no previously assigned codeword is a prefix of it. (A sketch of this procedure in code is given below.)
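Here is the promised sketch of this procedure in code (our own illustrative version, based on the standard "canonical" codeword assignment; it is not the SCL implementation): sort by probability, compute the lengths, then hand out codewords in lexicographic order, shifting the codeword counter left whenever the required length grows.

```python
import math

def construct_prefix_free_code(prob_dict):
    # 1. sort symbols by decreasing probability and compute l = ceil(log2(1/p))
    symbols = sorted(prob_dict, key=lambda s: -prob_dict[s])
    lengths = {s: math.ceil(math.log2(1 / prob_dict[s])) for s in symbols}

    # 2. hand out codewords in lexicographic order: keep an integer counter and,
    #    when moving to a longer codelength, append zeros (shift left)
    code, counter, prev_len = {}, 0, None
    for s in symbols:
        l = lengths[s]
        if prev_len is not None:
            counter = (counter + 1) << (l - prev_len)  # next free codeword of length l
        code[s] = format(counter, f"0{l}b")
        prev_len = l
    return code

print(construct_prefix_free_code({"A": 0.55, "B": 0.25, "C": 0.1, "D": 0.1}))
# {'A': '0', 'B': '10', 'C': '1100', 'D': '1101'}, matching the example above
```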

We still need to argue the correctness of the procedure and that it will work in all cases! For example, how can we be sure that we will not run out of nodes?

We will look at all these problems in the next lecture.

Summary & Next Lecture

To summarize, here are the key takeaways from this lecture:

  1. Prefix-free codes: Among symbol codes, prefix-free codes allow for efficient and convenient encoding/decoding. In fact, as we will see, given any uniquely decodable (lossless) code, we can construct a prefix-free code with the exact same code-lengths and thus the same compression performance. Thus, among symbol codes we restrict ourselves to the analysis of prefix-free codes.

  2. Prefix-free tree: Any prefix-free code has a prefix-free binary tree associated with it, which has the leaf nodes corresponding to the codewords.

  3. Compression thumb rule: One somewhat obvious thumb rule we learnt for symbol codes was that if $p_i > p_j$ then we should have $l_i \le l_j$; else we could swap the codewords for symbols $i$ and $j$ and achieve a lower average codelength.

  4. Neg Log-likelihood: Another thumb rule we learnt was that the optimal code-lengths for a prefix-free code are $l_i = \log_2 \frac{1}{p_i}$. We will understand why this rule is true in the next lecture.

  5. Prefix-free code construction: We looked at a procedure for constructing prefix-free codes for any distribution. In the next lecture, we will justify the correctness of the procedure and also discuss a couple more ideas for constructing prefix-free codes.

Designing prefix-free codes

As a recap, in the previous lecture we discussed how to design a prefix-free code given the code lengths. We discussed a simple procedure which constructs a prefix-free tree given the code-lengths.

We also saw a simple thumb rule which tells us what the code-lengths of the prefix-free code should be. In this lecture we are going to discuss two things:

  1. Justify the correctness of our prefix-free tree construction with lengths $l_i = \left\lceil \log_2 \frac{1}{p_i} \right\rceil$.
  2. Look at a few more prefix-free code constructions

Kraft Inequality & converse

With the goal of proving the correctness of the prefix-free tree construction, we will first look at a simple but fundamental property of binary trees, called the Kraft-McMillan inequality (or simply the Kraft inequality).

Theorem-1: Kraft Inequality

Consider a binary tree where the leaf nodes are at depths $l_1, l_2, \ldots, l_k$ from the root node.

Then the leaf node depths satisfy the inequality:

$$\sum_{i=1}^{k} 2^{-l_i} \leq 1$$

The inequality is quite elegant, and so is its proof. Any thoughts on how the proof might proceed? Here is a hint:

Hint: Let $l_{\max} = \max_i l_i$. Then the Kraft inequality can be written as:

$$\sum_{i=1}^{k} 2^{\,l_{\max} - l_i} \leq 2^{\,l_{\max}}$$

All we have done here is multiply both sides by $2^{l_{\max}}$, but this simple transformation will help make the inequality more interpretable! Can you see the proof now? Here is a proof sketch:

  • Let's try to interpret the RHS: $2^{l_{\max}}$ is the number of nodes of the binary tree at depth $l_{\max}$ (imagining the tree grown out fully to depth $l_{\max}$).
  • The LHS also has a natural interpretation: given a leaf node at depth $l_i$, one can imagine that it corresponds to $2^{\,l_{\max} - l_i}$ nodes at depth $l_{\max}$ (its would-be descendants at that depth).
graph TD
  *(Root) -->|0| A:::endnode
  A -.-> n6(.):::fake
  A -.-> n7(.):::fake
  n6-.-> n8(.):::fake
  n6-.-> n9(.):::fake
  n7-.-> m1(.):::fake
  n7-.-> m2(.):::fake
  *(Root) -->|1| n1(.)
  n1 -->|10| B:::endnode
  B -.-> m3(.):::fake
  B -.-> m4(.):::fake
  n1 -->|11| n2(.)
  n2 --> |110| C:::endnode
  n2 --> |111| D:::endnode
  classDef fake fill:#ddd;

For example, in the tree example above, node A has 4 nodes corresponding to it at depth 3, while node B has 2 nodes.

  • It is clear that these depth-$l_{\max}$ nodes are distinct for each of the leaf nodes (think why?). As the total number of nodes at depth $l_{\max}$, i.e. $2^{l_{\max}}$, is at least as large as the sum of the nodes at depth $l_{\max}$ corresponding to the leaf nodes at depths $l_1, \ldots, l_k$, we get the inequality

$$\sum_{i=1}^{k} 2^{\,l_{\max} - l_i} \leq 2^{\,l_{\max}}$$

  • This completes the proof sketch for the Kraft inequality: $\sum_{i=1}^{k} 2^{-l_i} \leq 1$.

Well, that was a short and simple proof! It is also clear that the equality is true, if and only if there is no leaf node left unaccounted for.

We can use the Kraft inequality to now show the correctness of the prefix-free tree construction with code-lengths $l_i = \left\lceil \log_2 \frac{1}{p_i} \right\rceil$, as we discussed in the last lecture.

Prefix-tree construction correctness

To recap, our prefix-free tree construction proceeds as follows:

  1. We are given a probability distribution $(p_1, p_2, \ldots, p_k)$ for the symbols $\{s_1, s_2, \ldots, s_k\}$. WLOG assume that $p_1 \geq p_2 \geq \ldots \geq p_k$. Now, compute code-lengths $l_i = \left\lceil \log_2 \frac{1}{p_i} \right\rceil$. Thus, the lengths satisfy $l_1 \leq l_2 \leq \ldots \leq l_k$.

  2. The prefix-free tree construction follows by starting with an empty binary tree, and then recursively adding a leaf node at depth $l_i$ to the binary tree, for $i = 1, 2, \ldots, k$.

We want to argue the correctness of the tree construction at each step, i.e. we want to show that when we are adding the node $n_r$ (at depth $l_r$), there will always be a spot available for us to do so.

Let's proceed towards showing this inductively.

  • In the beginning, we just have the root node, so we can safely add the node $n_1$ with length $l_1$: we simply grow the binary tree to depth $l_1$ (which has $2^{l_1}$ leaf nodes) and assign one of these leaf nodes to $n_1$.
  • Now, let's assume that we already have a binary tree $T_{r-1}$ with the nodes $n_1, n_2, \ldots, n_{r-1}$ added, and that we want to add node $n_r$. We want to argue that there will always be a spot available at depth $l_r$ in the binary tree $T_{r-1}$. Let's see how we can show that:
  • If we look at the code-lengths $l_1, l_2, \ldots, l_k$, i.e. the depths of the nodes $n_1, \ldots, n_k$, we see that they satisfy the Kraft inequality $\sum_{i=1}^{k} 2^{-l_i} \leq 1$. Now, as $r - 1 < k$, this implies that the node depths of the tree $T_{r-1}$ satisfy the strict inequality $\sum_{i=1}^{r-1} 2^{-l_i} < 1$.
  • We know from the Kraft inequality argument that if the inequality is strict, then there is a node at depth $l_{\max}$ which is neither equal to nor a descendant of any of the nodes $n_1, \ldots, n_{r-1}$. Now, as $l_{r-1} \leq l_r \leq l_{\max}$, the ancestor of this spare node at depth $l_r$ is also free (it cannot coincide with, descend from, or be an ancestor of any of the already placed nodes, since those are at depths $l_i \leq l_r$), so the node $n_r$ can be added to the binary tree $T_{r-1}$.
  • This completes the correctness proof.

In fact, if you look at the proof, all we have used is the fact that $\sum_{i=1}^{k} 2^{-l_i} \leq 1$. Thus, the same proof also gives us the following converse of the Kraft inequality:

Theorem-2: Converse of Kraft Inequality

Let $l_1, l_2, \ldots, l_k$ be positive integers such that:

$$\sum_{i=1}^{k} 2^{-l_i} \leq 1$$

Then we can always construct a binary tree with leaf nodes at depths $l_1, l_2, \ldots, l_k$ from the root node.

Kraft inequality and the thumb rule

In the last chapter we introduced the thumb rule $l_i \approx \log_2 \frac{1}{p_i}$ without any justification. Since then we have seen a code construction that gets close to this thumb rule. Here we briefly sketch how the Kraft inequality can be used to justify the thumb rule. Details can be found in Section 5.3 of the extremely popular book by Cover and Thomas, "Elements of Information Theory". The idea is to consider the optimization problem

$$\min_{l_1, \ldots, l_k} \sum_{i=1}^{k} p_i l_i \quad \text{subject to} \quad \sum_{i=1}^{k} 2^{-l_i} \leq 1$$

This is simply the problem of minimizing the expected code length for a prefix code (the Kraft inequality gives a mathematical way to express the constraint). While this is an integer optimization problem (due to the code lengths being integers) and hence is hard to solve, we can relax the integer constraint and try to solve it over all positive real $l_i$. Then the method of Lagrange multipliers from convex optimization can be used to solve this problem and obtain the optimal code lengths $l_i = \log_2 \frac{1}{p_i}$. In the next chapter, we will look again into this and derive the thumb rule in a different way (though the Kraft inequality still plays a crucial role).

Entropy & Neg-log likelihood thumb rule

In this lecture, our goal is to understand where the thumb rule $l_i \approx \log_2 \frac{1}{p_i}$ for prefix-free code-lengths came from. To understand this, we are going to take a brief detour and get a sneak peek into the area of Information Theory.

Information theory is the science of information, an area which tries to understand the fundamental nature of information: what it is, how we store it, how we communicate it, and so on. The whole field (along with some of its most fundamental results) was laid down in a single seminal paper, A Mathematical Theory of Communication, by Claude Shannon in 1948.

Rather than talking about information in philosophical terms, we are going to understand information theory mathematically. We are going to first define some quantities and properties central to information theory. As we will see, understanding the mathematical properties of these quantities is what will lead us to the intuition and understanding of information theory.

Entropy

So, let's get started: the first property, or information-theoretic measure, we are going to define is the entropy. The entropy, although a term overloaded with a lot of meaning in (math/science) pop culture, has a precise definition in information theory.

Entropy

Let $X$ be a random variable with alphabet $\mathcal{X}$ and discrete probability distribution $p$, i.e. one can imagine the samples generated by the random variable $X$ to be independent and identically distributed according to $p$.

Then the entropy of the random variable $X$ is defined as:

$$H(X) = \sum_{x \in \mathcal{X}} p(x) \log_2 \frac{1}{p(x)}$$

Some comments:

  • As you can see, $H(X)$ is a simple function of the distribution $p$. So, some works also denote it as $H(p)$. We will use both of these notations interchangeably; but note that the notation $H(X)$ is more precise and will be more helpful when we deal with more complex distributions (like Markov distributions).
  • Another way to look at $H(X)$ is as an expectation over the negative log-likelihood function: $H(X) = E\left[\log_2 \frac{1}{p(X)}\right]$. Although this feels a bit weird, the negative log-likelihood is a completely legitimate function of the random variable $X$. All of these different viewpoints are useful in different scenarios.
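As a quick illustration (our own snippet), the entropy is straightforward to compute numerically from a distribution:

```python
import math

def entropy(prob_dict):
    # H(X) = sum_x p(x) * log2(1/p(x)), skipping zero-probability symbols
    return sum(p * math.log2(1 / p) for p in prob_dict.values() if p > 0)

print(entropy({"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}))   # 2.0 (uniform over 4 symbols)
print(entropy({"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}))  # 1.75 (a dyadic distribution)
```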

Properties of Entropy

Now that we have defined entropy, let's take a look at some properties of $H(X)$. To prove most of these properties, we are going to need a simple but very profound property of convex functions: Jensen's inequality.

Jensen's inequality

Jensen's inequality

Let $f$ be a convex function, and let $X$ be a random variable taking values in $\{a_1, a_2, \ldots, a_k\}$ with discrete probability distribution $(p_1, p_2, \ldots, p_k)$. Then:

$$E[f(X)] \geq f(E[X])$$

Similarly, if $f$ is a concave function, then the reverse inequality is true:

$$E[f(X)] \leq f(E[X])$$

In both cases (for strictly convex/concave $f$), equality holds iff (if and only if) $X$ is a constant.

Jensen's inequality can also be understood succinctly using the figure below. For clarity, let's assume that $X$ takes values in $\{a_1, a_2\}$ with probabilities $(p_1, p_2)$.

Jensen's inequality

  • If one looks at the LHS, $E[f(X)] = p_1 f(a_1) + p_2 f(a_2)$, it can be understood as simply a weighted average with weights $(p_1, p_2)$ over the points $f(a_1), f(a_2)$. This weighted average corresponds to a point somewhere on the segment joining the points $(a_1, f(a_1))$ and $(a_2, f(a_2))$.
  • If one looks at the RHS, $f(E[X]) = f(p_1 a_1 + p_2 a_2)$, it corresponds to the value of the convex function $f$ at the weighted-average point $p_1 a_1 + p_2 a_2$. It is clear from the figure that, because of the "bowl-shaped" nature of $f$, $f(E[X]) \leq E[f(X)]$.
  • It is also clear from the figure why the equality condition holds true iff $a_1 = a_2$, i.e. iff $X$ is a constant.

We won't prove the Jensen's inequality here, but it is useful to understand and visualize the inequality. For a rigorous proof look at the wiki article on Jensen's Inequality. In the simple discrete case, it can be proved using induction.

Okay, we are all set to look at some properties of entropy. As you can guess, in our case we will specifically be using Jensen's inequality on the concave function $f(x) = \log_2 x$.

1. Non-negativity of Entropy: $H(X) \geq 0$:

It is easy to see that $H(X) \geq 0$, since $\log_2 \frac{1}{p(x)} \geq 0$, as $p(x) \leq 1$ for every $x$.

2. Upper bound on Entropy: $H(X) \leq \log_2 |\mathcal{X}|$:

We know that $H(X) \geq 0$. We can also compute an upper bound on $H(X)$:

$$H(X) = E\left[\log_2 \frac{1}{p(X)}\right] \leq \log_2 E\left[\frac{1}{p(X)}\right] = \log_2 \left(\sum_{x \in \mathcal{X}} p(x) \cdot \frac{1}{p(x)}\right) = \log_2 |\mathcal{X}|$$

The first inequality follows from Jensen's inequality for the concave function $\log_2 x$. Also, note that equality is possible only if $\frac{1}{p(X)}$ is a constant, which implies that $p(x) = \frac{1}{|\mathcal{X}|}$ for all $x$, i.e. the distribution is uniform.

3. Joint entropy of independent random variables:

If we have two random variables $X_1, X_2$, we can define their joint entropy simply as the entropy of the pair $(X_1, X_2)$. The joint entropy is denoted by $H(X_1, X_2)$ and can be expanded as

$$H(X_1, X_2) = \sum_{x_1, x_2} p(x_1, x_2) \log_2 \frac{1}{p(x_1, x_2)}$$

Now if $X_1$ and $X_2$ are independent random variables, then $p(x_1, x_2) = p(x_1)\, p(x_2)$, and so

$$H(X_1, X_2) = \sum_{x_1, x_2} p(x_1)\, p(x_2) \left[\log_2 \frac{1}{p(x_1)} + \log_2 \frac{1}{p(x_2)}\right] = H(X_1) + H(X_2)$$

Thus, the joint entropy of independent random variables is simply the sum of their individual entropies. This generalizes to $n$ independent random variables using induction. In particular, if the $n$-tuple $(X_1, \ldots, X_n)$ consists of i.i.d. random variables distributed like $X$, then $H(X_1, \ldots, X_n) = n H(X)$. The general case of dependent random variables and their joint entropy is studied in a later chapter.

4. Entropy as a lower bound: $H(X) \leq E\left[\log_2 \frac{1}{q(X)}\right]$ for any distribution $q$:

Let us spell out this property in a bit more detail. Given any probability distribution $q$ on $\mathcal{X}$, the following is true:

$$H(X) \leq E\left[\log_2 \frac{1}{q(X)}\right]$$

with equality iff $q = p$. In other words, we want to show that

$$E\left[\log_2 \frac{1}{q(X)}\right] - H(X) = -E\left[\log_2 \frac{q(X)}{p(X)}\right] \geq 0$$

Okay, so proving the lower bound property is equivalent to showing that $E\left[\log_2 \frac{q(X)}{p(X)}\right] \leq 0$. Any idea how we can do this?

The idea is again to use Jensen's inequality:

$$E\left[\log_2 \frac{q(X)}{p(X)}\right] \leq \log_2 E\left[\frac{q(X)}{p(X)}\right] = \log_2 \left(\sum_{x} p(x)\, \frac{q(x)}{p(x)}\right) = \log_2 \left(\sum_{x} q(x)\right) = \log_2 1 = 0$$

Note that we used Jensen's inequality (for the concave function $\log_2 x$) in the step above; the sign of the inequality in our original claim is then reversed because of the negation. We also know that equality holds iff $\frac{q(X)}{p(X)}$ is a constant, i.e. iff $q = p$!

Alright! This proves our property, as

$$H(X) \leq E\left[\log_2 \frac{1}{q(X)}\right], \quad \text{with equality iff } p = q$$

Let's pause a bit and think about what this property is telling us:

  • Essentially, we are saying that $p$ is the solution to the minimization problem $\min_{q} E\left[\log_2 \frac{1}{q(X)}\right]$. We will see that this characterization is often quite useful.
  • Another corollary we proved is that for any two distributions $p, q$, we have $E_p\left[\log_2 \frac{p(X)}{q(X)}\right] \geq 0$. This quantity is also known as the KL-Divergence between the two distributions.

KL-Divergence

Let $p$ and $q$ be two given probability distributions over an alphabet $\mathcal{X}$. Then the KL-Divergence is defined as:

$$D(p\|q) = \sum_{x \in \mathcal{X}} p(x) \log_2 \frac{p(x)}{q(x)}$$

The following property holds for $D(p\|q)$:

$$D(p\|q) \geq 0$$

Equality holds iff $p = q$.

The KL-Divergence is a very important information theoretic quantity, which we will revisit!
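As a small numerical illustration (our own snippet, assuming $q(x) > 0$ wherever $p(x) > 0$):

```python
import math

def kl_divergence(p, q):
    # D(p||q) = sum_x p(x) * log2(p(x)/q(x))
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

p = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
q = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}
print(kl_divergence(p, q))  # 0.25 > 0, since p != q
print(kl_divergence(p, p))  # 0.0, equality iff the two distributions are equal
```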


Okay, now that we have defined and got a bit familiar with some of the information theoretic measures, we are all set to understand where the thumb rule comes from!

To do so, let's try to answer the question:

Question: Optimal prefix-free code

Given a probability distribution $p = (p_1, p_2, \ldots, p_k)$,

  1. What is the optimal prefix-free code for this distribution?
  2. What is the best compression it can achieve?

Optimal prefix-free codes

The question seems quite daunting! How can we even go about approaching this one? The space of all prefix-free codes is quite large, and not very structured.

Luckily, based on the converse of the Kraft inequality, we know that the space of all prefix-free codes can be characterized by the inequality:

$$\sum_{i=1}^{k} 2^{-l_i} \leq 1$$

where $l_1, l_2, \ldots, l_k$ represent the code-lengths (i.e. the depths of the leaf nodes in the prefix-free tree).

Thus, the problem of finding the optimal prefix-free code given a probability distribution can be formulated as a concise optimization problem.

Optimal prefix-free code

The problem of finding the optimal prefix-free code for a given probability distribution can be formulated as:

Given probabilities $p_1, p_2, \ldots, p_k$, solve for code-lengths $l_1, l_2, \ldots, l_k$ such that:

$$\text{minimize } \sum_{i=1}^{k} p_i l_i \quad \text{ subject to } \quad \sum_{i=1}^{k} 2^{-l_i} \leq 1, \;\; l_i \in \mathbb{N}$$

Quite elegant! We have disassociated the problem of finding optimal prefix-free codes from unclear search spaces over trees and so on. Instead, we now just have this concrete optimization problem. Unfortunately, we have integers to optimize over, and integer optimization problems might not be tractable in general. In this case, however, we can indeed solve this problem! We will look at a solution in the next lecture.

For this lecture, our goal is to see what the best average compression performance we can achieve is, i.e. what is the solution (or at least a lower bound) to the minimization objective $\sum_{i=1}^{k} p_i l_i$.

Lower bounding the average codelength

Okay, let's try to see how we can obtain a lower bound for the average codelength $\sum_{i=1}^{k} p_i l_i$, given the constraint that $\sum_{i=1}^{k} 2^{-l_i} \leq 1$.

Let's start by adding a dummy length $l_{k+1}$, so that $\sum_{i=1}^{k+1} 2^{-l_i} = 1$. The benefit of this formulation is that we can now imagine the values $2^{-l_i}$ as probability values, and use the plethora of properties we proved earlier! Note that $l_{k+1}$ might not be an integer, but that is okay.

Let's also define $q_i = 2^{-l_i}$, so that the new objective $\sum_{i=1}^{k} p_i \log_2 \frac{1}{q_i}$ is equal to the average codelength.

Okay, now that we have the setup ready, let us try to analyze the average codelength:

$$\sum_{i=1}^{k} p_i l_i = \sum_{i=1}^{k} p_i \log_2 \frac{1}{2^{-l_i}} = \sum_{i=1}^{k} p_i \log_2 \frac{1}{q_i}$$

All we have done until now is some basic math yoga, but the outcome is that we have transformed the objective into a form more amenable for our purpose. Do you now see the lower bound?

Using property 4 of entropy (entropy as a lower bound) for the distributions $p$ and $q$ defined above, we can now show that:

$$\sum_{i=1}^{k} p_i l_i = \sum_{i=1}^{k} p_i \log_2 \frac{1}{q_i} \geq H(X)$$

Thus, the lower bound on the average codelength of any prefix-free code is the entropy $H(X)$! It is also easy to see that equality holds if and only if $l_i = \log_2 \frac{1}{p_i}$ for all $i$.

To summarize the important theorem:

Theorem-3: Entropy as a lower bound on avg. codelength

Given a distribution $p$ for the data, any prefix-free code with code-lengths $l_1, l_2, \ldots, l_k$ will have average codelength lower bounded by the entropy $H(p)$ of the source distribution:

$$\sum_{i=1}^{k} p_i l_i \geq H(p)$$

Equality holds if and only if $l_i = \log_2 \frac{1}{p_i}$ for all $i$, i.e.:

$$p_i = 2^{-l_i}$$

Some observations/comments:

1. Thumb-rule justification

Notice that the equality condition tells us that optimal compression can be achieved by a prefix-free code iff:

$$l_i = \log_2 \frac{1}{p_i}$$

and this is the justification for the thumb rule we discussed in the previous lecture and used for the construction of the Shannon code.

2. Shannon code performance

Another point to note is that we can never achieve an average codelength equal to the entropy $H(p)$ unless $\log_2 \frac{1}{p_i}$ is an integer for every $i$, i.e. all the probabilities are negative powers of $2$. This can be quite unlikely in real life, which is a bit of a bummer.

The positive news is that we can in fact always get quite close to $H(p)$.

Let's analyze the performance of the Shannon code. We know that the Shannon code has code-lengths $l_i = \left\lceil \log_2 \frac{1}{p_i} \right\rceil$, thus the average codelength is:

$$\sum_{i=1}^{k} p_i \left\lceil \log_2 \frac{1}{p_i} \right\rceil < \sum_{i=1}^{k} p_i \left(\log_2 \frac{1}{p_i} + 1\right) = H(p) + 1$$

Thus, the Shannon code guarantees that we can always achieve an average codelength within 1 bit of the entropy (and hence within 1 bit of the optimal). This is quite promising!

Essentially, unless the entropy is very low, we are not missing much by using Shannon codes.
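To see the $H(p) \leq L < H(p) + 1$ guarantee numerically, here is a self-contained check (our own snippet):

```python
import math

def entropy(p):
    return sum(px * math.log2(1 / px) for px in p.values() if px > 0)

def shannon_avg_codelength(p):
    # Shannon code lengths: l(x) = ceil(log2(1/p(x)))
    return sum(px * math.ceil(math.log2(1 / px)) for px in p.values() if px > 0)

p = {"A": 0.55, "B": 0.25, "C": 0.1, "D": 0.1}
H, L = entropy(p), shannon_avg_codelength(p)
print(H, L)            # approximately 1.64 and 1.85 bits/symbol
assert H <= L < H + 1  # the Shannon code is within 1 bit of the entropy
```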

3. Coding in blocks

As discussed above, Shannon codes only guarantee an average codelength within 1 bit of the entropy, and this gap can be too much in some scenarios, especially with highly compressible data. In such cases, one common strategy is to code in blocks. This simply means we treat an input $n$-tuple as a single symbol from a bigger alphabet, and apply the Shannon code to this bigger alphabet. So we split the input stream into blocks of $n$ symbols, encode each block, and concatenate the results. Here again, Shannon codes get us to within 1 bit of the entropy, and hence we have (denoting the expected code length of a block by $L_n$):

$$H(X_1, \ldots, X_n) \leq L_n < H(X_1, \ldots, X_n) + 1$$

Using the properties of joint entropy for i.i.d. random variables ($H(X_1, \ldots, X_n) = n H(X)$), and dividing by $n$, we get

$$H(X) \leq \frac{L_n}{n} < H(X) + \frac{1}{n}$$

which means coding in blocks of $n$ brings us within $\frac{1}{n}$ bits of the entropy per symbol. This means we can get arbitrarily close to the entropy using block coding, and hence entropy is not only a lower bound, it is also achievable. In the next chapter, we will look at some examples of this phenomenon, where we'll see how the average code length improves as we code with bigger and bigger blocks. But there is no free lunch with block coding! Note that a naive implementation leads to a complexity exponential in $n$: if the per-symbol alphabet size is $k$, then with block coding the number of symbols the encoder needs to handle expands to $k^n$. Moreover, we now have a decoding delay of $n$, since we cannot decode a single symbol until we receive the entire block. In later chapters, we will look at clever implementations that successfully get the benefits of block coding while reducing this penalty.
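Here is a small sketch (ours) of the block-coding effect for i.i.d. symbols: we build the product distribution over $n$-tuples, apply the Shannon codelengths to it, and look at the resulting bits per original symbol.

```python
import math
from itertools import product

def shannon_bits_per_symbol(p, n):
    # product distribution over n-tuples of i.i.d. symbols, then Shannon codelengths
    avg_block_len = 0.0
    for tup in product(p, repeat=n):
        prob = math.prod(p[s] for s in tup)
        avg_block_len += prob * math.ceil(math.log2(1 / prob))
    return avg_block_len / n  # bits per (original) symbol

p = {"A": 0.55, "B": 0.25, "C": 0.1, "D": 0.1}
for n in [1, 2, 4]:
    # each value is guaranteed to lie within 1/n bits of the entropy (~1.64 bits)
    print(n, shannon_bits_per_symbol(p, n))
```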

4. Entropy as a lower bound for uniquely decodable codes

We showed that lower bounds the average code-lengths for prefix-free codes. But, what about general symbol codes which are uniquely decodable (or lossless)?

We saw that even for general uniquely decodable symbol codes, the Kraft inequality holds true, i.e.:

$$\sum_{i=1}^{k} 2^{-l_i} \leq 1$$

Notice that this is precisely the only property we used when proving the lower bound on the average codelength, and so entropy is in fact a lower bound on the average codelength of general uniquely decodable codes as well.

5. Mismatched coding and KL-divergence

We saw above that a code optimized for a source distribution $p$ will have code lengths $l_i = \left\lceil \log_2 \frac{1}{p_i} \right\rceil$. This achieves an expected code length close to the entropy:

$$\sum_{i=1}^{k} p_i \left\lceil \log_2 \frac{1}{p_i} \right\rceil \approx H(p)$$

Now, consider a scenario where we optimize the code for a source distribution $q$ but then use it for a source with distribution $p$. In this case, the code lengths will be $\left\lceil \log_2 \frac{1}{q_i} \right\rceil$ and the average code length will be

$$\sum_{i=1}^{k} p_i \left\lceil \log_2 \frac{1}{q_i} \right\rceil \approx \sum_{i=1}^{k} p_i \log_2 \frac{1}{q_i} = H(p) + D(p\|q)$$

In this scenario, the optimal average code length was $H(p)$, but the actual average code length exceeds it by roughly $D(p\|q)$. Thus, the KL-divergence is the penalty for mismatched coding, i.e. for using the incorrect distribution when designing the code. Later in the class we will study universal codes, which imply the existence of distributions that can closely approximate any distribution from a large class, meaning that a single code (over a block) is able to perform well irrespective of the actual source distribution.
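A quick numerical sketch (ours) of the mismatch penalty, ignoring the integer rounding of the codelengths: the average codelength when coding a $p$-source with a code designed for $q$ matches $H(p) + D(p\|q)$.

```python
import math

def entropy(p):
    return sum(px * math.log2(1 / px) for px in p.values() if px > 0)

def kl_divergence(p, q):
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

p = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}  # actual source distribution
q = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}   # distribution the code was designed for

# ideal (non-integer) codelengths for q: l(x) = log2(1/q(x))
mismatched_avg = sum(px * math.log2(1 / q[x]) for x, px in p.items())
print(mismatched_avg)                    # 2.0 bits/symbol
print(entropy(p) + kl_divergence(p, q))  # 2.0 = H(p) + D(p||q) = 1.75 + 0.25
```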

Entropy beyond data compression

Given a random variable $X$ over $\mathcal{X}$ with distribution $p$, we know that the fundamental limit on the compression of the random variable using symbol codes is the entropy $H(X)$.

$H(X)$ is sometimes also referred to as the self-information of the random variable $X$, as in a way we are saying that the amount of information (on average) in each instantiation of the random variable is $H(X)$ bits. This is quite natural, as we should not be able to compress an input beyond the amount of information it is worth.

The discussion, although a bit hand-wavy, is quite fundamental. Due to this, we see $H(X)$ popping up in lots of other places as the fundamental limit of a problem, and not just in the case of compression.

For example, here are a few scenarios for which the answer is related to $H(X)$. Try to solve them for fun!

1. Simulating non-uniform random variables: Let's say you want to simulate a random variable $X$ over $\mathcal{X}$ with distribution $p$. To simulate the random variable, you are only allowed to use fair coin tosses. In that case, what is the fundamental limit on the average number of coins you would have to toss to simulate $X$?

2. Simulating a fair coin using a biased coin: Let's now say you have a biased coin with probabilities of H/T equal to $p$ and $1 - p$ respectively. The goal is to use this coin to simulate a fair coin. One simple way to do this is to toss the coin twice, and associate (H,T) with heads and (T,H) with tails. If we get (H,H) or (T,T), then we discard the result and toss again. Is this the optimal strategy though? Can you do better? What is the fundamental limit on the average number of biased coin tosses you would need to obtain one fair coin flip?

3. Biased binary search: We are familiar with the fact that, given a sorted array of size $n$, we can locate a query in one of the $n$ bins using $\log_2 n$ comparisons. However, a lot of times we already have a good idea of which bins the query might lie in. Can we accelerate the binary search in this case? To be more precise, let's say you have a prior belief that the query lies in bin $i$ with probability $p_i$ (usually this belief is uniform, i.e. $p_i = \frac{1}{n}$). In such a scenario, what is the fundamental limit on the average number of comparisons you have to perform to locate the query?

Although all these scenarios seem very different, they are all essentially related to the entropy $H(X)$, and use the property that the self-information corresponding to an outcome $x$ of the random variable is $\log_2 \frac{1}{p(x)}$ bits. This key intuition led to a very diverse and vibrant area of mathematics called Information Theory. What we saw in today's lecture is just a small glimpse into information theory; if you are interested, please take a look at the course EE 276 and the resources listed there.

Summary and next lecture

To summarize this lecture:

  1. Entropy: We took a detour into information theory and understood some properties of the entropy $H(X)$.
  2. Entropy as a limit of compression: We derived the properties of $H(X)$ and used them to show that $H(X)$ is the fundamental limit on compression, i.e. the average codelength of any prefix-free code designed for compressing a given source is always lower bounded by $H(X)$.

Although we were able to show that $H(X)$ is the fundamental limit on the average codelength, and that we can always come within 1 bit of this fundamental limit using Shannon codes, it is a bit unsatisfactory, as we don't know whether the Shannon code is the optimal prefix-free code.

In the next lecture, we are going to discuss the Huffman code, which is the answer to the question:

Given a random variable $X$ with samples drawn i.i.d. according to a probability distribution $p$, what is the best prefix-free code for this source?

Huffman Code

Introduction

As a recap, in the last lecture we formulated the problem of finding the optimal prefix-free code given a probability distribution as a concrete optimization problem as follows:

Optimal prefix-free code

The problem of finding the optimal prefix-free code for a given probability distribution can be formulated as:

Given probabilities $p_1, p_2, \ldots, p_k$, solve for code-lengths $l_1, l_2, \ldots, l_k$ such that:

$$\text{minimize } \sum_{i=1}^{k} p_i l_i \quad \text{ subject to } \quad \sum_{i=1}^{k} 2^{-l_i} \leq 1, \;\; l_i \in \mathbb{N}$$

We also showed that the optimal average codelength is lower bounded by the entropy $H(p)$ of the source (i.e. $\sum_i p_i l_i \geq H(p)$). In this lecture we are going to discuss the famous Huffman code, which is in fact (one of) the solutions to the optimization problem above!


The problem of finding optimal prefix-free codes eluded researchers for quite some time, until David Huffman, then a mere graduate student in the EE department at MIT, solved it as his course project! Here is an anecdote about the discovery of Huffman codes, borrowed from the MAA webpage.

"The story of the invention of Huffman codes is a great story that demonstrates that students can do better than professors. David Huffman (1925-1999) was a student in an electrical engineering course in 1951. His professor, Robert Fano, offered students a choice of taking a final exam or writing a term paper. Huffman did not want to take the final so he started working on the term paper. The topic of the paper was to find the most efficient (optimal) code. What Professor Fano did not tell his students was the fact that it was an open problem and that he was working on the problem himself. Huffman spent a lot of time on the problem and was ready to give up when the solution suddenly came to him. The code he discovered was optimal, that is, it had the lowest possible average message length. The method that Fano had developed for this problem did not always produce an optimal code. Therefore, Huffman did better than his professor. Later Huffman said that likely he would not have even attempted the problem if he had known that his professor was struggling with it."

Huffman code construction

Even before we get started, it is important to note that a Huffman code is one of the optimal prefix-free codes for a given distribution, and that there are multiple such optimal codes. It is easy to see why: for example, given the codewords 0, 10, 110, ... from a Huffman code, we can just switch 0s with 1s and obtain another legitimate optimal prefix-free code (1, 01, 001, ...). So, we are going to discuss one of the many possible optimal prefix-free code constructions.

Let's first discuss the construction of the Huffman codes, and then we will discuss the optimality. Let's start with an example probability distribution:

probs_dict = {"A": 0.35, "B": 0.25, "C": 0.2, "D": 0.12, "E": 0.08}

STEP 0: The initial step is to first build singleton nodes from these probabilities.

graph TD
  N1["A<br/>(p=0.35)"]:::endnode
  N2("B<br/>(p=0.25)"):::endnode
  N3("C<br/>(p=0.2)"):::endnode
  N4("D<br/>(p=0.12)"):::endnode
  N5("E<br/>(p=0.08)"):::endnode

Note that each Node has the following structure.

class Node:
    str symbol_name
    float prob
    Node left = None
    Node right = None

Each node has symbol_name and prob fields. The node also has left and right fields, which are pointers to its children.

node_list = [Node(A, 0.35), Node(B, 0.25), Node(C, 0.2), Node(D, 0.12), Node(E, 0.08)]

Essentially we now have 5 nodes (or binary trees with a single node). The Huffman tree construction works by joining these nodes in a recursive fashion using the next 2 steps, to construct a single tree.

Step 1: We pop out the two nodes with the smallest probability from the node_list. In our example, these are Node(D, 0.12) and Node(E, 0.08)

Thus, the node_list now looks like:

node_1, node_2 = Node(D, 0.12), Node(E, 0.08)
node_list = [Node(A, 0.35), Node(B, 0.25), Node(C, 0.2)]

Step 2: In the next step, we join the two popped nodes and create a new node as a parent of these two nodes. The probability of the new node is equal to the sum of the two children nodes. In our example, the new node is node N1 and has prob = 0.12 + 0.08 = 0.2.

This new node is re-added to the node_list.
So, now our node list is:

## Iter 1
node_list = [Node(A, 0.35), Node(B, 0.25), Node(C, 0.2), Node(N1, 0.2)]
graph TD
  N1["A<br/>(p=0.35)"]:::endnode
  N2("B<br/>(p=0.25)"):::endnode
  N3("C<br/>(p=0.2)"):::endnode
  N4("D<br/>(p=0.12)"):::endnode
  N5("E<br/>(p=0.08)"):::endnode
  N6("N1<br/>(p=0.2)") --> N4
  N6 --> N5

  style N6 fill:#dddddd

Let's pause and see what the two steps above did: Steps 1 and 2 reduced the length of node_list by 1 (as we removed 2 nodes and added 1). We can thus apply Step 1 and Step 2 recursively to obtain a single node/tree. The resulting final tree is our Huffman tree!

## Iter 2
node_list = [Node(A, 0.35), Node(B, 0.25), Node(N2, 0.4)]
graph TD
  N1["A<br/>(p=0.35)"]:::endnode
  N2("B<br/>(p=0.25)"):::endnode
  N3("C<br/>(p=0.2)"):::endnode
  N4("D<br/>(p=0.12)"):::endnode
  N5("E<br/>(p=0.08)"):::endnode
  N6("N1<br/>(p=0.2)") --> N4
  N6 --> N5
  N7("N2<br/>(p=0.4)") --> N6
  N7 --> N3

  style N6 fill:#dddddd
  style N7 fill:#dddddd
## Iter 3
node_list = [Node(N2, 0.4), Node(N3, 0.6)]
graph TD
  N1["A<br/>(p=0.35)"]:::endnode
  N2("B<br/>(p=0.25)"):::endnode
  N3("C<br/>(p=0.2)"):::endnode
  N4("D<br/>(p=0.12)"):::endnode
  N5("E<br/>(p=0.08)"):::endnode
  N6("N1<br/>(p=0.2)") --> N4
  N6 --> N5
  N7("N2<br/>(p=0.4)") --> N6
  N7 --> N3
  N8("N3<br/>(p=0.6)") --> N1
  N8 --> N2

  style N6 fill:#dddddd
  style N7 fill:#dddddd
  style N8 fill:#dddddd
## Iter 4
node_list = [Node("*", 1.0)]
graph TD
  N1["A<br/>(p=0.35)"]:::endnode
  N2("B<br/>(p=0.25)"):::endnode
  N3("C<br/>(p=0.2)"):::endnode
  N4("D<br/>(p=0.12)"):::endnode
  N5("E<br/>(p=0.08)"):::endnode
  N6("N1<br/>(p=0.2)") --> N4
  N6 --> N5
  N7("N2<br/>(p=0.4)") --> N6
  N7 --> N3
  N8("N3<br/>(p=0.6)") --> N1
  N8 --> N2
  N9("*<br/>(p=1.0)") --> N8
  N9 --> N7

  style N6 fill:#dddddd
  style N7 fill:#dddddd
  style N8 fill:#dddddd
  style N9 fill:#dddddd

Thus, our procedure for constructing the Huffman tree can be described in pseudo-code as follows:


# Input -> given prob distribution
probs_dict = {A1: p_1, A2: p_2, ..., Ak: p_k} # p_1, .. are some floating point values

class Node:
    str symbol_name
    float prob
    Node left = None
    Node right = None

    
def build_huffman_tree(probs_dict):
    ## STEP 0: initialize the node_list with singleton nodes
    node_list = [Node(s, prob) for s, prob in probs_dict.items()]

    # NOTE: at each iteration, we reduce the length of node_list by 1
    while len(node_list) > 1:
        ## STEP 1: sort node_list based on probability,
        # and pop the two smallest nodes
        node_list = sorted(node_list, key=lambda node: node.prob)
        node_0 = node_list.pop(0)  # smallest probability node
        node_1 = node_list.pop(0)  # second smallest probability node

        ## STEP 2: Merge the two nodes into a single new node, whose probability
        # is the sum of the two children probabilities
        new_node_prob = node_0.prob + node_1.prob
        new_node = Node(symbol_name="", prob=new_node_prob, left=node_0, right=node_1)
        node_list.append(new_node)

    # finally we will have a single node/tree; return it, as it is the
    # root node of our Huffman tree
    return node_list[0]

Quite cool! We are essentially constructing the tree by stitching nodes together one at a time, until we have a single tree. Nodes with smaller probabilities get merged into a sub-tree earlier, and thus will likely have higher depth in the final tree (and thus longer codelengths). As we will see, this greedy approach is in fact optimal!

Once the Huffman tree is constructed, we can use this tree as a prefix-free code for encoding/decoding data. For example, in our case the table we get is:

| Symbol | Codeword |
|--------|----------|
| A      | 00       |
| B      | 01       |
| C      | 10       |
| D      | 110      |
| E      | 111      |
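Given the tree returned by build_huffman_tree, the codeword table can be read off with a simple traversal. Here is a sketch (ours), assuming a runnable Python version of the Node class and the build_huffman_tree function above:

```python
def get_codewords(node, prefix="", codewords=None):
    # walk the tree; the path of 0s (left) and 1s (right) to each leaf is its codeword
    if codewords is None:
        codewords = {}
    if node.left is None and node.right is None:  # leaf node
        codewords[node.symbol_name] = prefix
        return codewords
    if node.left is not None:
        get_codewords(node.left, prefix + "0", codewords)
    if node.right is not None:
        get_codewords(node.right, prefix + "1", codewords)
    return codewords

root = build_huffman_tree({"A": 0.35, "B": 0.25, "C": 0.2, "D": 0.12, "E": 0.08})
# the exact 0/1 labels depend on tie-breaking, but the codelengths come out
# as {A: 2, B: 2, C: 2, D: 3, E: 3}, matching the table above
print(get_codewords(root))
```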

Here is another example below of a Huffman code constructed on typical English-language text. Notice that vowels and punctuation have much shorter codelengths, as they occur quite frequently.

img2

img1

Optimizing the tree construction

There are certain optimizations which can be performed in the Huffman tree construction:

For example, STEP 1 in the pseudocode can be implemented with a priority queue (min-heap), as we only need the two smallest elements and nothing more. Here is the full implementation in the Stanford Compression Library: https://github.com/kedartatwawadi/stanford_compression_library/blob/main/compressors/huffman_coder.py.

Optimality of Huffman codes

To get some intuition into the optimality of Huffman codes, let's think about what the necessary conditions are for a prefix-free code to be optimal. Let me list a couple of conditions, and we will see why they are true:

Theorem-8: Necessary conditions for prefix-free code optimality

Any optimal prefix-free code for a distribution must satisfy the following conditions:

Condition 1: The code-lengths are ordered in inverse order of the probabilities, i.e. $p_i > p_j \implies l_i \leq l_j$.

Condition 2: The two longest codewords have the same code-length.

Condition 1 proof

Condition 1 is quite obvious but fundamental: it just says that if the probability of a symbol is higher, then it necessarily has to have a shorter (or equal-length) codeword. We have seen Condition 1 before, as a thumb rule, but let's show explicitly that it holds here. Let's assume that our code $C$ is optimal, but that for some pair of symbols with $p_i > p_j$ we have $l_i > l_j$. Let the average codelength of code $C$ be $L$. Now, let's consider the prefix-free code $\hat{C}$ obtained by exchanging the codewords corresponding to $i$ and $j$. Thus, the new code-lengths $\hat{l}_r$ are:

$$\hat{l}_r = \begin{cases} l_j & \text{if } r = i \\ l_i & \text{if } r = j \\ l_r & \text{otherwise} \end{cases}$$

The new average codelength is:

$$\hat{L} = \sum_{r} p_r \hat{l}_r = L + p_i (l_j - l_i) + p_j (l_i - l_j) = L - (p_i - p_j)(l_i - l_j)$$

As $p_i > p_j$ and $l_i > l_j$, this implies that $\hat{L} < L$, which contradicts our assumption that code $C$ is optimal. This proves Condition 1.

Condition 2 proof

Condition 2 is equally simple to prove, again by contradiction. Suppose the longest codeword $c_{\max}$, of length $l_{\max}$, is unique, i.e. there is no other codeword of length $l_{\max}$. Then we can drop the last bit of $c_{\max}$ and the resulting code will still be a valid prefix-free code. Thus, we can assign the corresponding symbol the shorter codeword of length $l_{\max} - 1$, and we get a shorter average codelength!

As we can see, these two properties are satisfied by the construction procedure of Huffman coding.

  1. The symbols with higher probability are merged into the tree later, so they end up closer to the root and their code-lengths are shorter.
  2. We always choose the two symbols with the smallest probability and combine them, so the Condition 2 is also satisfied.

Note that these are just observations about Huffman construction and don't prove its optimality. The optimality proof is a bit more technical, and we refer you to the Cover, Thomas text, section 5.8.

The main property is that after the step of merging the two smallest-probability symbols of the distribution $P = \{p_1, \ldots, p_k\}$, the remaining tree is still optimal for the reduced distribution $P' = \{p_1, \ldots, p_{k-2}, p_{k-1}+p_k\}$ (where $p_{k-1}, p_k$ are the two smallest probabilities).

Practical prefix-free coding

Huffman codes are widely used in practice due to their optimality and relatively convenient construction.

A lot of times in practice, the data is split into streams which, after multiple transformations, need to be losslessly compressed. This is where Huffman codes are typically used.

Fast Prefix-free decoding

In practice, Huffman coders lie at the end of many compression pipelines and need to be extremely fast. Let's look at the typical encoding/decoding and see how we can speed it up.

For simplicity and clarity we will work with a small example. Let's say this is our given code:

encoding_table = {"A": 0, "B": 10, "C": 110, "D": 111}
  1. The encoding itself is quite simple and involves looking at a lookup table.

## Prefix-free code encoding
encoding_table = {"A": "0", "B": "10", "C": "110", "D": "111"}
def encode_symbol(s):
    bitarray = encoding_table[s]  # codeword for symbol s
    return bitarray, len(bitarray)
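
Encoding an entire block is then just a matter of concatenating the per-symbol codewords; here is a minimal sketch (representing bits as Python strings for clarity, rather than the bitarray type used in SCL):

## encoding a block by concatenating codewords (sketch)
def encode_block(symbols):
    # prefix-freeness guarantees the concatenation is uniquely decodable
    return "".join(encoding_table[s] for s in symbols)

assert encode_block("ABAD") == "0100111"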

The typical prefix-free code decoding works by traversing the prefix-free tree until we reach a leaf node, e.g. see the code in SCL: https://github.com/kedartatwawadi/stanford_compression_library/blob/main/compressors/prefix_free_compressors.py. Here is pseudo-code for the decoding algorithm:

## Prefix-free code decoding: 
prefix_free_tree = ... # construct a tree from the encoding_table
def decode_symbol_tree(encoded_bitarray):
    # initialize num_bits_consumed
    num_bits_consumed = 0

    # continue decoding until we reach leaf node
    curr_node = prefix_free_tree.root_node
    while not curr_node.is_leaf_node:
        bit = encoded_bitarray[num_bits_consumed]
        if bit == 0:
            curr_node = curr_node.left_child
        else:
            curr_node = curr_node.right_child
        num_bits_consumed += 1

    # as we reach the leaf node, the decoded symbol is the id of the node
    decoded_symbol = curr_node.id
    return decoded_symbol, num_bits_consumed

Although this is quite simple, the loop involves a lot of branching (the if/else conditions), and branches are relatively slow on modern architectures. So, here is an alternative decoding procedure for this example:

## faster decoding using a lookup table
state_size = 3  # maximum codeword length (max depth of the prefix-free tree)
decode_state = {
    "000": "A",
    "001": "A",
    "010": "A",
    "011": "A",
    "100": "B",
    "101": "B",
    "110": "C",
    "111": "D",
}
codelen_table = {
    "A": 1,
    "B": 2,
    "C": 3,
    "D": 3,
}
def decode_symbol_table(encoded_bitarray):
    state = encoded_bitarray[:state_size]
    decoded_symbol = decode_state[state]
    num_bits_consumed = codelen_table[decoded_symbol]
    return decoded_symbol, num_bits_consumed

Looking at the new decoding pseudo-code, we can make the following observations:

  1. The state_size=3 corresponds to the maximum depth of the prefix-free tree.
  2. The state is simply the next 3 bits of the encoded_bitarray. We use the state as an index into the decode_state lookup table to obtain the decoded_symbol.
  3. As the state sometimes reads in more bits than what the encoder wrote for this symbol, we need to report num_bits_consumed correctly as codelen_table[decoded_symbol].

One natural question is: why does the decoded_symbol = decode_state[state] operation output the correct symbol? The reason is the prefix-free nature of the code. For example, if we look at the entries in the decode_state table corresponding to B, we see that these are 100 and 101, both of which start with 10, the codeword corresponding to B.

At first look, decoding is indeed simpler with this new procedure, but we do have to pay a cost in terms of memory consumption: we now need to maintain a table decode_state which has $2^{\text{state\_size}}$ entries. This is clearly more memory than a typical prefix-free tree. In a way we are doing some caching, and for caching to work properly the table sizes need to be small: the bigger the lookup table, the slower the table lookup becomes. So, practically the maximum leaf depth needs to be kept in check.
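
To make the construction concrete, here is a minimal sketch (again with bits as strings, and reusing the string-valued encoding_table from above as an example) of how such a lookup table can be built from any prefix-free code: every state that starts with a symbol's codeword decodes to that symbol.

## building the lookup table from a prefix-free code (sketch)
encoding_table = {"A": "0", "B": "10", "C": "110", "D": "111"}
state_size = max(len(c) for c in encoding_table.values())  # max depth of the tree

decode_state, codelen_table = {}, {}
for symbol, codeword in encoding_table.items():
    codelen_table[symbol] = len(codeword)
    num_free_bits = state_size - len(codeword)
    # enumerate all possible suffixes; each resulting state maps back to this symbol
    for i in range(2 ** num_free_bits):
        suffix = format(i, "b").zfill(num_free_bits) if num_free_bits > 0 else ""
        decode_state[codeword + suffix] = symbol

print(decode_state)  # {'000': 'A', '001': 'A', ..., '110': 'C', '111': 'D'}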

Thus, in practice we want a Huffman tree with a constraint on the maximum leaf depth. Typically, the maximum depth is set to 16 or 24 so that it is a multiple of a byte. For more details on efficient Huffman coding, we encourage the reader to have a look at this blog entry by Fabian Giesen: https://fgiesen.wordpress.com/2021/08/30/entropy-coding-in-oodle-data-huffman-coding/.

Asymptotic Equipartition Property

We have been looking at the problem of data compression: algorithms for it as well as fundamental limits on the compression rate. In this chapter we approach the problem of data compression in a very different way. The basic idea of the asymptotic equipartition property (AEP) is to consider the space of all possible sequences produced by a stochastic (random) source, and to focus attention on the "typical" ones. The theory is asymptotic in the sense that many of the results focus on the regime of large source sequences. This approach allows us to recover many of the results for compression, but in a non-constructive manner. Beyond compression, the same concept has allowed researchers to obtain powerful achievability and impossibility results even before a practical algorithm was developed. Our presentation here focuses on the main concepts and the mathematical intuition behind the results. We refer the reader to Chapter 3 in Cover and Thomas and to the Information Theory lecture notes (available here) for further details and proofs.

Before we get started, we define the notation used throughout this chapter:

  1. Alphabet: $\mathcal{X}$ specifies the possible values that the random variable of interest can take.
  2. iid source: An independent and identically distributed (iid) source produces a sequence of random variables that are independent of each other and have the same distribution, e.g. a sequence of tosses of a coin. We use the notation $X_1, X_2, \ldots \sim P$ for iid random variables from distribution $P$.
  3. Source sequence: $X^n = (X_1, X_2, \ldots, X_n)$ denotes the $n$-tuple representing $n$ source symbols, and $\mathcal{X}^n$ represents the set of all possible $n$-tuples. Note that we use lowercase $x^n$ for a particular realization of $X^n$.
  4. Probability: Under the iid source model, we simply have $P(x^n) = \prod_{i=1}^{n} P(x_i)$, where we slightly abuse the notation by using $P$ to represent both the probability function of the $n$-tuple and that of a single symbol.

As an example, we could have alphabet $\mathcal{X} = \{A, B\}$ and a source distributed iid according to some distribution $P$. Then the probability of a particular source sequence is just the product of the probabilities of its symbols, e.g. $P(ABAA) = P(A)^3 P(B)$.

Before we start the discussion of asymptotic equipartition property and typical sequences, let us recall the (weak) law of large numbers (LLN) that says the empirical average of a sequence of random variables converges towards their expectation (to understand the different types of convergence of random variables and the strong LLN, please refer to a probability textbook).

Theorem-1: Weak Law of Large Numbers

For iid random variables $X_1, X_2, \ldots \sim P$ with alphabet $\mathcal{X}$, we have the following for any $\epsilon > 0$:

$$\lim_{n \to \infty} P\left( \left| \frac{1}{n} \sum_{i=1}^{n} X_i - \mathbb{E}[X] \right| \leq \epsilon \right) = 1$$

That is, for arbitrarily small $\epsilon$, as $n$ becomes large, the probability that the empirical average is within $\epsilon$ of the expectation goes to $1$.
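
As a quick illustration, here is a small simulation sketch (using numpy, with an arbitrarily chosen Bernoulli(0.2) source) showing the empirical average concentrating around the expectation as $n$ grows:

# simulating the weak LLN for an iid Bernoulli(0.2) source
import numpy as np

rng = np.random.default_rng(seed=0)
for n in [10, 100, 10_000, 1_000_000]:
    samples = (rng.random(n) < 0.2).astype(int)  # iid Bernoulli(0.2) samples
    print(n, samples.mean())  # empirical average -> 0.2 as n grows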

With this background, we are ready to jump in!

The $\epsilon$-typical set

Definition-1

For some $\epsilon > 0$, the source sequence $x^n$ is $\epsilon$-typical if

$$2^{-n(H(X)+\epsilon)} \leq P(x^n) \leq 2^{-n(H(X)-\epsilon)}$$

Recall that $H(X)$ is the entropy of the random variable $X$. It is easy to see that the condition is equivalent to saying that $\left|\frac{1}{n} \log_2 \frac{1}{P(x^n)} - H(X)\right| \leq \epsilon$. The set of all **$\epsilon$-typical** sequences is called the **$\epsilon$-typical set**, denoted $A_\epsilon^{(n)}$. Put in words, the **$\epsilon$-typical set** contains all $n$-length sequences whose probability is close to $2^{-nH(X)}$.

Next, we look at some properties of the typical set:

Theorem-2: Properties of typical sets

For any $\epsilon > 0$,

  1. For large enough $n$, $P\left(X^n \in A_\epsilon^{(n)}\right) \geq 1 - \epsilon$.
  2. $\left|A_\epsilon^{(n)}\right| \leq 2^{n(H(X)+\epsilon)}$.
  3. For large enough $n$, $\left|A_\epsilon^{(n)}\right| \geq (1-\epsilon)\, 2^{n(H(X)-\epsilon)}$.

Simply put, this is saying that for large $n$, the typical set has probability close to $1$ and size roughly $2^{nH(X)}$.

Proof sketch for Theorem-2

  1. This follows directly from the weak LLN by noting the definition of typical sequences and the following facts: (i) $\log_2 \frac{1}{P(X^n)} = \sum_{i=1}^{n} \log_2 \frac{1}{P(X_i)}$ (since the source is iid). (ii) $\mathbb{E}\left[\log_2 \frac{1}{P(X)}\right] = H(X)$ from the definition of entropy. Thus, applying the LLN to the random variables $\log_2 \frac{1}{P(X_i)}$ instead of to $X_i$ directly gives the desired result.

Both 2 & 3 follow from the definition of typical sequences (which roughly says the probability of each typical sequence is close to ) and the fact that the typical set has probability close to (less than just because total probability is atmost and close to due to property 1).

Intuition into the typical set

To gain intuition into the typical set and its properties, consider the set $\mathcal{X}^n$ of all $n$-length sequences, which has size $|\mathcal{X}|^n = 2^{n \log_2 |\mathcal{X}|}$. The AEP says that (for $n$ sufficiently large) essentially all the probability mass is concentrated in an exponentially smaller subset of size roughly $2^{nH(X)}$, i.e. a fraction roughly $2^{-n(\log_2 |\mathcal{X}| - H(X))}$ of all sequences (whenever $H(X) < \log_2 |\mathcal{X}|$).

Furthermore, all the elements in the typical set are roughly equiprobable, each with probability close to $2^{-nH(X)}$. Thus, $\mathcal{X}^n$ contains within itself a much smaller subset $A_\epsilon^{(n)}$ that contains almost the entire probability mass, and within this subset the elements are roughly uniformly distributed. It is important to note that these properties hold only for large enough $n$, since ultimately they are derived from the law of large numbers. The intuition into typical sets is illustrated in the figure below.

Typical set
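
As a quick sanity check of this picture, here is a small simulation sketch (with an arbitrarily chosen Bernoulli(0.2) source and numpy): for long sequences, the normalized negative log-probability of a random sequence concentrates around $H(X) \approx 0.722$ bits, i.e. randomly drawn sequences are $\epsilon$-typical with high probability.

# empirically checking typicality for an iid Bernoulli(0.2) source
import numpy as np

p, n, eps, num_trials = 0.2, 10_000, 0.02, 1000
H = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))  # entropy, ~0.722 bits
rng = np.random.default_rng(seed=0)

num_typical = 0
for _ in range(num_trials):
    k = (rng.random(n) < p).sum()  # number of ones in the sequence
    neg_log_prob = -(k * np.log2(p) + (n - k) * np.log2(1 - p))
    if abs(neg_log_prob / n - H) <= eps:
        num_typical += 1
print(f"H(X) = {H:.3f} bits, fraction of eps-typical sequences: {num_typical / num_trials}")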

Quiz-1: Intuition into the typical set

What do the elements of the typical set look like?

  1. Consider a binary source with $P(0) = P(1) = 0.5$. What is the size of the typical set (Hint: the answer doesn't depend on $\epsilon$!)? Which elements of $\{0,1\}^n$ are typical? After looking at this example, can you still claim the typical set is exponentially smaller than the set of all sequences in all cases?
  2. Now consider a biased binary source, say with $P(1) > 0.5$. Recalling the fact that typical elements have probability around $2^{-nH(X)}$, what can you say about the elements of the typical set? (Hint: what fraction of zeros and ones would you expect a typical sequence to have? Check if your intuition matches with the expression.)
  3. In part 2, you can easily verify that the most probable sequence is the sequence consisting of all $1$s. Is that sequence typical (for small $\epsilon$)? If not, how do you square that off with the fact that the typical set contains almost all of the probability mass?

Before we discuss the ramifications of the above for compressibility of sequences, we introduce one further property of subsets of $\mathcal{X}^n$. This property says that any set substantially smaller than the typical set has negligible probability mass. Intuitively this holds because all elements in the typical set have roughly the same probability, and hence removing a large fraction of them leaves very little probability mass behind. In other words, very roughly, the property says that the typical set is the smallest set containing most of the probability. This property will be very useful below when we link typical sets to compression. Here we just state the theorem and leave the proof to the references.

Theorem-3: Sets exponentially smaller than the typical set

Fix $\delta > 0$ and consider any sequence of sets $B^{(n)} \subseteq \mathcal{X}^n$ with $|B^{(n)}| \leq 2^{n(H(X) - \delta)}$. Then

$$P\left(X^n \in B^{(n)}\right) \to 0 \text{ as } n \to \infty$$

Compression based on typical sets

Suppose you were tasked with developing a compressor for a source with $M$ possible values, each of them equally likely to occur. It is easy to verify that a simple fixed-length code that encodes each of the values with $\lceil \log_2 M \rceil$ bits is optimal for this source. Now think about the AEP property above: as $n$ grows, almost all the probability in the set of $n$-length sequences over the alphabet $\mathcal{X}$ is contained in the typical set, which has roughly $2^{nH(X)}$ elements (where $H(X) \leq \log_2 |\mathcal{X}|$). And the elements within the typical set are (roughly) equally likely to occur. Ignoring the non-typical elements for a moment, we can encode the typical elements with roughly $nH(X)$ bits using the simple fixed-length logic mentioned earlier. We have encoded $n$ input symbols with roughly $nH(X)$ bits, effectively using $H(X)$ bits/symbol! This was not truly lossless because we fail to represent the non-typical sequences. But it can be considered near-lossless, since the non-typical sequences have probability close to $0$, and hence this code is lossless for a given input with very high probability. Note that we ignored the $\epsilon$'s and $\delta$'s in the treatment above, but that shouldn't take away from the main conclusions, which become truer and truer as $n$ grows.

On the flip side, suppose we wanted to encode the elements in $\mathcal{X}^n$ with $nR$ bits for some $R < H(X)$. Now, $nR$ bits can represent at most $2^{nR}$ elements. But according to Theorem-3 above, the set of elements correctly represented by such a code has negligible probability as $n$ grows. This means a fixed-length code using fewer than $H(X)$ bits per symbol fails to losslessly represent the input sequence with very high probability. Thus, using the AEP we can prove that any fixed-length near-lossless compression scheme must use at least $H(X)$ bits per symbol.

Lossless compression scheme

Let us now develop a lossless compression algorithm based on the AEP, this time being very precise. As before, we focus on encoding sequences of length $n$. Note that a lossless compression algorithm aiming to achieve the entropy must be variable-length (unless the source itself is uniformly distributed). And the AEP teaches us that elements in the typical set should ideally be represented using roughly $nH(X)$ bits. With this in mind, consider the following scheme:

Lossless compression using typical sets

Fix $\epsilon > 0$, and assign an index $i(x^n)$, ranging from $1$ to $|A_\epsilon^{(n)}|$, to the elements in $A_\epsilon^{(n)}$ (the order doesn't matter). In addition, define a fixed-length code for $\mathcal{X}^n$ that uses $\lceil n \log_2 |\mathcal{X}| \rceil$ bits to encode any input sequence. Now the encoding of $x^n$ is simply:

  • if $x^n \in A_\epsilon^{(n)}$, encode it as $0$ followed by the index $i(x^n)$ expressed in $\lceil n(H(X)+\epsilon) \rceil$ bits
  • else, encode it as $1$ followed by the $\lceil n \log_2 |\mathcal{X}| \rceil$-bit fixed-length encoding of $x^n$

Let's calculate the expected code length (in bits per symbol) used by the above scheme. For $n$ large enough, we can safely assume that $P\left(X^n \in A_\epsilon^{(n)}\right) \geq 1 - \epsilon$ by the AEP. Furthermore, we know that the index $i(x^n)$ needs at most $\lceil n(H(X)+\epsilon) \rceil$ bits to represent (Theorem-2, part 2). Thus, denoting the code length by $L(X^n)$, we have

$$\mathbb{E}[L(X^n)] \leq P\left(X^n \in A_\epsilon^{(n)}\right)\left(\lceil n(H(X)+\epsilon) \rceil + 1\right) + P\left(X^n \notin A_\epsilon^{(n)}\right)\left(\lceil n \log_2 |\mathcal{X}| \rceil + 1\right)$$

where the first term corresponds to the typical sequences and the second term to everything else (note that we use the simplest possible encoding for non-typical sequences since they don't really matter in terms of their probability). Also note that we add $1$ to each of the lengths to account for the $0$ or $1$ we prefix in the scheme above. Simplifying the above, we have

$$\frac{1}{n}\mathbb{E}[L(X^n)] \leq H(X) + \epsilon'$$

where $\epsilon'$ represents terms bounded by $c\left(\epsilon + \frac{1}{n}\right)$ for some constant $c$ when $\epsilon$ is small. Thus we can achieve code lengths arbitrarily close to $H(X)$ bits/symbol by selecting $\epsilon$ and $n$ appropriately!

Quiz-2: Lossless compression based on typical sets

  1. Describe the decoding algorithm for the above scheme and verify that the scheme is indeed lossless.
  2. What is the complexity of the encoding and decoding procedures as a function of $n$? Consider both the time and the memory usage. Is this a practical scheme for large values of $n$?

Arithmetic coding

Recap -> Issues with symbol codes:

Let's consider the simple distribution below and think about how we would design a prefix-free code for it:

P = {A: 0.1, B: 0.9}
H(P) = 0.47

It is clear that as there are just two symbols, the only reasonable prefix-free code design here is:

A -> 0, B -> 1
Huffman code (or any other prefix-free code) = 1 bit

So the average codelength of the Huffman code for this data is 1 bit/symbol, which is quite far from the entropy of the distribution (~0.47 bits). Ideally, for a symbol $s$ we want to use roughly $\log_2 \frac{1}{P(s)}$ bits (to achieve the entropy $H(P)$). But, as we are using a symbol code, we can't use fractional bits. Thus, there is always going to be an overhead of up to ~1 bit per symbol with symbol codes, i.e. the expected codelength with Huffman codes can be as large as $H(P) + 1$.
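
As a quick sanity check, here is a two-line computation (plain Python, not from SCL) of the entropy of this distribution:

# entropy of the skewed distribution vs the 1 bit/symbol Huffman codelength
import math

P = {"A": 0.1, "B": 0.9}
H = sum(p * math.log2(1 / p) for p in P.values())
print(f"H(P) = {H:.3f} bits/symbol")  # ~0.469, vs 1 bit/symbol for any symbol code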

One solution to this problem is to consider block codes, i.e. we consider tuples of symbols from the alphabet as a new "symbol" and construct a Huffman code on those. For example, we can consider blocks of size 2: P = {AA: 0.01, AB: 0.09, BA: 0.09, BB: 0.81}. Here is the Huffman tree:

block_size: 1, entropy: 0.47, avg_codelen: 1.00 bits/symbol

  |--A
·-|  
  |--B
block_size: 2, entropy: 0.47, avg_codelen: 0.65 bits/symbol

       |--BA
  |--·-|  
  |    |    |--AA
  |    |--·-|  
  |         |--AB
·-|  
  |--BB

We see that BB has probability 0.81, so it receives the shortest codeword of 1 bit. The average codelength in this case is 0.65 bits/symbol, which is a definite improvement over 1 bit/symbol. The idea here is that we are still using a fixed codeword per "symbol", but as the symbols are now tuples of size 2, the up-to-1-bit overhead gets amortized over 2 symbols of the original alphabet.
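
For concreteness, here is a minimal sketch (plain Python; a Huffman code would then be built on this expanded alphabet, e.g. with the tree construction from the previous chapter) of how the block distribution is formed:

# forming the block distribution for block-based Huffman coding (sketch)
import itertools
import math

P = {"A": 0.1, "B": 0.9}
block_size = 2

P_block = {}
for block in itertools.product(P.keys(), repeat=block_size):
    P_block["".join(block)] = math.prod(P[s] for s in block)

print(P_block)  # {'AA': 0.010..., 'AB': 0.09, 'BA': 0.09, 'BB': 0.81}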

We can continue doing this by considering larger blocks of data. For example, with blocks of size $k$, the new alphabet size is $|\mathcal{X}|^k$, and the per-symbol overhead due to Huffman coding should be ~$1/k$ bits (or lower). Here is an example of applying Huffman coding on blocks of size 3, 4, 5:

block_size: 3, entropy: 0.47, avg_codelen: 0.53 bits/symbol

                      |--AAA
                 |--·-|  
                 |    |--BAA
            |--·-|  
            |    |    |--ABA
            |    |--·-|  
            |         |--AAB
       |--·-|  
       |    |--ABB
  |--·-|  
  |    |    |--BBA
  |    |--·-|  
  |         |--BAB
·-|  
  |--BBB

## Huffman code for blocks
block_size: 1, entropy: 0.47, avg_codelen: 1.00 bits/symbol
block_size: 2, entropy: 0.47, avg_codelen: 0.65 bits/symbol
block_size: 3, entropy: 0.47, avg_codelen: 0.53 bits/symbol
block_size: 4, entropy: 0.47, avg_codelen: 0.49 bits/symbol
block_size: 5, entropy: 0.47, avg_codelen: 0.48 bits/symbol

We see that with block size 4 we are already quite close to the entropy of 0.47 bits/symbol. So this looks quite promising for reaching the entropy limit. In general, the convergence of Huffman codes on blocks goes as:

  1. Huffman codes: $H(X) \leq \text{avg codelen} \leq H(X) + 1$ bits/symbol

  2. Huffman codes on blocks of size $B$: $H(X) \leq \text{avg codelen} \leq H(X) + \frac{1}{B}$ bits/symbol

However there is a practical issue. Do you see it? As a hint (and also because it looks cool!) here is the Huffman tree for block size 5:

block_size: 5, entropy: 0.47, avg_codelen: 0.48 bits/symbol

            |--BABBB
       |--·-|  
       |    |              |--BBABA
       |    |         |--·-|  
       |    |         |    |--ABBBA
       |    |    |--·-|  
       |    |    |    |    |--BABBA
       |    |    |    |--·-|  
       |    |    |         |--AABBB
       |    |--·-|  
       |         |                        |--BBAAA
       |         |                   |--·-|  
       |         |                   |    |--AABAB
       |         |              |--·-|  
       |         |              |    |    |--AAABB
       |         |              |    |--·-|  
       |         |              |         |--ABBAA
       |         |         |--·-|  
       |         |         |    |         |--AABBA
       |         |         |    |    |--·-|  
       |         |         |    |    |    |              |--AAAAB
       |         |         |    |    |    |         |--·-|  
       |         |         |    |    |    |         |    |--AABAA
       |         |         |    |    |    |    |--·-|  
       |         |         |    |    |    |    |    |         |--AAAAA
       |         |         |    |    |    |    |    |    |--·-|  
       |         |         |    |    |    |    |    |    |    |--BAAAA
       |         |         |    |    |    |    |    |--·-|  
       |         |         |    |    |    |    |         |    |--ABAAA
       |         |         |    |    |    |    |         |--·-|  
       |         |         |    |    |    |    |              |--AAABA
       |         |         |    |    |    |--·-|  
       |         |         |    |    |         |--BAAAB
       |         |         |    |--·-|  
       |         |         |         |         |--BABAA
       |         |         |         |    |--·-|  
       |         |         |         |    |    |--ABABA
       |         |         |         |--·-|  
       |         |         |              |    |--BAABA
       |         |         |              |--·-|  
       |         |         |                   |--ABAAB
       |         |    |--·-|  
       |         |    |    |    |--ABBAB
       |         |    |    |--·-|  
       |         |    |         |--BABAB
       |         |--·-|  
       |              |         |--BBAAB
       |              |    |--·-|  
       |              |    |    |--BAABB
       |              |--·-|  
       |                   |    |--ABABB
       |                   |--·-|  
       |                        |--BBBAA
  |--·-|  
  |    |         |--BBBAB
  |    |    |--·-|  
  |    |    |    |--BBBBA
  |    |--·-|  
  |         |    |--BBABB
  |         |--·-|  
  |              |--ABBBB
·-|  
  |--BBBBB

The issue is that, as we increase the block size $B$, our codebook size increases exponentially as $|\mathcal{X}|^B$. The larger the codebook, the more complicated the encoding/decoding becomes, the more memory we need, the higher the latency, etc. For example, if we look at the tree above, the codebook size is $2^5 = 32$, which is quite manageable. However, this wouldn't have been the case if our alphabet size were larger, e.g. 256 for byte-valued data instead of 2: a block code of size 5 would then have a codebook of size $256^5 \approx 10^{12}$, which is definitely unmanageable.

Thus, even though with block size $B$ we obtain average compression within ~$1/B$ bits/symbol of the entropy $H(X)$, the idea doesn't hold ground practically. That is, in fact, the problem which Arithmetic coding solves:

Arithmetic coding: Introduction

Arithmetic coding solves the problems we discussed with the block-based Huffman coder:

  1. Entire data as a single block: Arithmetic coding encodes the entire data as a single block: for data $x_1, x_2, \ldots, x_n$, the block size is $n$, i.e. the entire input is a single block!

  2. Codewords are computed on the fly: As the block for arithmetic coding is the entire data, the codebook size would have been massive ($|\mathcal{X}|^n$). Instead, the codeword is computed on the fly; there is no need to pre-compute the codebook beforehand.

  3. Compression efficiency: Arithmetic coding is near-optimal in terms of compression; theoretically the codelength can be shown to satisfy
     $$L(x^n) \leq \log_2 \frac{1}{P(x^n)} + 2$$
     i.e. ~2 bits of overhead for the entire sequence.

Thus, we see that Arithmetic coding solves our problem of achieving average compression close to the entropy $H(X)$, while at the same time being practical.

How does Arithmetic coding work?

Primer: the number line (in binary)

Before we get to Arithmetic coding, let's take a brief detour to revisit the number line, as that will be helpful for the later discussion.

We are familiar with the unit interval [0,1) in decimal floating point. A decimal floating point number can also be represented in the binary alphabet.

(figure: the unit interval [0,1) with points marked in binary)

For example:

  1. 0.5 = 2^{-1} = b0.1
  2. 0.75 = 2^{-1} + 2^{-2} = b0.11

In fact this is similar to how computers represent floating point values. Here are some more examples below:

# floating point values in binary
from utils.bitarray_utils import float_to_bitarrays

_, binary_exp_one_third = float_to_bitarrays(0.3333333, 20)
_, binary_exp_two_third = float_to_bitarrays(0.6666666, 20)

# 0.3333.. = b0.010101...
# 0.6666.. = b0.101010...

As you can see, like the decimal case, the binary floating point representations need not be finite (or even repetitive). Getting familiar with thinking about floating point values in binary will be useful to us for Arithmetic coding discussion.

Arithmetic Encoding (theoretical)

Okay, we are all set to discuss Arithmetic encoding. Let's work with a specific running example, as that will make things much easier:

We want to encode the sequence BACB, sampled from the distribution P given below.

P = ProbabilityDist({A: 0.3, B: 0.5, C: 0.2})
x_input = BACB

The Arithmetic encoding works in the following two steps:

  1. STEP I: Find an interval (or a range) [L,H), corresponding to the entire sequence
        x_input -> |------------------[.....)------------|
                   0               low     high          1 
  2. STEP II: Communicate the interval [L,H) efficiently (i.e. using as few bits as possible)
         x_input -> [L,H) -> 011010

STEP-I: finding the interval [L,H)

Let's see how STEP I (finding the interval [L,H)) works. We start with [L,H) = [0,1), and then subdivide the interval as we see each symbol.

  1. x_input[0] = B

(figure: the interval [0,1) shrinks to [0.3,0.8) after processing the first symbol B)

After we process the symbol B, our interval shrinks from [0,1) to [0.3,0.8). Can you guess the shrinking rule?

The main idea is to find the cumulative probability values for the symbols, and then for symbol with index i, assign it the interval [L,H) = [C(i), C(i+1))

alphabet = [A,B,C]
prob_array = [0.3, 0.5, 0.2]
cumul_array = [0.0, 0.3, 0.8, 1.0]

(figure: [0,1) subdivided into the intervals [0.0,0.3), [0.3,0.8) and [0.8,1.0) for A, B, C respectively)

Note that the size of the interval assigned to a symbol is equal to its probability (i.e. C(i+1) - C(i) = P(i)). We will see that a similar relationship continues to hold as we encode more symbols.

  2. x_input[1] = A: Let's now see how the interval updates as we encode the second symbol. As you can see from the image below, we continue subdividing the interval [L,H) based on the cumulative probability of the next symbol.

(figure: the interval [0.3,0.8) is subdivided further and shrinks to [0.3,0.45) after encoding A)

Notice that the new interval [L', H') can be described as:

j -> index of the next symbol
P(j), C(j) -> prob and cumulative_prob of j

# update rule
L' = L + C(j)*[H-L]
H' = L' + P(j)*[H-L]

We can continue this way for the entire input.

(figure: the interval after encoding each successive symbol of BACB)

prob = ProbabilityDist({A: 0.3, B: 0.5, C: 0.2})
x_input = BACB

# find interval corresp to BACB
ENCODE: B -> [L,H) = [0.30000,0.80000)
ENCODE: A -> [L,H) = [0.30000,0.45000)
ENCODE: C -> [L,H) = [0.42000,0.45000)
ENCODE: B -> [L,H) = [0.42900,0.44400)

Thus, the final interval is: x_input -> [0.429, 0.444)

For completeness, here is pseudo-code for STEP I of Arithmetic coding. Note that in the literature, interval and range are used interchangeably to describe [L,H).

class ArithmeticEncoder:
    ...

    def shrink_range(self, L, H, s):
        rng = H - L
        new_L = L + (rng * self.P.cumul[s])  # C(s): cumulative probability of s
        new_H = new_L + (rng * self.P.probs[s])  # P(s): probability of s
        return new_L, new_H

    def find_interval(self, x_input):
        L,H = 0.0, 1.0
        for s in x_input:
            L,H = self.shrink_range(L,H,s)
        return L,H

    def encode_block(self, x_input):
        # STEP1
        L,H = self.find_interval(x_input)

        # STEP-II
        ...

Observation: Notice that each time we encode a symbol $s$, we shrink the interval size by a factor of $P(s)$. For example, we started with the interval [L,H) = [0,1) of size 1; after encoding B it shrank to [0.3,0.8), which has size 0.5 = P(B).

QUIZ-1: What is the size of the interval (H-L) after encoding the entire input $x_1, x_2, \ldots, x_n$?

Ans: It is easy to see that the interval size is equal to the probability of the input parsed so far. For example, after the first symbol B the interval size is 0.8 - 0.3 = 0.5 = P(B); after the next symbol it is 0.45 - 0.3 = 0.15 = P(B)*P(A) = P(BA). This generalizes to the entire sequence, i.e. the final interval size is $P(x_1 x_2 \ldots x_n)$.
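
In equation form (the numbers are from the encoding log of the running example above):

$$H - L = \prod_{i=1}^{n} P(x_i), \qquad \text{e.g. } 0.444 - 0.429 = 0.015 = P(B)\,P(A)\,P(C)\,P(B) = 0.5 \times 0.3 \times 0.2 \times 0.5$$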

STEP-II communicating the interval [L,H)

Until this point, we discussed STEP I of Arithmetic encoding: given the input x_input, find an interval [L,H) corresponding to it.

P = {A: 0.3, B: 0.5, C: 0.2}
x_input = BACB
[L,H) = [0.429, 0.444)

STEP II is logically quite simple. We want to communicate the interval [L,H). We do this by communicating a single value $Z \in [L,H)$.

For example: $Z = \frac{L+H}{2}$, i.e. the midpoint of the range/interval (in our example, Z = 0.4365).

Quiz-2: If the decoder knows:

  • n=4
  • P = {A: 0.3, B: 0.5, C: 0.2}
  • Z = 0.4365

How can it decode the entire input sequence?

Arithmetic decoding (theoretical)

Let's try to answer the quiz question above. Let's start by plotting out what we know: the value of Z = 0.4365, the midpoint of the final interval.

How can we use Z to decode the entire input x_input? Let's start by asking a simpler question: how can we decode the first symbol x_input[0] of the input?

  1. Decoding x_input[0]

The answer is quite simple:

alphabet = [A,B,C]
prob_array = [0.3, 0.5, 0.2]
cumul_array = [0.0, 0.3, 0.8, 1.0]

As the decoder knows the probability distribution, and hence the cumulative distribution, it can form bins using the cumulative distribution values:

# cumulative probability bins:
bin_0 -> [C(0), C(1)) = [0.0, 0.3) #A
bin_1 -> [C(1), C(2)) = [0.3, 0.8) #B
bin_2 -> [C(2), C(3)) = [0.8, 1.0) #C

Since during encoding each successive interval (after encoding a symbol) is a subset of the previous one, we can decode the first symbol by checking which bin Z lies in.


In our case Z = 0.4365 lies in the bin corresponding to symbol B, so the first input symbol is B.

  2. Decoding x_input[1]: Having decoded the first symbol B, let's now see how we can decode the second symbol.

The logic here is similar: recall that during encoding we divided the current interval further into sub-intervals, and then picked one of them based on the next symbol. In this case the sub-intervals are:

# sub-intervals of [0.3, 0.8)
bin_0 -> = [0.3, 0.45) #BA
bin_1 -> = [0.45, 0.7) #BB
bin_2 -> = [0.7, 0.8) #BC

We can again find which bin Z belongs to, and that gives us the next symbol. In our case, the next symbol is A.


We can continue this recursive process until we have decoded n=4 symbols.

  1. Notice that the intervals are in exact sync during the encoding and decoding, and that is the key to the lossless-ness of the decoding algorithm.

  2. Also note that we do need to communicate the number of encoded symbols (n=4 in our example), as the decoder could potentially keep decoding an infinite sequence from a single Z. So we need to tell it when to stop the recursive process.

Z = 0.4365
ENCODE: B -> [L,H) = [0.30000,0.80000)
ENCODE: A -> [L,H) = [0.30000,0.45000)
ENCODE: C -> [L,H) = [0.42000,0.45000)
ENCODE: B -> [L,H) = [0.42900,0.44400)
------------------------------
DECODE: B -> [L,H) = [0.30000,0.80000)
DECODE: A -> [L,H) = [0.30000,0.45000)
DECODE: C -> [L,H) = [0.42000,0.45000)
DECODE: B -> [L,H) = [0.42900,0.44400)

For completeness, the decoding algorithm, given Z and n, is described below.

import numpy as np

class ArithmeticDecoder:
    ...
    def shrink_range(self, L, H, s):
        # same as the encoder
        ...
        return new_L, new_H

    def decode_symbol(self, L, H, Z):
        rng = H - L
        # bin edges L + C(0)*rng, L + C(1)*rng, ... (self.P.cumul is a numpy array)
        search_list = L + (self.P.cumul * rng)
        # find the bin in which Z lies (note the -1: searchsorted returns the insertion index)
        symbol_ind = np.searchsorted(search_list, Z, side="right") - 1
        return self.P.alphabet[symbol_ind]

    def decode_block(self, Z, n):
        L, H = 0.0, 1.0
        decoded_symbols = []
        for _ in range(n):  # main decoding loop
            s = self.decode_symbol(L, H, Z)
            L, H = self.shrink_range(L, H, s)
            decoded_symbols.append(s)
        return decoded_symbols

Communicating the value Z

To recap, the Arithmetic encoder we have does the following:

  1. STEP-I: Find an interval (or a range) [L,H) corresponding to the entire sequence: [0.429, 0.444)

  2. STEP-II: Find the midpoint of the interval, $Z = \frac{L+H}{2}$ (Z = 0.4365), and communicate Z to the decoder.

One simple way of communicating Z is to write its binary expansion to the bitstream, e.g. Z = 0.4365 = b0.01101111101..., so the final encoded_bitstream = 01101111101..., which we then write to a file.

However, there is one problem. Quiz-4: Although our method of writing the binary expansion of Z to a file is cute, it might not give us any compression, as Z's binary representation can be very long (even infinitely long). How can we fix this?

The solution to the problem is quite simple: instead of communicating the entire binary expansion of Z, we truncate the expansion to $k$ bits and communicate this truncated binary string. Let's call the floating point value corresponding to the truncated binary string $\hat{Z}$.

Note that we have to be careful about how we choose $k$: if $k$ is too small, $\hat{Z}$ might fall outside the interval [L,H), which will be a problem for decoding. On the other hand, choosing $k$ too large will hurt our compression performance.


Thus, our full Arithmetic encoder can be described below:

Arithmetic Encoder

  1. STEP-I: Find an interval (or a range) [L,H) corresponding to the entire sequence: [0.429, 0.444)

  2. STEP-II: Find the midpoint of the interval, $Z = \frac{L+H}{2}$ (Z = 0.4365)

  3. STEP-III: Truncate $Z$ to $k$ bits to obtain $\hat{Z}$, e.g.:

L,H = 0.429, 0.444
Z = 0.4365 = b0.01101111101...
Z_hat = b0.011011111 ~ 0.4355

Final Encoding = encoded_bitstream = 011011111
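
Here is a minimal sketch (plain Python, with bits as strings rather than the SCL bitarray utilities) of this truncation step and of recovering $\hat{Z}$ from the truncated bits:

# truncating Z to k bits and recovering Z_hat (sketch)
def float_to_binary_string(z, k):
    # first k bits of the binary expansion of z in [0,1)
    bits = []
    for _ in range(k):
        z *= 2
        bit = int(z)
        bits.append(str(bit))
        z -= bit
    return "".join(bits)

def binary_string_to_float(bits):
    return sum(int(b) * 2 ** (-(i + 1)) for i, b in enumerate(bits))

Z = 0.4365
bits = float_to_binary_string(Z, 9)   # '011011111'
Z_hat = binary_string_to_float(bits)  # 0.435546875, still inside [0.429, 0.444)
print(bits, Z_hat)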

Determining how many bits to truncate to

Okay, we are almost done finalizing the Arithmetic encoder. The only missing piece of the jigsaw puzzle is: how do we determine $k$, the number of bits to truncate $Z$ to?

There are in fact two conditions we need to satisfy when choosing $k$:

  1. Cond 1: Truncate $Z$ to $\hat{Z}$ with $k$ bits, so that $\hat{Z} \in [L, H)$.

  2. Cond 2: If $\hat{Z}$ has binary representation Z_hat = b0.011011111 for example, then we also need any extension of it (i.e. any bitstream having Z_hat's bits as a prefix) to correspond to a value inside [L,H).

For eg:

Z_hat = b0.011011111
Z_ext = b0.01101111111011110101..

Cond 1 is somewhat obvious: we want $\hat{Z}$, obtained by truncating $Z$ to $k$ bits, to still lie inside the interval [L,H). Let's now see why Cond 2 is imposed. Say Arithmetic coding leads to Z_hat = b0.011011111, and thus the final encoded bitarray is 011011111. Now, in practice the arithmetic-coded bitstream is always going to be followed by some other bitstream, so the decoder is going to see a stream like 01101111111011110101...; the second condition ensures that even if the decoder reads in some bits that are not part of the arithmetic encoding, the resulting floating point value will still lie inside the interval [L,H).

With some simple arithmetic, it can be shown that the two conditions together can be written as: