Designing prefix-free codes

As a recap, in the previous lecture we discussed how to design a prefix-free code given the code-lengths, using a simple procedure which constructs a prefix-free tree from those code-lengths.

We also saw a simple thumb rule $l_{\text{optimal}}(\text{symbol}) \approx \log_2 \frac{1}{p(\text{symbol})}$ which tells us what the code-lengths of the prefix-free code should be. In this lecture we are going to discuss two things:

Justify the correctness of our prefix-free tree construction with lengths $l(\text{symbol}) = \lceil \log_2 \frac{1}{p(\text{symbol})} \rceil$
Look at a few more prefix-free code constructions

Kraft Inequality & converse

With the goal of proving the correctness of the prefix-free tree construction, we will first look at a simple but fundamental property of binary trees, called the Kraft-McMillan inequality (or simply the Kraft inequality).

Theorem-1: Kraft Inequality

Consider a binary tree, where the leaf nodes $n_{1}, n_{2}, \dots, n_{k}$ are at depths $l_{1}, l_{2}, \dots, l_{k}$ from the root node respectively.

Then the node depths $l_{1}, l_{2}, \dots, l_{k}$ satisfy the inequality:

$\sum_{i=1}^{k} 2^{-l_i} \leq 1$

The inequality is quite elegant, and so is its proof. Any thoughts on how the proof might proceed? Here is a hint:

Hint: Let $l_{\max} = \max_{i=1}^{k} l_i$. Then, the Kraft inequality can be written as:

$\sum_{i=1}^{k} 2^{l_{\max} - l_i} \leq 2^{l_{\max}}$

All we have done here is multiply both sides by $2^{l_{\max}}$, but this simple transformation makes the inequality easier to interpret. Can you see the proof now? Here is a proof sketch:

Let's try to interpret the RHS: $2^{l_{\max}}$ is the number of nodes of a complete binary tree at depth $l_{\max}$.
The LHS also has a natural interpretation: a leaf node at depth $l_i$ can be thought of as corresponding to the $2^{l_{\max} - l_i}$ nodes at depth $l_{\max}$ that would lie below it if the tree were extended.

graph TD
  *(Root) -->|0| A:::endnode
  A -.-> n6(.):::fake
  A -.-> n7(.):::fake
  n6-.-> n8(.):::fake
  n6-.-> n9(.):::fake
  n7-.-> m1(.):::fake
  n7-.-> m2(.):::fake
  *(Root) -->|1| n1(.)
  n1 -->|10| B:::endnode
  B -.-> m3(.):::fake
  B -.-> m4(.):::fake
  n1 -->|11| n2(.)
  n2 --> |110| C:::endnode
  n2 --> |111| D:::endnode
  classDef fake fill:#ddd;

For example, in the tree above, node $A$ has 4 nodes corresponding to it at depth 3, while node $B$ has 2 nodes.

The nodes at depth $l_{\max}$ corresponding to different leaf nodes $n_i$ are distinct (think about why: a leaf node cannot be an ancestor of another leaf node). As the total number of nodes at depth $l_{\max}$ is at least the number of nodes at depth $l_{\max}$ corresponding to the leaf nodes $n_i$, we get the inequality

$\sum_{i=1}^{k} 2^{l_{\max} - l_i} \leq 2^{l_{\max}}$

This completes the proof sketch for the Kraft Inequality:

$\sum_{i=1}^{k} 2^{-l_i} \leq 1$

Well, that was a short and simple proof! It is also clear that equality holds if and only if no node at the maximum depth is left unaccounted for, i.e. every node at that depth corresponds to one of the leaf nodes.
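
The counting argument can be checked numerically. The following is a minimal sketch in Python, not part of the original notes; the helper names `kraft_sum` and `descendants_at_max_depth` are made up for illustration, and the depths are read off the example tree (leaves A, B, C, D at depths 1, 2, 3, 3).

```python
from fractions import Fraction


def kraft_sum(depths):
    """Return sum_i 2^(-l_i) exactly, using rational arithmetic."""
    return sum(Fraction(1, 2 ** l) for l in depths)


def descendants_at_max_depth(depths):
    """For each leaf, count the nodes at the maximum depth that it accounts for."""
    l_max = max(depths)
    return [2 ** (l_max - l) for l in depths]


# Leaf depths of A, B, C, D in the example tree (hypothetical variable name).
depths = [1, 2, 3, 3]
counts = descendants_at_max_depth(depths)

print(counts)                          # [4, 2, 1, 1]
print(sum(counts), 2 ** max(depths))   # 8 8
print(kraft_sum(depths))               # 1
```

Here the counts add up to exactly $2^3 = 8$, so the Kraft sum equals 1. Dropping leaf D (depths 1, 2, 3) leaves one depth-3 node unaccounted for, and the sum falls to 7/8.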

We can use the Kraft inequality to now show the correctness of the prefix-free tree construction with code-lengths $l_{i}$ , as we discussed in last lecture.

Prefix-tree construction correctness

To recap, our prefix-free tree construction proceeds as follows:

We are given a probability distribution $p_1, p_2, \dots, p_k$ for symbols $s_1, s_2, \dots, s_k$. WLOG assume that $p_1 \geq p_2 \geq \dots \geq p_k$. Now, compute code-lengths $l_1, l_2, \dots, l_k$ such that $l_i = \lceil \log_2 \frac{1}{p_i} \rceil$. Thus, the lengths satisfy $l_1 \leq l_2 \leq \dots \leq l_k$.
The prefix-free tree construction starts with an empty binary tree and then adds, one at a time for $i = 1, \dots, k$, a leaf node $n_i$ at depth $l_i$.

We want to argue the correctness of the tree construction at each step, i.e. we want to show that whenever we are adding a node $n_r$, there will always be a position available at depth $l_r$ for us to do so.

Let's proceed towards showing this inductively.

In the beginning, we just have the root node, so we can safely add the node $n_1$ with length $l_1$. To do so, we create a complete binary tree of depth $l_1$, which has $2^{l_1}$ leaf nodes, and assign one of these leaf nodes to node $n_1$.
Now, let's assume that we already have a binary tree $T_{r-1}$ with nodes $n_1, n_2, \dots, n_{r-1}$ and that we want to add node $n_r$. We want to argue that there will always be a position available at depth $l_r$ in the binary tree $T_{r-1}$. Let's see how we can show that:
If we look at the code-lengths, i.e. the depths of the nodes $l_i$, we see that they satisfy the Kraft inequality: $\sum_{i=1}^{k} 2^{-l_i} = \sum_{i=1}^{k} 2^{-\lceil \log_2 \frac{1}{p_i} \rceil} = \sum_{i=1}^{k} 2^{\lfloor \log_2 p_i \rfloor} \leq \sum_{i=1}^{k} 2^{\log_2 p_i} = \sum_{i=1}^{k} p_i = 1$. Since every term of the sum is positive, $\sum_{i=1}^{k} 2^{-l_i} \leq 1$ implies that the node depths of the tree $T_{r-1}$ satisfy the strict inequality $\sum_{i=1}^{r-1} 2^{-l_i} < 1$ (for $r \leq k$).
We know from the proof of the Kraft inequality that if the inequality is not tight, then some node at depth $\max_{i=1}^{r-1} l_i = l_{r-1}$ is not accounted for by any of the existing leaf nodes, and is therefore available. Now, as $l_r \geq l_{r-1}$, we can place $n_r$ at this available node (or at depth $l_r$ below it), so the node $n_r$ can be added to the binary tree $T_{r-1}$.
This completes the correctness proof.
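
The key step of the induction, that the partial Kraft sum stays strictly below 1 before each new node is added, can be illustrated numerically. This is a minimal sketch under assumed inputs: the distribution `probs` and the function name `shannon_lengths` are hypothetical, chosen only for illustration. The lengths are computed in floating point, which is adequate for this example but could misround for probabilities extremely close to a power of 2.

```python
import math
from fractions import Fraction


def shannon_lengths(probs):
    """Code-lengths l_i = ceil(log2(1/p_i)); assumes every p_i > 0."""
    return [math.ceil(-math.log2(p)) for p in probs]


# Hypothetical distribution, already sorted in decreasing order.
probs = [0.5, 0.25, 0.15, 0.1]
lengths = shannon_lengths(probs)
print(lengths)  # [1, 2, 3, 4]

partial = Fraction(0)
for r, l in enumerate(lengths, start=1):
    # Before adding n_r, the Kraft sum over n_1..n_{r-1} is strictly below 1,
    # so some node at depth l_{r-1} is still unaccounted for.
    assert partial < 1
    partial += Fraction(1, 2 ** l)

print(partial)  # 15/16
```

The final sum is 15/16, strictly less than 1, which reflects that rounding $\log_2 \frac{1}{p_i}$ up can leave part of the tree unused.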

In fact if you look at the proof, all we have used is the fact that $\sum_{i = 1}^{k} 2^{- l_{i}} \leq 1$ . Thus, the same proof also gives us the following converse theorem of the Kraft inequality:

Theorem-2: Converse of Kraft Inequality

Let $l_i \in \mathbb{N}$ such that: $\sum_{i=1}^{k} 2^{-l_i} \leq 1$

then, we can always construct a binary tree with $k$ leaf nodes at depths $l_i$ from the root node for $i \in \{1, \dots, k\}$.

Kraft inequality and the thumb rule

In the last chapter we introduced the thumb rule $l_{\text{optimal}}(\text{symbol}) \approx \log_2 \frac{1}{p(\text{symbol})}$ without any justification. Since then we have seen a code construction that gets close to this thumb rule. Here we briefly sketch how the Kraft inequality can be used to justify the thumb rule. Details can be found in Section 5.3 of the extremely popular book "Elements of Information Theory" by Cover and Thomas. The idea is to consider the optimization problem

$\min_{l_1, \dots, l_k} \sum_{i=1}^{k} p_i l_i \quad \text{subject to} \quad \sum_{i=1}^{k} 2^{-l_i} \leq 1$

This is simply the problem of minimizing the expected code-length of a prefix-free code (the Kraft inequality gives a mathematical way to express the prefix-free constraint). While this is an integer optimization problem (because the code-lengths are integers) and hence is hard to solve directly, we can relax the integer constraint and solve it over all positive real $l_i$. The method of Lagrange multipliers from convex optimization can then be used to solve this relaxed problem and obtain the optimal code-lengths $l_i = -\log_2 p_i$.

In the next chapter, we will look at this again and derive the thumb rule in a different way (though the Kraft inequality still plays a crucial role).
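
For readers who want to see the Lagrange multiplier step spelled out, here is a short worked sketch of the relaxed problem. It assumes every $p_i > 0$ and that the constraint is tight at the optimum; the symbols $J$ and $\lambda$ are introduced here for illustration and are not from the original notes. See Cover and Thomas for a rigorous treatment.

```latex
\begin{align*}
% Lagrangian of the relaxed problem, with multiplier \lambda for the Kraft constraint
J(l_1, \dots, l_k, \lambda) &= \sum_{i=1}^{k} p_i l_i
    + \lambda \left( \sum_{i=1}^{k} 2^{-l_i} - 1 \right) \\
% Setting the derivative with respect to each l_i to zero
\frac{\partial J}{\partial l_i} &= p_i - \lambda (\ln 2) \, 2^{-l_i} = 0
    \quad \Longrightarrow \quad 2^{-l_i} = \frac{p_i}{\lambda \ln 2} \\
% Substituting into the tight constraint and using \sum_i p_i = 1
\sum_{i=1}^{k} 2^{-l_i} &= \frac{1}{\lambda \ln 2} \sum_{i=1}^{k} p_i
    = \frac{1}{\lambda \ln 2} = 1
    \quad \Longrightarrow \quad \lambda = \frac{1}{\ln 2} \\
% Plugging \lambda back in
2^{-l_i} &= p_i \quad \Longrightarrow \quad l_i = -\log_2 p_i = \log_2 \frac{1}{p_i}
\end{align*}
```

With these lengths the objective evaluates to $\sum_{i=1}^{k} p_i \log_2 \frac{1}{p_i}$, which is the quantity the thumb rule approximates once the lengths are rounded up to integers.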