# Prefix(-Free) Codes
As a recap, in the previous lecture we discussed a simple example of a prefix-free code, along with a simple procedure for decoding data encoded with it. In this lecture we will think about how to actually implement the decoding, and also how to design the prefix-free code itself.
Let's start by thinking about the decoding. As a recap, here is our prefix-free code and the decoding procedure:
Symbol | Codewords |
---|---|
A | 0 |
B | 10 |
C | 110 |
D | 111 |
### Prefix-free decoding procedure
```python
codewords_table = {"A": "0", "B": "10", "C": "110", "D": "111"}

def decode_symbol(encoded_bitstring, codewords_table):
    temp_bits = ""
    # read bits until we find a match with a codeword in the codewords_table
    while not find_match(temp_bits, codewords_table):
        temp_bits += encoded_bitstring.read_next_bit()
    decoded_symbol = find_match(temp_bits, codewords_table)
    return decoded_symbol
```
The procedure is as follows:
- We start reading bits from the encoded bitstream and try to find a match with one of the codewords. We stop when we get our first match.
- Note that the decoding procedure above is correct because we have a prefix-free code, and so we can stop searching as soon as `find_match` obtains its first match.
Let's think about how we can implement the `find_match` function.

**Idea 1: Hash tables.** One simple way to implement the `find_match` function in the decoding is via a hash table. We can just create a hash table using the `codewords_table`, and then query it with `temp_bits` until we get a match.
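Here is a minimal sketch of what such a `find_match` implementation could look like. The dictionary-based lookup is an illustrative assumption, not the only way to do it:

```python
def find_match(temp_bits, codewords_table):
    # invert the table: codeword -> symbol
    # (in practice, precompute this inverse table once instead of
    # rebuilding it on every call)
    inverse_table = {code: symbol for symbol, code in codewords_table.items()}
    # returns the matching symbol, or None if temp_bits is not a codeword
    return inverse_table.get(temp_bits)
```

Note that with this implementation, decoding each symbol requires roughly as many hash-table queries as the codeword length.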
Can we think of a simple data structure which will make this procedure faster?
## Prefix-free Tree
Another approach to prefix-free code decoding is to construct a *prefix-free tree* and use this tree for decoding. Let's get back to our prefix-free code example to see what I mean by that. Given the codewords table, we can represent it as a binary tree as follows:
```mermaid
graph TD
  *(Root) -->|0| A:::endnode
  *(Root) -->|1| n1(.)
  n1 -->|10| B:::endnode
  n1 -->|11| n2(.)
  n2 -->|110| C:::endnode
  n2 -->|111| D:::endnode
```
- The idea behind the prefix-free tree construction is simple. For each codeword, we add a node at depth `len(codeword)` from the root node, taking the right path or the left path from the previous node in the tree depending on whether the corresponding bit in the codeword is `1` or `0`. E.g., in our codewords table, `B -> 10`. In this case we add a node to the binary tree at depth `2 = len("10")`, corresponding to the path `10 -> right, left` from the root node. Similarly, for `C -> 110`, we add a node at depth `3` corresponding to the path `110 -> right, right, left` (see the code sketch after this list).
- Notice that for prefix-free codes, the codewords correspond to the leaf nodes. This can be shown by contradiction: if a node `n2` corresponding to codeword `c2` sprouted out of a node `n1` (with codeword `c1`), then based on the construction of the prefix-free tree defined in the previous step, `c1` would be a prefix of `c2`. This is a contradiction because it would violate the prefix-free property of our code. In fact, the property that the codewords correspond to the leaf nodes of the prefix-free tree is another way to define prefix-free codes!
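To make the construction concrete, here is a minimal sketch of how one could build such a tree in code. The `Node` class and `build_prefix_tree` function are illustrative names, not from the lecture:

```python
class Node:
    def __init__(self):
        self.left = None    # child along bit 0
        self.right = None   # child along bit 1
        self.symbol = None  # set only for leaf nodes corresponding to codewords

    @property
    def is_leaf_node(self):
        return self.left is None and self.right is None

def build_prefix_tree(codewords_table):
    root = Node()
    for symbol, codeword in codewords_table.items():
        node = root
        for bit in codeword:
            # walk right on '1' and left on '0', creating nodes as needed
            if bit == "1":
                if node.right is None:
                    node.right = Node()
                node = node.right
            else:
                if node.left is None:
                    node.left = Node()
                node = node.left
        # the node at depth len(codeword) becomes the leaf for this symbol
        node.symbol = symbol
    return root
```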
Okay, now that we have understood how to create the prefix-free tree data structure for any prefix-free code, can we use this tree structure to improve the decoding algorithm? Take a look!
## Prefix-free code decoding using a prefix-free tree
```python
prefix_tree = ...  # construct the tree here using the codewords

def decode_symbol(encoded_bitarray, prefix_tree):
    # start from the root node and read bits until we reach a leaf node
    node = prefix_tree.root
    while not node.is_leaf_node:
        bit = encoded_bitarray.read_next_bit()
        node = node.right if bit else node.left
    # once we have reached a leaf node, return the symbol corresponding to it
    return node.symbol
```
Some observations:
- This decoding scheme is similar in logic to the previous one, but quite a bit more efficient, as we are not querying the hash table multiple times per symbol.
- The key idea is that for prefix-free codes the codewords correspond to leaf nodes. Hence, we just traverse the tree with the bits from the encoded bitstream until we reach one of the leaf nodes.
- Note that we need to create the prefix-free tree during the pre-processing step of decoding as a one-time cost, which typically gets amortized if the data being decoded is quite big. (A small end-to-end usage sketch follows below.)
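As a usage sketch, here is how the pieces could fit together end to end, building on the hypothetical `build_prefix_tree` above; the `BitReader` helper and `decode` wrapper are also illustrative assumptions:

```python
class BitReader:
    def __init__(self, bits):
        self.bits = bits  # e.g. the string "011010"
        self.pos = 0

    def read_next_bit(self):
        bit = int(self.bits[self.pos])
        self.pos += 1
        return bit

def decode(encoded_bits, codewords_table):
    root = build_prefix_tree(codewords_table)
    reader = BitReader(encoded_bits)
    decoded_symbols = []
    while reader.pos < len(encoded_bits):
        # walk from the root to a leaf, one bit at a time
        node = root
        while not node.is_leaf_node:
            node = node.right if reader.read_next_bit() else node.left
        decoded_symbols.append(node.symbol)
    return "".join(decoded_symbols)

print(decode("011010", {"A": "0", "B": "10", "C": "110", "D": "111"}))
# prints: ACB
```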
As we will see later, the prefix-free tree is not just useful for efficient decoding, but is also a great way to think about and visualize prefix-free codes. We will in fact use this structure when we learn how to design a good prefix-free code.
## How to design good prefix-free codes?
Okay, now that we have convinced ourselves that prefix-free codes are indeed lossless and that they have an efficient decoding scheme, let's revisit our code and think again about why the scheme leads to better compression.

To recap: we wanted to design a code for a skewed non-uniform distribution, and we started with the following Fixed Bitwidth code.
Symbol | Codewords |
---|---|
A | 00 |
B | 01 |
C | 10 |
D | 11 |
Then we were able to construct a variable-length code, which is indeed lossless and improves the compression from `2 bits/symbol` to `1.53 bits/symbol` on average.
Symbol | Codewords |
---|---|
A | 0 |
B | 10 |
C | 110 |
D | 111 |
How did this improvement come about?
- The Fixed Bitwidth code assigns equal importance to all the symbols, as the code-lengths are the same (`2 bits/symbol`). This seems all good in case the symbols are equiprobable, i.e. if the probabilities are $p(A) = p(B) = p(C) = p(D) = \frac{1}{4}$.
- But in the case of the skewed probability distribution, the symbols $A, B$ are clearly more important, as they occur much more frequently. So, we should try to assign shorter codewords to $A, B$, and can afford to assign longer codewords to $C, D$, as they occur much less frequently.
Thus we have a simple and somewhat obvious thumb rule for a code:
Given a distribution, it is better (in terms of compression) to assign shorter codewords to symbols with higher probability.
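A quick way to see why (a one-line check, where $p_i$ and $l_i$ denote the probability and codeword length of symbol $i$): if $p_i > p_j$ but $l_i > l_j$, then swapping the two codewords changes the average codelength by

$$(p_i l_j + p_j l_i) - (p_i l_i + p_j l_j) = (p_i - p_j)(l_j - l_i) < 0,$$

i.e. the swap strictly reduces the average codelength, so shorter codewords should indeed go to the more probable symbols.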
Even though we want to assign shorter codes to symbols with higher probability, it is not clear what the proportionality should be.
For example, the prefix-free code we have works well for the skewed distribution and gives us an average codelength of `1.53 bits/symbol`. But the same code doesn't work well for a less skewed distribution, e.g. $p(A) = 0.3, p(B) = 0.3, p(C) = 0.2, p(D) = 0.2$, as in this case the average codelength is $0.3 \cdot 1 + 0.3 \cdot 2 + 0.2 \cdot 3 + 0.2 \cdot 3 = 2.1$ bits/symbol (which is even higher than the Fixed Bitwidth code!).
This problem (and much more) was definitively analyzed in the brilliant work by Claude Shannon in his 1948 paper *A Mathematical Theory of Communication*. Shannon's work laid the theoretical foundations not just of the field of data compression, but even more so of the area of error-free data communication. (The paper remains surprisingly accessible and short given its legendary nature. Do take a look!)
We will definitely come back to this work in future lectures to understand it in more detail, but here is another thumb rule from Shannon's work for the optimal code-lengths: for a symbol with probability $p$, the optimal code-length is $\log_2 \frac{1}{p}$.

Note that this is still a thumb rule, until we prove it and show convincingly in which scenarios it holds true. But let's try to verify this rule in a couple of simple cases:
- **Uniform Distribution**: Let's consider the case of $p(A) = p(B) = p(C) = p(D) = \frac{1}{4}$. We know that, as we have 4 equiprobable symbols, we can encode the data using `2 bits/symbol`. This matches the thumb rule above: $\log_2 \frac{1}{p} = \log_2 4 = 2$. In general, we see that if the probability is $\frac{1}{M}$ for all the $M$ symbols of a distribution, then the optimal codelength according to the thumb rule is going to be $\log_2 M$. Now, in cases where $M$ is not a power of $2$, and if we have a single unique codeword per symbol, then the best we can achieve is $\lceil \log_2 M \rceil$ bits per symbol.
- **Dyadic distribution**: Based on the thumb rule, we need $\log_2 \frac{1}{p}$ to be an integer, which is only possible if $p$ is a negative power of $2$ for every symbol. Such distributions are known as *dyadic distributions*:

  > **Dyadic distribution**: A distribution $P$ is called dyadic if $p(s) = 2^{-k_s}$, for some integer $k_s$, for every symbol $s$.

  Based on the thumb rule, it is unambiguously clear that for a dyadic distribution, a symbol with probability $p(s) = 2^{-k_s}$ should have a codeword of length $k_s = \log_2 \frac{1}{p(s)}$. For example, for the following dyadic distribution, the code-lengths are going to be:

  Symbol | Prob | Optimal codelength |
  ---|---|---|
  A | 1/2 | 1 |
  B | 1/4 | 2 |
  C | 1/8 | 3 |
  D | 1/8 | 3 |
- **General Distributions**: For generic distributions, $\log_2 \frac{1}{p(s)}$ might not be an integer, and so these code-lengths might not be achievable in general. But a good goal to aim for is $l(s) = \lceil \log_2 \frac{1}{p(s)} \rceil$. Is this possible? Let's look at this problem in the next section.
## Designing prefix-free codes
Okay! So, now that we know that the code-lengths to shoot for are $l(s) = \lceil \log_2 \frac{1}{p(s)} \rceil$, let's try to think about how to construct a prefix-free code with these code-lengths.
Let's take a simple example (see below) and see if we can try to come up with a prefix-free code with the prescribed lengths.
Symbol | Prob | Optimal codelength |
---|---|---|
A | 0.55 | 1 |
B | 0.25 | 2 |
C | 0.1 | 4 |
D | 0.1 | 4 |
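As a quick sanity check, the prescribed lengths in the table indeed follow from the $\lceil \log_2 \frac{1}{p} \rceil$ rule; here is a throwaway snippet (not part of the lecture code) verifying them:

```python
import math

probs = {"A": 0.55, "B": 0.25, "C": 0.1, "D": 0.1}
for symbol, p in probs.items():
    # ceil(log2(1/p)) is the code-length we aim for
    print(symbol, math.ceil(math.log2(1 / p)))
# prints: A 1, B 2, C 4, D 4
```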
We know that all prefix-free codes have a corresponding prefix-free tree. So, essentially, we want to come up with a binary tree whose leaf nodes are at distances from the root node equal to the code-lengths. For example, in the above example, we want to construct a binary tree with leaf nodes at distances `1, 2, 4, 4` from the root node. Note that there can be some additional leaf nodes in the binary tree which are not assigned to any codeword.
- Let's start with nodes at distance `1`. We know a binary tree has `2^1 = 2` nodes at distance `1` from the root node. These correspond to codewords `0` (left node) and `1` (right node). Let's make the left node a leaf node corresponding to symbol $A$ (with codelength `1`, as needed).
```mermaid
graph TD
  *(Root) -->|0| A:::endnode
  *(Root) -->|1| n1(.)
```
- Now we have the right node (corresponding to codeword `1`), which we can split further into two nodes corresponding to `10` and `11`. As we need to assign a codeword of length `2`, let's assign the node corresponding to `10` to symbol $B$ and make it a leaf node.
```mermaid
graph TD
  *(Root) -->|0| A:::endnode
  *(Root) -->|1| n1(.)
  n1 -->|10| B:::endnode
  n1 -->|11| n2(.)
```
- We are now again left with the right node `11`, which we can split further into two nodes `110` and `111` at distance `3` from the root node. Now, looking at the table of code-lengths, we do not have any codeword of length `3` to assign, so let's split node `110` further into nodes corresponding to `1100` and `1101`. We can now assign these two nodes to symbols $C, D$ respectively.
```mermaid
graph TD
  *(Root) -->|0| A:::endnode
  *(Root) -->|1| n1(.)
  n1 -->|10| B:::endnode
  n1 -->|11| n2(.)
  n2 -->|110| n3(.)
  n2 -->|111| n4(.)
  n3 -->|1100| C:::endnode
  n3 -->|1101| D:::endnode
```
Thus, our final codewords are:
Symbol | Prob | Optimal codelength | Codeword |
---|---|---|---|
A | 0.55 | 1 | 0 |
B | 0.25 | 2 | 10 |
C | 0.1 | 4 | 1100 |
D | 0.1 | 4 | 1101 |
Notice that the leaf node corresponding to `111` was not linked to any symbol. That is absolutely okay! It just implies that we could have squeezed in another symbol here with a codeword of length `3`, and it points to the slight inefficiency in using the approximation $l(s) = \lceil \log_2 \frac{1}{p(s)} \rceil$.
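For concreteness, a quick arithmetic check of this inefficiency (computed from the table above, not stated in the lecture): the average codelength of the constructed code is

$$0.55 \cdot 1 + 0.25 \cdot 2 + 0.1 \cdot 4 + 0.1 \cdot 4 = 1.85 \text{ bits/symbol},$$

slightly higher than the $\approx 1.64$ bits/symbol that the un-rounded thumb-rule lengths $\log_2 \frac{1}{p}$ would suggest.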
The procedure we followed seems general enough that we can apply it to any distribution:
- Sort the distribution in descending order and compute the code-lengths as $l(s) = \lceil \log_2 \frac{1}{p(s)} \rceil$.
- Going through the symbols in this order, assign to each symbol the first lexicographically available codeword of the stipulated length, i.e. the smallest codeword of that length which does not have any previously assigned codeword as a prefix (a code sketch of this procedure follows below).
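Here is a minimal sketch of this procedure in code. The function name and the codeword bookkeeping are illustrative assumptions, and, as discussed next, we have not yet argued that the procedure always succeeds:

```python
import math

def construct_prefix_free_code(prob_dist):
    # process symbols in order of decreasing probability,
    # i.e. non-decreasing code-length
    sorted_symbols = sorted(prob_dist, key=prob_dist.get, reverse=True)
    codewords = {}
    code, prev_len = 0, 0  # integer value and length of the next free codeword
    for symbol in sorted_symbols:
        length = math.ceil(math.log2(1 / prob_dist[symbol]))
        # extend the next free codeword to the required length by appending
        # zeros; this keeps it lexicographically smallest
        code <<= (length - prev_len)
        codewords[symbol] = format(code, "0{}b".format(length))
        code += 1  # move past the assigned codeword
        prev_len = length
    return codewords

print(construct_prefix_free_code({"A": 0.55, "B": 0.25, "C": 0.1, "D": 0.1}))
# prints: {'A': '0', 'B': '10', 'C': '1100', 'D': '1101'}
```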
We still need to argue the correctness of the procedure and show that it works in all cases! For example, how can we be sure that we will not run out of nodes?
We will look at all these problems in the next lecture.
## Summary & Next Lecture
To summarize, here are the key takeaways from this lecture:
- **Prefix-free codes**: Among the symbol codes, prefix-free codes allow for efficient and convenient encoding/decoding. In fact, as we will see, given any uniquely decodable code (or lossless code), we can construct a prefix-free code with the exact same code-lengths and thus the same compression performance. Thus, among the symbol codes, we restrict ourselves to the analysis of prefix-free codes.
- **Prefix-free tree**: Any prefix-free code has a prefix-free binary tree associated with it, whose leaf nodes correspond to the codewords.
- **Compression thumb rule**: One somewhat obvious thumb rule we learnt for symbol codes is that if $p_i > p_j$ then we should have $l_i \le l_j$, else we could swap the codewords for the two symbols and achieve a lower average codelength.
- **Neg log-likelihood**: Another thumb rule we learnt is that the optimal code-lengths for a prefix-free code are $l(s) = \log_2 \frac{1}{p(s)}$. We will understand why this rule is true in the next lecture.
- **Prefix-free code construction**: We looked at a procedure for constructing prefix-free codes for any given distribution. In the next lecture, we will justify the correctness of this procedure and also discuss a couple more ideas for constructing prefix-free codes.