Rate Distortion Theory
Entropy, Conditional entropy Recap:
In the case of lossless compression we saw information theoretic quantities such as the entropy $H(X)$, the conditional entropy $H(X|Y)$, etc.
To recap:
- Entropy: Let $X$ be a random variable with alphabet $\mathcal{X}$ and discrete probability distribution $p_X$, i.e. one can imagine the samples generated by the random variable to be independent and identically distributed as per distribution $p_X$.
  Then the entropy of the random variable $X$ is defined as:
  $$H(X) = \sum_{x \in \mathcal{X}} p_X(x) \log_2 \frac{1}{p_X(x)}$$
- Joint Entropy: The joint entropy $H(X,Y)$ of discrete random variables $X, Y$ with joint distribution $p_{X,Y}$ is simply the entropy of the joint random variable $(X,Y)$:
  $$H(X,Y) = \sum_{x,y} p_{X,Y}(x,y) \log_2 \frac{1}{p_{X,Y}(x,y)}$$
- Conditional Entropy: The conditional entropy $H(Y|X)$ between two discrete random variables $X, Y$ is defined as:
  $$H(Y|X) = \sum_{x,y} p_{X,Y}(x,y) \log_2 \frac{1}{p_{Y|X}(y|x)}$$
  Note that $H(Y|X)$ has the alternative forms:
  $$H(Y|X) = H(X,Y) - H(X) = \sum_{x \in \mathcal{X}} p_X(x) H(Y|X=x)$$
  i.e. it is the average of the entropies $H(Y|X=x)$, weighted by $p_X(x)$. A small numerical illustration of these quantities is sketched below.
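As a quick illustration of these definitions, here is a minimal sketch (not part of the original notes; the joint distribution is made up) that computes $H(X)$, $H(X,Y)$ and $H(Y|X)$ with numpy:

```python
import numpy as np

# A toy joint distribution p(x, y) over a 2x2 alphabet (values chosen arbitrarily).
p_xy = np.array([[0.4, 0.1],
                 [0.2, 0.3]])

def entropy(p):
    """Entropy in bits of a distribution given as an array of probabilities."""
    p = p[p > 0]  # convention: 0 * log2(1/0) = 0
    return np.sum(p * np.log2(1 / p))

H_xy = entropy(p_xy.flatten())   # joint entropy H(X,Y)
H_x = entropy(p_xy.sum(axis=1))  # marginal entropy H(X)
H_y_given_x = H_xy - H_x         # conditional entropy H(Y|X) = H(X,Y) - H(X)

print(f"H(X)={H_x:.3f}, H(X,Y)={H_xy:.3f}, H(Y|X)={H_y_given_x:.3f}")
```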
Mutual Information
For today's discussion, another important information theoretic quantity which will be of interest to us is the mutual information.
Let $X, Y$ be two random variables with joint distribution $p_{X,Y}$. Then we define the mutual information between $X$ and $Y$ as:
$$I(X;Y) = H(X) + H(Y) - H(X,Y)$$
One intuitive explanation for mutual information is that it is the difference between the sum of the individual entropies and the joint entropy of the two random variables $X, Y$, and so in a way it captures how much information is common between $X$ and $Y$.
Mutual information has some nice properties, which we will use:
- property-1: $I(X;Y) = I(Y;X)$: It is clear from the symmetry of the definition that the mutual information between $X$ and $Y$ is equal to the mutual information between $Y$ and $X$.
- property-2: $I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$: This can be shown using the property of joint entropy: $H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)$.
- property-3: $I(X;Y) = D_{KL}(p_{X,Y} \| p_X p_Y)$: the mutual information is the KL-divergence between the joint distribution and the product of the marginals,
  $$I(X;Y) = \sum_{x,y} p_{X,Y}(x,y) \log_2 \frac{p_{X,Y}(x,y)}{p_X(x) p_Y(y)}$$
- property-4: $I(X;Y) \geq 0$: This follows from the non-negativity of the KL-divergence and property-3. (A small numerical check of these properties is sketched below.)
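To make the properties concrete, here is a small sketch (illustrative only, with an arbitrarily chosen joint distribution) that computes $I(X;Y)$ and numerically checks properties 2 and 4:

```python
import numpy as np

# Toy joint distribution p(x, y), chosen arbitrarily for illustration.
p_xy = np.array([[0.4, 0.1],
                 [0.2, 0.3]])

def entropy(p):
    p = p[p > 0]  # convention: 0 * log2(1/0) = 0
    return np.sum(p * np.log2(1 / p))

p_x = p_xy.sum(axis=1)
H_x = entropy(p_x)
H_y = entropy(p_xy.sum(axis=0))
H_xy = entropy(p_xy.flatten())

I_xy = H_x + H_y - H_xy  # I(X;Y) = H(X) + H(Y) - H(X,Y)

# property-2: I(X;Y) = H(Y) - H(Y|X), with H(Y|X) computed directly
H_y_given_x = sum(p_x[i] * entropy(p_xy[i] / p_x[i]) for i in range(len(p_x)))
assert np.isclose(I_xy, H_y - H_y_given_x)
# property-4: I(X;Y) >= 0
assert I_xy >= -1e-12
print(f"I(X;Y) = {I_xy:.4f} bits")
```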
Mutual information has a very important connection to lossy compression, and beyond that it plays a central role in information theory in general. For more information, look at the EE276 course notes on communication capacity.
Lossy Compression setup
Let us recap the lossy compression setup, and now that we are going to discuss this in detail, let us define it more concretely.
Let's say we are given a sequence of random variables $X_1, X_2, \ldots, X_k$, $X_i \sim p_X$. Our goal is to encode this sequence to $n$ bits using a lossy encoder. We also decode back the bits to a reconstruction $Y_1, Y_2, \ldots, Y_k$ using a lossy decoder.
```
## Encoding
X_1, X_2, ..., X_k ====> [LOSSY ENCODER] ==> 0100011...1 (n = log2(N) bits)

## Decoding
0100011...1 (n = log2(N) bits) ======> [LOSSY DECODER] ==> Y_1, Y_2, ..., Y_k
```
At this point, we need a few more terms to understand the performance of our lossy compressor:
- Rate $R$: the compression rate is defined as the average number of bits used per source symbol:
  $$R = \frac{n}{k} = \frac{\log_2 N}{k}$$
- Distortion $d$: As the compression is not lossless, we also need to know how far the reconstruction $Y_1, \ldots, Y_k$ is from the input $X_1, \ldots, X_k$.
  For simplicity, let's stick to per-symbol distortions such as the mean square error or the Hamming distortion. Thus,
  $$d(X_1^k, Y_1^k) = \frac{1}{k} \sum_{i=1}^{k} d(X_i, Y_i)$$
  For example:
  $$d_{mse}(x, y) = (x - y)^2, \qquad d_{hamming}(x, y) = \mathbb{1}_{x \neq y}$$
- Average Distortion $D$: We mainly care about the average (expected) distortion over the random variables $X_1^k, Y_1^k$:
  $$D = \mathbb{E}\left[\frac{1}{k} \sum_{i=1}^{k} d(X_i, Y_i)\right]$$
  A toy numerical example of the rate and average distortion follows below.
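For concreteness, here is a minimal sketch (not from the original notes) of a rate-1 scalar quantizer for unit Gaussian data: each symbol is encoded with a single bit (its sign) and decoded to $\pm\sqrt{2/\pi}$, the conditional mean, giving an average MSE distortion of about $1 - 2/\pi \approx 0.36$.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 100_000                 # number of source symbols
x = rng.standard_normal(k)  # X_1, ..., X_k ~ N(0, 1) i.i.d.

# 1-bit scalar quantizer: each symbol is encoded to its sign bit,
# and decoded to +/- c with c = E[X | X > 0] = sqrt(2/pi).
c = np.sqrt(2 / np.pi)
bits = x > 0                # n = k bits in total
y = np.where(bits, c, -c)   # reconstruction Y_1, ..., Y_k

rate = bits.size / k        # R = n / k = 1 bit per symbol
avg_distortion = np.mean((x - y) ** 2)
print(f"rate = {rate}, average MSE distortion = {avg_distortion:.3f}")  # ~ 1 - 2/pi ~ 0.36
```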
Rate-distortion function
One interesting question to answer is: "What is the best rate $R$ we can achieve for distortion at most $D$?" i.e. if the target distortion of our lossy compressor is $D$, what is the lowest rate to which we can compress the data? We define this optimal rate to be $R(D)$, the rate-distortion function.
Thus:
$$R(D) = \min\{R : \text{rate } R \text{ is achievable with average distortion at most } D\}$$
This is the precise problem Shannon solved.
Let $X_1, X_2, \ldots$ be data generated i.i.d. as $X \sim p_X$. Then, the optimal rate $R(D)$ for a given maximum distortion $D$ is:
$$R(D) = \min_{p_{Y|X} :\ \mathbb{E}[d(X,Y)] \leq D} I(X;Y)$$
where the expectation in the minimum is over the joint distributions $p_X(x)\,p_{Y|X}(y|x)$, where $p_{Y|X}(y|x)$ are arbitrary conditional distributions.
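For finite alphabets, this minimization can be evaluated numerically. One standard method (not part of these notes, sketched here only for illustration) is the Blahut-Arimoto algorithm, which alternately updates the conditional distribution $p_{Y|X}$ and the output marginal $p_Y$ for a given Lagrange parameter $\beta$ trading off rate and distortion:

```python
import numpy as np

def rate_distortion_point(p_x, dist, beta, num_iter=500):
    """One (D, R) point on the rate-distortion curve via Blahut-Arimoto.

    p_x  : source distribution, shape (|X|,)
    dist : distortion matrix d(x, y), shape (|X|, |Y|)
    beta : Lagrange parameter >= 0 (larger beta -> smaller distortion)
    """
    q_y = np.full(dist.shape[1], 1.0 / dist.shape[1])  # initial output marginal
    for _ in range(num_iter):
        # optimal test channel p(y|x) for the current output marginal
        w = q_y[None, :] * np.exp(-beta * dist)
        p_y_given_x = w / w.sum(axis=1, keepdims=True)
        # output marginal induced by the current test channel
        q_y = p_x @ p_y_given_x
    D = np.sum(p_x[:, None] * p_y_given_x * dist)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = p_y_given_x * np.log2(p_y_given_x / q_y[None, :])
    R = np.sum(p_x[:, None] * np.nan_to_num(terms))  # I(X;Y) in bits
    return D, R

# Example: Bern(0.5) source with Hamming distortion; compare with R(D) = 1 - h_b(D).
p_x = np.array([0.5, 0.5])
dist = np.array([[0.0, 1.0], [1.0, 0.0]])
for beta in [1.0, 2.0, 4.0]:
    D, R = rate_distortion_point(p_x, dist, beta)
    print(f"beta={beta}: D={D:.3f}, R={R:.3f}")
```

Sweeping $\beta$ from small to large traces out the $R(D)$ curve from the high-distortion end to the low-distortion end.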
Here are some examples:
Example-1: $R(D)$ for a Bernoulli r.v.:
Let $X \sim Bern(p)$, and let $d(x,y) = \mathbb{1}_{x \neq y}$, i.e. the Hamming distortion. Then:
$$R(D) = \begin{cases} h_b(p) - h_b(D) & \text{if } D \leq \min(p, 1-p) \\ 0 & \text{otherwise} \end{cases}$$
where $h_b(\cdot)$ is the binary entropy function, $h_b(q) = q\log_2\frac{1}{q} + (1-q)\log_2\frac{1}{1-q}$.
The formula is not very intuitive. But, it is clear that:
- $R(D)$ decreases as $D$ increases; this should be expected, as the number of bits required should reduce if the allowed distortion increases.
- $R(D) = 0$ if $D \geq p$; this is also quite intuitive: if the allowed distortion is at least $p$, we can always just decode all zeros, $Y_1^k = (0, \ldots, 0)$. For the all-zeros reconstruction, the average Hamming distortion is $p$.
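A quick numeric sanity check of the closed-form expression (a minimal sketch; the values of $p$ and $D$ below are arbitrary):

```python
import numpy as np

def h_b(q):
    """Binary entropy function in bits, with the convention 0*log2(1/0) = 0."""
    q = np.clip(q, 1e-12, 1 - 1e-12)
    return q * np.log2(1 / q) + (1 - q) * np.log2(1 / (1 - q))

def rd_bernoulli(p, D):
    """R(D) = h_b(p) - h_b(D) for D <= min(p, 1-p), and 0 otherwise."""
    if D >= min(p, 1 - p):
        return 0.0
    return h_b(p) - h_b(D)

# R(D) decreases as D increases, and hits 0 once D reaches p
for D in [0.0, 0.05, 0.1, 0.2, 0.3]:
    print(f"p=0.3, D={D}: R(D) = {rd_bernoulli(0.3, D):.3f}")
```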
Example-2: $R(D)$ for a Gaussian r.v.:
Let's take a look at another example: let $X \sim \mathcal{N}(0, 1)$, i.e. the data samples are distributed as unit Gaussians. Also, let's consider the distortion to be the mean square distortion, $d(x,y) = (x-y)^2$, i.e. the MSE distortion. Then:
$$R(D) = \begin{cases} \frac{1}{2}\log_2\frac{1}{D} & \text{if } D \leq 1 \\ 0 & \text{otherwise} \end{cases}$$
Let's try to intuitively understand why this is the case:
- If the $X_i$'s are i.i.d. $\mathcal{N}(0,1)$, then it is clear that with high probability:
  $$\sum_{i=1}^{k} X_i^2 \approx k$$
- This mainly follows from the law of large numbers, and the fact that the variance of each $X_i$ is $1$. Thus, we can say that $X^k = (X_1, \ldots, X_k)$ will lie inside a $k$-dimensional sphere of radius $\sqrt{k}$, with high probability.
  For the remaining discussion, we can thus focus our attention only on vectors $X^k$ lying inside this sphere of radius $\sqrt{k}$.
- Now, let's say we want to cover this $k$-dimensional sphere of radius $\sqrt{k}$ with $k$-dimensional spheres of radius $\sqrt{kD}$. How many spheres do we need?
  We can approximate this number $N$ based on the volumes of the spheres:
  $$N \approx \frac{\text{vol}\big(\text{sphere of radius } \sqrt{k}\big)}{\text{vol}\big(\text{sphere of radius } \sqrt{kD}\big)} = \frac{(\sqrt{k})^k}{(\sqrt{kD})^k} = \left(\frac{1}{D}\right)^{k/2}$$
  Although this approximation might feel very loose, it can be shown that as the dimension $k$ increases, the number of spheres of radius $\sqrt{kD}$ required to cover the sphere of radius $\sqrt{k}$ is indeed approximately equal to:
  $$N \approx \left(\frac{1}{D}\right)^{k/2}$$
- Note that we can use the centers of these spheres of radius $\sqrt{kD}$ as the centroids for vector quantization, as we saw in the last lecture, and we would get a per-symbol distortion of at most $D$, as the squared distance between any point in the radius-$\sqrt{k}$ sphere and its nearest centroid is at most $kD$.
- Thus our $R(D)$, the rate for distortion at most $D$, is:
  $$R(D) = \frac{\log_2 N}{k} = \frac{1}{2}\log_2\frac{1}{D}$$
Hopefully this hand-wavy "proof" gives some intuition for the $R(D)$ function for the unit Gaussian. The proof logic can however be made more precise.
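To see the covering arithmetic in one place, here is a tiny sketch (the value of $D$ is arbitrary) plugging the estimate $N \approx (1/D)^{k/2}$ into the rate formula; note that the resulting rate per symbol does not depend on $k$:

```python
import numpy as np

# Sphere-covering estimate of the rate: N ~ (1/D)^(k/2) codewords,
# so the rate is (1/k) * log2(N) = 0.5 * log2(1/D), independent of k.
D = 0.25  # target per-symbol MSE (arbitrary choice)
for k in [2, 8, 32, 128]:
    log2_N = (k / 2) * np.log2(1 / D)  # log2 of the number of covering spheres
    rate = log2_N / k
    print(f"k={k:4d}: rate = {rate:.3f} bits/symbol")  # always 0.5*log2(1/D) = 1.0 here
```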
NOTE: A similar proof idea holds for general distributions, using balls of typical sequences. We won't be able to go much into the details of Shannon's lossy compression theorem in this course unfortunately, but here are lecture notes in case you are interested: EE376a Lossy compression notes.
We can also experiment with this idea. Below is the practical vector quantization performance for the unit Gaussian, compared against the optimal $R(D)$ value. We see that the R-D performance is quite reasonable even with small $k$.
We can also see the convergence as $k$ increases:
```
## Rate: 1, optimal_mse: 0.25
k: 1, N: 2, Rate: 1, mse_loss: 0.37
k: 2, N: 4, Rate: 1, mse_loss: 0.36
k: 4, N: 16, Rate: 1, mse_loss: 0.33
k: 8, N: 256, Rate: 1, mse_loss: 0.29
...
```
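The numbers above can be reproduced approximately with a simple k-means based vector quantizer. Below is a rough sketch of such an experiment (assuming `scikit-learn` is available; the exact MSE values depend on the training set size and the k-means initialization, so they will only roughly match the table):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
RATE = 1  # bits per symbol, so N = 2^(RATE * k) codevectors for blocks of size k
print(f"Rate: {RATE}, optimal_mse: {2 ** (-2 * RATE):.2f}")  # inverting R(D): D = 2^(-2R)

for k in [1, 2, 4, 8]:
    N = 2 ** (RATE * k)
    # train a codebook of N centroids on i.i.d. N(0,1) blocks of dimension k
    train = rng.standard_normal((20_000, k))
    codebook = KMeans(n_clusters=N, n_init=5, random_state=0).fit(train)
    # quantize fresh test data and measure the per-symbol MSE
    test = rng.standard_normal((20_000, k))
    recon = codebook.cluster_centers_[codebook.predict(test)]
    mse = np.mean((test - recon) ** 2)
    print(f"k: {k}, N: {N}, Rate: {RATE}, mse_loss: {mse:.2f}")
```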
Achieving the R(D) in general
We saw briefly how we can achieve the optimal $R(D)$ function using vector quantization for data distributed as i.i.d. unit Gaussians.
The broad idea of using vector quantization can actually be shown to be asymptotically optimal for any data distribution, i.e. as the dimension of the data increases, vector quantization can achieve the optimal rate-distortion performance.
The convergence w.r.t. $k$ can however be slow. In the next lecture we will see how we can accelerate this convergence.