Lossy compression theory

We continue with lossy compression and the rate-distortion function (Shannon's RD theory), and its applications in current technologies.

We'll start with things that seem unrelated but we'll bring it all together towards the end. We'll only touch on some of the topics, but you can learn more in the references listed below and in EE 276.

We'll start with the rate distortion function and see how it carries over to sources with memory. We'll also look into Gaussian sources with memory. Finally, we'll look at the implications for transform coding, which is commonly used today.

Reminder from linear algebra:

Consider $Y = TX$ (matrix-vector product), where $T$ is an $n \times n$ matrix and $X \in \mathbb{R}^n$.

Then the square of the Euclidean norm (sum of squares of the components), which also denotes the energy in the signal, is $\|Y\|^2 = \sum_{i=1}^{n} Y_i^2 = Y^T Y = X^T T^T T X$.

In particular, if $T$ is a unitary transformation, i.e., all rows and columns are orthonormal vectors so that $T^T T = T T^T = I$, then we have $\|Y\|^2 = X^T T^T T X = \|X\|^2$. This is called Parseval's theorem, which you might have seen for the Fourier transform. In words, this says that the energy in the transform domain matches the energy in the original domain.

If $Y_1 = TX_1$ and $Y_2 = TX_2$, then $\|Y_1 - Y_2\| = \|X_1 - X_2\|$. That is to say, a unitary transformation preserves Euclidean distances between points.
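As a quick numerical check, here is a minimal NumPy sketch (the random orthonormal matrix built via QR and the test vectors are arbitrary illustrative choices) verifying that a unitary transform preserves both energy and pairwise distances:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8

# Build a random orthonormal (unitary) matrix T via QR decomposition.
T, _ = np.linalg.qr(rng.standard_normal((n, n)))

x1 = rng.standard_normal(n)
x2 = rng.standard_normal(n)
y1, y2 = T @ x1, T @ x2

# Parseval: energy is preserved by the transform.
assert np.isclose(np.sum(y1**2), np.sum(x1**2))

# Euclidean distances between points are preserved as well.
assert np.isclose(np.linalg.norm(y1 - y2), np.linalg.norm(x1 - x2))
```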

Lossy compression recap

Recall the setting of lossy compression where the information is lossily compressed into an index (equivalently a bit stream representing the index). The decoder attempts to produce a reconstruction of the original information.
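Concretely, one standard way to write this down (the maps $f$ and $g$ below are notation for the encoder and decoder, for block length $k$ and rate $R$) is

$$f: \mathcal{X}^k \to \{1, 2, \ldots, 2^{kR}\}, \qquad g: \{1, 2, \ldots, 2^{kR}\} \to \hat{\mathcal{X}}^k, \qquad \hat{X}^k = g(f(X^k)).$$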

The two metrics for lossy compression are:

  • rate $R$ in bits/source component
  • distortion $E[d(X^k, \hat{X}^k)]$ [single-letter distortion: the distortion between $k$-tuples is defined in terms of the distortion between components, $d(x^k, \hat{x}^k) = \frac{1}{k}\sum_{i=1}^{k} d(x_i, \hat{x}_i)$]

Transform coding

Notation: we denote $X^k = (X_1, \ldots, X_k)$ simply as $X$, which can be thought of as a column vector.

Here we simply work with an arbitrary transform $T$, with the only requirement being that $T$ is invertible and we are able to efficiently compute $T$ and $T^{-1}$. In this framework, we simply apply our usual lossy encoding in the transform domain $Y = TX$ rather than in the original domain.

In particular, consider $Y = TX$ for some unitary $T$ (e.g., Fourier transform, wavelet transform). Then for any reconstruction $\hat{Y}$ of $Y$, taking $\hat{X} = T^{-1}\hat{Y}$ gives

$$\|X - \hat{X}\|^2 = \|Y - \hat{Y}\|^2.$$

This corresponds to the squared-error distortion: whatever lossy compression you do on $Y$, you get the same squared-error distortion for the original sequence $X$ as for $Y$.

Why work in the transform domain? Often the data is simpler to model in the transform domain, e.g., we can construct the transform in a way that the statistics of $Y$ are simpler or we get sparsity. Then we can appropriately design a lossy compressor to exploit this structure.
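To make this concrete, here is a minimal sketch (the orthonormal DCT-II matrix and the uniform quantizer step are just illustrative choices, not a prescribed scheme) showing that lossy compression applied in the transform domain yields exactly the same squared-error distortion in the original domain:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 64
x = np.cumsum(rng.standard_normal(n))  # a correlated ("smooth") signal

# Orthonormal DCT-II matrix, a unitary transform built explicitly.
idx = np.arange(n)
T = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * idx[None, :] + 1) * idx[:, None] / (2 * n))
T[0, :] /= np.sqrt(2.0)

y = T @ x                          # transform-domain coefficients
step = 0.5
y_hat = step * np.round(y / step)  # simple uniform quantization of the coefficients
x_hat = T.T @ y_hat                # inverse transform (T is orthonormal, so T^{-1} = T^T)

# Squared-error distortion is identical in both domains.
print(np.mean((x - x_hat) ** 2), np.mean((y - y_hat) ** 2))
```

For a correlated signal like this, most of the energy concentrates in a few transform coefficients, which is exactly the kind of structure a lossy compressor can exploit.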

Nowadays people even go beyond linear transforms, e.g., learned transforms using deep learning models. The transform can even map to a vector in a smaller-dimensional space, e.g., in VAE-based lossy encoders. This can allow doing very simple forms of lossy compression in the transform domain.

Shannon's theorem recap

For "memoryless sources" ( are iid ~),

We sometimes write $R_X(D)$ to represent this quantity when we want to be explicit about the source in question.
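This minimization can also be computed numerically for discrete sources. Below is a minimal sketch of the classic Blahut-Arimoto iteration (this algorithm is not derived in these notes; the Bernoulli example and the slope parameter `s` are illustrative choices):

```python
import numpy as np

def rate_distortion_point(p_x, d, s, iters=500):
    """One point on R(D) for source p_x and distortion matrix d[x, xhat],
    traced out by the Lagrangian slope parameter s > 0 (Blahut-Arimoto)."""
    n_x, n_xh = d.shape
    q = np.full(n_xh, 1.0 / n_xh)            # output marginal q(xhat)
    for _ in range(iters):
        # Optimal test channel for the current output marginal.
        W = q[None, :] * np.exp(-s * d)       # unnormalized Q(xhat | x)
        W /= W.sum(axis=1, keepdims=True)
        q = p_x @ W                           # update the output marginal
    D = np.sum(p_x[:, None] * W * d)
    R = np.sum(p_x[:, None] * W * np.log2(W / q[None, :]))  # rate in bits
    return R, D

# Example: Bernoulli(0.5) source with Hamming distortion,
# where R(D) = 1 - h_2(D) for 0 <= D <= 1/2.
p_x = np.array([0.5, 0.5])
d = np.array([[0.0, 1.0], [1.0, 0.0]])
print(rate_distortion_point(p_x, d, s=3.0))
```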

Beyond memoryless sources

Consider source $X^n$ and reconstruction $\hat{X}^n$. Then,

$$R_{X^n}(D) = \min_{p(\hat{x}^n|x^n):\, E[d(X^n, \hat{X}^n)] \le D} I(X^n; \hat{X}^n).$$

Just like $R_X(D)$ was the analog of the entropy of $X$, $R_{X^n}(D)$ is the analog of the entropy of the $n$-tuple.

Now assume we are working with a process $X_1, X_2, \ldots$ which is stationary. Then we can define

$$R(D) = \lim_{n \to \infty} \frac{1}{n} R_{X^n}(D).$$

Similar to our study of entropy rate, we can show that this limit exists.

Shannon's theorem for lossy compression carries over in this generality. That is, the best you can do for stationary processes, in the limit of encoding arbitrarily many symbols in a block, is $R(D)$ bits per source symbol at distortion $D$.

Rate distortion for Gaussian sources

Note: for the remainder of this discussion, we'll stick to the squared-error distortion $d(x, \hat{x}) = (x - \hat{x})^2$.

Why work with Gaussian sources? The Gaussian is a good worst-case assumption if you only know the first- and second-order statistics of your source. This holds both for estimation and for lossy compression.

For $X \sim \mathcal{N}(0, \sigma^2)$, denote the rate-distortion function by $R_{\sigma^2}(D)$.

Recall from last lecture that $R_{\sigma^2}(D) = \frac{1}{2}\log\frac{\sigma^2}{D}$ for $D \le \sigma^2$ (above $\sigma^2$ it is just $0$).

We can compactly write this as $R_{\sigma^2}(D) = \frac{1}{2}\log^+\frac{\sigma^2}{D}$, where $\log^+(x) = \max(\log x, 0)$. This is shown in the figure below.
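For example (a quick numeric check, with logs in base 2 so the rate is in bits): for $\sigma^2 = 1$,

$$R_{\sigma^2}(0.25) = \frac{1}{2}\log_2\frac{1}{0.25} = 1 \text{ bit/symbol}, \qquad \text{while } R_{\sigma^2}(D) = 0 \text{ for any } D \ge 1.$$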

Similarly, for $X_1 \sim \mathcal{N}(0, \sigma_1^2)$ and $X_2 \sim \mathcal{N}(0, \sigma_2^2)$ independent, denote the rate-distortion function of the pair (in bits per component, under the average squared-error distortion) by $R_{\sigma_1^2, \sigma_2^2}(D)$.

It can be shown that

$$R_{\sigma_1^2, \sigma_2^2}(D) = \min_{D_1, D_2 \ge 0:\, \frac{D_1 + D_2}{2} \le D} \frac{1}{2}\left[R_{\sigma_1^2}(D_1) + R_{\sigma_2^2}(D_2)\right].$$

Another way to write this is

$$R_{\sigma_1^2, \sigma_2^2}(D) = \min_{D_1, D_2 \ge 0:\, \frac{D_1 + D_2}{2} \le D} \frac{1}{2}\left[\frac{1}{2}\log^+\frac{\sigma_1^2}{D_1} + \frac{1}{2}\log^+\frac{\sigma_2^2}{D_2}\right].$$

Intuition: the result is actually quite simple. The solution just greedily optimizes the $X_1$ and $X_2$ cases separately (decoupled), finding the optimal split of the distortion budget between $D_1$ and $D_2$.

Using convex optimization, we can show that the minimum is achieved by a reverse water-filling scheme, which is expressed in equations as follows:

For a given parameter $\theta > 0$, a point on the optimal rate-distortion curve is achieved by setting

  • $D_i = \min(\theta, \sigma_i^2)$ for $i = 1, 2$, with overall distortion $D = \frac{D_1 + D_2}{2}$

and the rate is given by

$$R(D) = \frac{1}{2}\left[\frac{1}{2}\log^+\frac{\sigma_1^2}{\theta} + \frac{1}{2}\log^+\frac{\sigma_2^2}{\theta}\right].$$

This can be expressed in figures as follows (assuming without loss of generality that $\sigma_1^2 \ge \sigma_2^2$):

When $\theta$ is smaller than both $\sigma_1^2$ and $\sigma_2^2$, we choose both $D_1$ and $D_2$ to be equal to $\theta$ ($D = \theta$ in this case). We assign equal distortion to the two components, and higher rate for the component with higher variance.

When $\theta$ exceeds $\sigma_2^2$ but is below $\sigma_1^2$, we set $D_2$ to be $\sigma_2^2$, and choose $D_1 = \theta$ such that the average distortion is $D = \frac{\theta + \sigma_2^2}{2}$. The idea is that setting $D_2$ higher than $\sigma_2^2$ doesn't make sense since the rate is already $0$ for that component.

When $\theta$ is equal to $\sigma_1^2$ (or larger), we can just set $D_1 = \sigma_1^2$ and $D_2 = \sigma_2^2$. Here the rate is $0$ for both components!

This generalizes beyond $2$ components. For $X_1, \ldots, X_k$ independent with $X_i \sim \mathcal{N}(0, \sigma_i^2)$, we define $R_{\sigma_1^2, \ldots, \sigma_k^2}(D)$ analogously, and can very similarly show that

$$R_{\sigma_1^2, \ldots, \sigma_k^2}(D) = \min_{\frac{1}{k}\sum_{i=1}^{k} D_i \le D}\; \frac{1}{k}\sum_{i=1}^{k} \frac{1}{2}\log^+\frac{\sigma_i^2}{D_i}.$$

Similar to before, the minimum is given by $D_i = \min(\theta, \sigma_i^2)$, with

$$R(D) = \frac{1}{k}\sum_{i=1}^{k} \frac{1}{2}\log^+\frac{\sigma_i^2}{\theta}.$$
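The reverse water-filling solution is easy to compute numerically. Below is a minimal sketch (the bisection search on the water level $\theta$ and the function name `reverse_waterfill` are my own choices, not from the notes):

```python
import numpy as np

def reverse_waterfill(variances, D, tol=1e-12):
    """Rate (bits/component) for independent Gaussians under average
    squared-error distortion D > 0, via reverse water-filling."""
    variances = np.asarray(variances, dtype=float)
    if D >= variances.mean():
        return 0.0  # can set D_i = sigma_i^2 for all i, zero rate needed
    # Bisect on the water level theta so that mean(min(theta, sigma_i^2)) = D.
    lo, hi = 0.0, variances.max()
    while hi - lo > tol:
        theta = 0.5 * (lo + hi)
        if np.minimum(theta, variances).mean() > D:
            hi = theta
        else:
            lo = theta
    theta = 0.5 * (lo + hi)
    return np.mean(0.5 * np.log2(np.maximum(variances / theta, 1.0)))

# Example: sigma1^2 = 4, sigma2^2 = 1, average distortion 0.5
# => theta = 0.5 and R = 1 bit/component.
print(reverse_waterfill([4.0, 1.0], D=0.5))
```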

Rate-distortion for stationary Gaussian source

Going back to a zero-mean Gaussian process: for any unitary transformation $T$, if $Y^n = TX^n$ then we can show that $R_{X^n}(D) = R_{Y^n}(D)$ [since the distortion is the same in both domains]. Recall that by using the transformation it's possible to go from a scheme for compressing $Y^n$ to a scheme for compressing $X^n$ (and vice versa) without any change in the distortion.

Therefore, we can take $T$ to be the diagonalizing unitary matrix which converts $X^n$ to a $Y^n$ such that $Y^n$ has independent components. The variances of the components of $Y^n$ will be the eigenvalues of the covariance matrix of $X^n$.

Thus, we have

$$\frac{1}{n} R_{X^n}(D) = R_{\lambda_1, \ldots, \lambda_n}(D) = \min_{\frac{1}{n}\sum_{i=1}^{n} D_i \le D}\; \frac{1}{n}\sum_{i=1}^{n} \frac{1}{2}\log^+\frac{\lambda_i}{D_i},$$

where the $\lambda_i$'s are the eigenvalues of the covariance matrix of $X^n$.

When $X_1, \ldots, X_n$ are the first $n$ components of a stationary Gaussian process, the covariance matrix $\Sigma^{(n)}$ is Toeplitz: $\Sigma^{(n)}_{ij} = K_{i-j}$ for $1 \le i, j \le n$, with $K_m = E[X_1 X_{1+m}]$. Then we have $\frac{1}{n} R_{X^n}(D) = R_{\lambda^{(n)}}(D)$, where $\lambda^{(n)} = (\lambda_1^{(n)}, \ldots, \lambda_n^{(n)})$ is the vector of eigenvalues of $\Sigma^{(n)}$.

Now, we use a theorem to show a profound result for Gaussian processes.

Theorem (Toeplitz distribution) Let $S(\omega) = \sum_{m=-\infty}^{\infty} K_m e^{-jm\omega}$ be the spectral density of the process and let $g$ be a continuous function. Then

$$\lim_{n \to \infty} \frac{1}{n}\sum_{i=1}^{n} g\big(\lambda_i^{(n)}\big) = \frac{1}{2\pi}\int_{-\pi}^{\pi} g(S(\omega))\, d\omega.$$

Specializing this theorem to $g(x) = \min(\theta, x)$ and to $g(x) = \frac{1}{2}\log^+\frac{x}{\theta}$, we get:

The rate-distortion function of a stationary Gaussian process with spectral density $S(\omega)$ is given parametrically by

$$D(\theta) = \frac{1}{2\pi}\int_{-\pi}^{\pi} \min(\theta, S(\omega))\, d\omega, \qquad R(\theta) = \frac{1}{2\pi}\int_{-\pi}^{\pi} \frac{1}{2}\log^+\frac{S(\omega)}{\theta}\, d\omega.$$
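As a numerical sanity check (a sketch; the AR(1)-style autocovariance $K_m = \rho^{|m|}$, the value $\rho = 0.8$, and the water level $\theta = 0.3$ are arbitrary illustrative choices), the finite-$n$ eigenvalue computation approaches the spectral-domain integrals:

```python
import numpy as np

rho, n, theta = 0.8, 512, 0.3

# Finite n: eigenvalues of the Toeplitz covariance with K_m = rho^|m|.
idx = np.arange(n)
Sigma = rho ** np.abs(idx[:, None] - idx[None, :])
lam = np.linalg.eigvalsh(Sigma)
D_n = np.mean(np.minimum(theta, lam))
R_n = np.mean(0.5 * np.log2(np.maximum(lam / theta, 1.0)))

# Spectral domain: S(w) = (1 - rho^2) / (1 - 2 rho cos(w) + rho^2) for this K_m.
w = np.linspace(-np.pi, np.pi, 200001)
S = (1 - rho**2) / (1 - 2 * rho * np.cos(w) + rho**2)
D_spec = np.trapz(np.minimum(theta, S), w) / (2 * np.pi)
R_spec = np.trapz(0.5 * np.log2(np.maximum(S / theta, 1.0)), w) / (2 * np.pi)

print(D_n, D_spec)  # these should be close for large n
print(R_n, R_spec)
```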

This is shown in the figure below, suggesting that the reverse water-filling idea extends to Gaussian processes once we move to the continuous spectral domain! This gives us motivation for working in the Fourier transform domain!

Finally, for $D \le \min_\omega S(\omega)$ (small enough distortion), we can show that $R(D) = \frac{1}{2}\log\frac{\sigma^2_{\text{innov}}}{D}$, where $\sigma^2_{\text{innov}}$ is the variance of the innovations of the process (its one-step prediction error). This can be used to justify predictive coding ideas.
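To see this from the parametric form above (a short derivation sketch; the last step uses the Kolmogorov-Szegő formula, which identifies the geometric mean of $S(\omega)$ with the innovation variance): when $D \le \min_\omega S(\omega)$, the water level is $\theta = D$, the $\log^+$ never clips, and

$$R(D) = \frac{1}{2\pi}\int_{-\pi}^{\pi} \frac{1}{2}\log\frac{S(\omega)}{D}\, d\omega = \frac{1}{2}\log\frac{\sigma^2_{\text{innov}}}{D}, \qquad \sigma^2_{\text{innov}} = \exp\left(\frac{1}{2\pi}\int_{-\pi}^{\pi} \ln S(\omega)\, d\omega\right).$$

The middle equality just pulls the geometric mean of $S(\omega)$ outside the logarithm.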

Reference

For more details on this, you can read the survey paper "Lossy source coding" by Berger and Gibson available at https://ieeexplore.ieee.org/document/720552.