EE274 (Fall 25): Homework-1

Focus area: Prefix-free codes, Basic Information theory
Due Date: Oct 14, midnight (11:59 PM)
Weightage: 15%
Total Points: 135 (Written 100 + Code 35)
Submission Instructions:
- Provided at the end of HW (ensure you read these!)
- We will be using both manual and auto-grading for the assignment submissions. Please follow the submission instructions carefully to ensure you get full credit for your submissions.
Submission link:
- For written part: HW1-Written (100 points)
- For programming part: HW1-Code (35 points)
Please post any questions on Ed, or feel free to contact the instructors during the office hours. We will be supporting Linux and Mac OS.
- We strongly recommend Windows users to use Windows Subsystem Linux (WSL). Refer to this instruction for setup. Follow the steps up to Update and upgrade packages. If you are using VS Code, continue with Set up your favorite code editor - Use Visual Studio Code.

NOTE: The homework looks longer that it actually is! We try to be verbose with the questions for clarity.

0. Python, SCL setup (10 points)

We will be using python as the programming language during the course for the assignments and for code demonstrations during the course. If you are not familiar with python, please take a look at the python tutorial contributed by our friends from the CS231n course.

Usually it is more convenient and safer to install python packages in a virtual environment. We will be using conda to do this

Install miniconda (https://docs.conda.io/en/latest/miniconda.html). Please follow the instructions to install miniconda on your machine
Create a virtual conda env using the python version 3.9 (higher versions should be also ok)
```
# create a new conda environment
conda create --name ee274_env python=3.9

# activate the new conda environment
conda activate ee274_env
```
Once you complete this step, you should see a prompt such as:
```
(ee274_env) ➜   
```
This indicates that you are now in a virtual environment we created, which uses python 3.9. The virtualenv is useful to install packages etc. in an isolated environment without corrupting the rest of the machine. More details: https://uoa-eresearch.github.io/eresearch-cookbook/recipe/2014/11/20/conda/

Please run the following command to check the python version installed. Submit the output of this command in gradescope for credit.
```
(ee274_env) ➜ python --version
```
For most of the course assignments and in-class demonstrations, we will be using the Stanford Compression Library (SCL).

Please clone the library as follows:
```
git clone https://github.com/kedartatwawadi/stanford_compression_library.git
cd stanford_compression_library
```
Install the library. This installs the SCL library in an editable way so you do not need to reinstall when you modify any files. This also installs any dependencies of SCL.
```
pip install -e .
```
Once this is done let us ensure the library is installed correctly by running unit tests. Submit the output of this command in gradescope for credit. The results may include both passed and xfailed tests, but there should be no failed tests if you run the command on the main branch.
```
find scl -name "*.py" -exec py.test -s -v {} +
```

NOTE: We use the pytest framework to run tests in SCL. Each file of SCL also contains the tests for the block of code implemented in the file. The test functions are named as def test_xxx(...). For example, the file core/data_block.py contains a class DataBlock and a test test_data_block_basic_ops() to test its operations. Usually these tests are a great way to understand how the block of code in the file works. You can run individual tests as:

py.test -s -v scl/core/data_block.py

Feel free to look at different compressors in the compressors/ folder, and the tests to get yourself familiar with them!

We will be using EE274_Fall25/HWs branch of SCL for releasing HWs. Checkout to the branch for getting access to HW problem templates.
```
git checkout EE274_Fall25/HWs
```

You should see a scl/HWs/HW1 folder containing template files for the rest of the assignment. Going ahead, we will refer to files relative to scl folder in your stanford_compression_library folder. For example, HWs/HW1/hw1_p1.py refers to the file stanford_compression_library/scl/HWs/HW1/hw1_p1.py.

For more details about SCL and to get comfortable with it, we also created a SCL tutorial. You can find the SCL tutorial linked here.

1. Probability and Coin Toss (20 points)

Kedar is a big Cricket fan. In cricket (like many other sports), there is a coin toss in the beginning to determine which team gets the first chance to bat. However, Kedar thinks that the coin being used is biased and lands as Heads (H) much more often than tails (T). Let the probability of the coin landing on H be $p$ .

Shubham suggests to Kedar that this problem can be solved (i.e. we can get a fair coin) by tossing the coin twice! If we get a H,Tsequence in the two tosses, then we mark it as a heads, while if we get a T,H sequence we can mark this as a tails. If we get a H,H or T,T then discard the result and start over!

[5 points] Do you think Shubham's solution is right? Please justify.
[5 points] What is the expected number of coin flips needed to get one H or T outcome using Shubham's method?
[5 points] Kedar is still not convinced and thinks that we should really implement this idea to see if it works. Please help out Shubham by implementing the function shubhams_fair_coin_generator() in the file HWs/HW1/hw1_p1.py. NOTE: please run all of this in the (ee274_env) environment which you created and that you are in the correct branch of the SCL repo (EE274_Fall23/HWs). Ensure that the tests except those in HWs folder pass before you start implementing the function.
[5 points] It is always a good practice to write tests to ensure the implementation is correct (even if you think theoretically it is alright :)). Please implement test_shubhams_fair_coin_generator() function in HWs/HW1/hw1_p1.py. You can test your code by running the following command in stanford_compression_library/scl folder:
```
py.test -s -v HWs/HW1/hw1_p1.py
```
(NOT GRADED, THINK FOR FUN!): Do you think Shubham's solution is optimal? For example, Shubham is using at least 2 biased coin tosses to generate 1 fair coin toss. This seems wasteful in some cases (for example, if the coin is almost unbiased). Can you improve upon the average number of biased coin tosses needed to generate one fair coin toss?

Q2: Basic Information theory (20 points)

[5 points] Let $X$ be a random variable over positive integers with a distribution $P_{X} (X = k) = 2^{- k}, k \geq 1$ Compute the entropy $H (X)$ of random variable $X$ . What are the optimal code-lengths for a prefix-free code designed for distribution $P_{X}$ ?
[5 points] Provide an optimal prefix-free code design for the distribution $P_{X}$ .
[10 points] Let $Y$ be a random variable over positive integers such that $E (Y) = 2$ . Then show that: $H (Y) \leq 2$ For what distribution of the random variable $Y$ does the equality hold, i.e. $H (Y) = 2$ ?

HINT: KL-Divergence and it's properties will be helpful to show this.

Q3. Uniquely decodable codes (25 points)

We mainly explored prefix-free codes in class. But there is a much broader class of codes, the uniquely decodable codes. A code is uniquely decodable if no two sequences (of length 1 or more) of input symbols (say $x^{n}$ and $y^{m}$ ) can produce the same encoding, and hence one can uniquely determine the input sequence from the encoding. In this question, we will show that even with the uniquely decodable codes, we cannot do better than the Entropy.

[5 points] Consider the code C1 below. Is C1 uniquely decodable? Implement a decoder for the code below. Briefly justify your decoding procedure.
```
A -> 10
B -> 00
C -> 11
D -> 110
```

Consider an alphabet $U$ and a uniquely decodable code with code lengths for $u \in U$ denoted by $L (u)$ . Also, we denote the codelength of the n-tuple $u^{n}$ as $L (u^{n}) = \sum_{i = 1}^{n} L (u_{i})$ .

HINT: For parts Q3.2 and Q3.3, you might find it easier to gain intuition by starting with small values of $∣ U ∣$ and $n$ (e.g., binary alphabet with $n = 1$ or $2$ ) and manually expanding the summations.

[5 points] Show that $(u \in U \sum 2^{- L (u)})^{n} = u^{n} \in U^{n} \sum 2^{- L (u^{n})}$
[5 points] Let $l_{ma x} = max_{u \in U} L (u)$ . Then we can rewrite the summation as: $u^{n} \in U^{n} \sum 2^{- L (u^{n})} = j = 1 \sum n \cdot l_{ma x} ∣ {u^{n} ∣ L (u^{n}) = j} ∣ \cdot 2^{- j}$

NOTE: Rewriting summation is a common proof trick, and is a useful one to watch out for! Using Q3.2 and the identity above, show that:

$(u \in U \sum 2^{- L (u)})^{n} \leq n \cdot l_{ma x}$
[5 points] Using Q3.3 show that uniquely decodable codes satisfy Kraft's inequality i.e.

$(u \in U \sum 2^{- L (u)}) \leq 1$

[5 points] Now show that for any uniquely decodable code for $U$ with codeword lengths $L (U)$ , $E [L (U)] \geq H (U) .$ NOTE: We saw a similar proof for prefix-free codes in the class!

Also argue that for any uniquely decodable code, it is possible to construct a prefix-free code with the same codeword lengths.

This means that uniquely decodable codes generally do not provide any benefits over prefix-free codes and instead have a more complicated decoding procedure!

Q4: Shannon Code(s) (25 points)

In class, we saw one version of the Shannon codes; the tree-based construction.

[5 points] Let's call it the ShannonTreeEncoder and the ShannonTreeDecoder. Manually calculate what the codewords should be for the following distributions:

ProbabilityDist({"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25})
ProbabilityDist({"A": 0.5, "B": 0.25, "C": 0.12, "D": 0.13})
ProbabilityDist({"A": 0.9, "B": 0.1})

[10 points] Complete the code for the ShannonTreeEncoder in hw1_p4.py. Also, complete the test_shannon_tree_coding_specific_case() to check the correctness of your codewords in part 1 against your implemented code. If you encounter errors, you should identify the failing test case and fix any issues. Note: we will be using automated tests to grade your code, so it is important that you do not change the function signatures. We will have more test cases for grading, so feel free to add more test cases of your own of your own when working on the problem.

We have also studied the tree-parsing based decoding for Shannon codes (or prefix-free codes in general). This decoding has already been implemented in the ShannonTreeDecoder. The core tree-parsing logic can be found in here. The tree-parsing based decoder works well however includes too many if/else conditions which leads to branching -- this is pretty bad for modern hardware. (Here is some additional information about this inefficiency for those interested.)

In this next subproblem, we will implement another decoder which we call the ShannonTableDecoder. The ShannonTableDecoder works by creating a decoding lookup table and then using it for decoding symbols without branching.

As an example, lets consider the following simple code:

encoding_table = {"A": 0, "B": 10, "C": 110, "D": 111}

In this case, the ShannonTableDecoder keeps track of data using following tables:

decoding_table = {
    000: "A",
    001: "A",
    010: "A",
    011: "A",
    100: "B",
    101: "B",
    110: "C",
    111: "D",
}
codelen_table = {
    "A": 1,
    "B": 2,
    "C": 3,
    "D": 3,
}
max_codelen = 3

The tables can be then used for decoding as per the pseudo-code below.

def decode_symbol_table(encoded_bitarray):
    state = encoded_bitarray[:max_codelen] 
    decoded_symbol = decoding_table[str(state)]
    num_bits_consumed = codelen_table[decoded_symbol]
    return decoded_symbol, num_bits_consumed

[5 points] Based on the pseudo-code above, explain in a few sentences how the table based decoding works. Why do you need to output both the decoded symbol and the number of bits consumed?
[5 points] Complete the ShannonTableDecoder in hw1_p4.py by implementing the create_decoding_table function. You can verify if your implementation is correct by using test_shannon_table_coding_end_to_end function.

Q5. Huffman coding on real data (30 points)

In class, we have studied compression for sources where we knew the distribution of data beforehand. In most real-life scenarios, this is not the case. Often times, we use the empirical distribution of the data and design a code for this distribution. Then we encode the data with this distribution. For the decoding to work, we also need to transmit the distribution or the code in some way. This problem focuses on compression of a provided file with Huffman coding.

Before we start with the code, let's look at how important it is to transmit the distribution faithfully to achieve lossless compression.

[10 points] Consider the probability distribution represented as a mapping from a symbol to its probability: {A: 0.11, B: 0.09, C: 0.09, D: 0.71}.

a. What are the codeword lengths for the symbols in the Huffman code?

After a harrowing cab experience in the US, Pulkit wants to send a codeword BADCAB to Shubham to describe his frustration. Shubham and Pulkit had agreed to use Huffman encoding for such communications in the past. He encodes the codeword BADCAB using a Huffman code for the original distribution {A: 0.11, B: 0.09, C: 0.09, D: 0.71}, and sends both the codeword and this distribution to Shubham (so that Shubham is able to decode it on his end by building a Huffman tree on his own).

b. Unfortunately, Shubham decodes the message to be CADBAC instead of BADCAB making him confused. What might have gone wrong during this communication between the two?

HINT: notice the specific encoded and decoded symbols in the message.

Pulkit-Shubham realized this issue and fixed it.

c. In the spirit of frugality, Pulkit decides to transmit a rounded distribution with just one significant decimal ({A: 0.1, B: 0.1, C: 0.1, D: 0.7}). What are the codeword lengths for the symbols in the Huffman code for this distribution? Are there multiple possible Huffman code lengths? Why?

These issues suggest that it is important to transmit the distribution faithfully to achieve lossless compression. In addition, the encoder and the decoder must share the same deterministic algorithm to construct the Huffman tree from the distribution.

With this context, look at the provided starter code in hw1_p5.py which does the following:

Loop over the file in blocks of size 50 KB.
For each block, compute the empirical distribution from the data. This simply means we get the counts of each possible byte value ${0, 1, 2, \dots, 255}$ in the block and normalize by the block length to get an approximate empirical probability distribution.
Apply Huffman encoding on the block assuming this empirical distribution.
Encode the empirical probability distribution in bits so that decoder knows what probability distribution to use for decoding.
Final encoding is the encoding from step 4 concatenated with the encoding from step 3.

To understand the overall process, you should go through the provided code and specifically the encode_block and decode_block functions.

[10 points] Implement the encoding and decoding of probability distribution step above (Step 4) in the functions encode_prob_dist & decode_prob_dist respectively. There are multiple ways to encode this, and we are not looking for the most optimized implementation here. Note that we are working with 50 KB blocks here, while the alphabet size is just 256. So as long as your implementation is not absurdly bad, the overhead will be small. You might find the following pointers useful as you implement the function:
- The input bitarray to decode_prob_dist can include more than the encoding of the probability distribution itself. Thus, it should only read as much as it needs and return the number of bits read so that the Huffman decoding can start from the next bit.
- Python dictionaries are OrderedDicts. If you are using dictionaries to represent probability distribution, then it is critical to maintain this ordering while creating Huffman trees during encoding/decoding. See the Pulkit-Shubham communication fiasco above for an example of what can go wrong if the order is not preserved.
- Python's float type is equivalent to the double type in C (see this for a refresher). As long as you provide the same exact distribution to the encoder and decoder, all is well!
  - IMPORTANT: Unless you are feeling particularly adventurous, we strongly recommend using the provided helper functions float_to_bitarray64 and bitarray64_to_float for conversion between float and BitArray.
- Also, you will hear about Canonical Huffman code in class, which provides an alternative to solve some of these issues in practice. You do not need it for this question.

Verify that your implementation is correct by running py.test -s -v HWs/HW1/hw1_p5.py If you encounter errors, you should identify the failing test case and fix any issues. Note: we will be using automated tests to grade your code, so it is important that you do not change the function signatures and that your code passes these tests. We will have more test cases for grading, so feel free to add more test cases of your own when working on the problem. But these should be sufficient to get you started.

[5 points] Now download the Sherlock Holmes novel "The Hound of the Baskervilles by Arthur Conan Doyle" using the following command.
```
curl -o HWs/HW1/sherlock.txt https://www.gutenberg.org/files/2852/2852-0.txt
```
Run the following commands to compress the file using your newly developed compressor, decompress it back, and verify that the decompression succeeds.
```
python HWs/HW1/hw1_p5.py -i HWs/HW1/sherlock.txt -o HWs/HW1/sherlock.txt.huffz
python HWs/HW1/hw1_p5.py -d -i HWs/HW1/sherlock.txt.huffz -o HWs/HW1/sherlock.txt.decompressed
cmp HWs/HW1/sherlock.txt HWs/HW1/sherlock.txt.decompressed
```
Nothing should be printed on the terminal as a result of the last command if the files match.
- Report the size of your compressed file. You can run wc -c HWs/HW1/sherlock.txt.huffz to print the size.
- How much is the overhead due to storing the probability distribution? Note: One easy way to check this to replace encode_prob_dist with a version that just returns BitArray(). Obviously the decoding won't work with this change!
- Print the obtained huffman tree on sherlock.txt by using the print_huffman_tree function (commented in encode_block function of HuffmanEmpiricalEncoder). What do you observe? Does the printed Huffman tree make sense? Why or why not?
[3 points] What happens if you repeatedly use this compressor on the file (sherlock.txt -> sherlock.txt.huffz -> sherlock.txt.huffz.huffz -> ...)? Does the file size keep getting smaller? Why or why not?
[2 points] Now run gzip on the file (command: gzip < HWs/HW1/sherlock.txt > HWs/HW1/sherlock.txt.gz) and report the size of the gzip file (sherlock.txt.gz).

You will observe that gzip does significantly better than the Huffman coding, even if we ignore the overhead from part 3. While gzip uses Huffman coding internally, it also relies on other ideas that we will learn about in the coming lectures.

Q6: HW1 Feedback (5 points)

Please answer the following questions, so that we can adjust the difficulty/nature of the problems for the next HWs.

How much time did you spent on the HW in total?
Which question(s) did you enjoy the most?
Are the programming components in the HWs helping you understand the concepts better?
Did the HW1 questions complement the lectures?
Any other comments?

Submission Instructions

Please submit both the written part and your code on Gradescope in their respective submission links. We will be using both autograder and manual code evaluation for evaluating the coding parts of the assignments. You can see the scores of the autograded part of the submissions immediately. For code submission ensure following steps are followed for autograder to work correctly:

You only need to submit the modified files as mentioned in the problem statement. For example, for problem 1, you only need to submit the modified HWs/HW1/hw1_p1.py file; for problem 4, you only need to submit the modified HWs/HW1/hw1_p4.py file; and so-on.
Compress the HW1 folder into a zip file. One way to obtain this zip file is by running the following zip command in the HWs folder, i.e.
```
cd HWs
zip -r HW1.zip HW1
```
Note: To avoid autograder errors, ensure that the directory structure is maintained and that you have compressed HW1 folder containing the relevant files and not HWs folder, or files inside or something else. Ensure the name of the files inside the folder are exactly as provided in the starter code, i.e. hw1_p4.py, hw1_p5.py etc. In summary, your zip file should be uncompressed to following directory structure (with same names):
```
HW1
├── hw1_p1.py
├── hw1_p4.py
└── hw1_p5.py
```
Submit the zip file (HW1.zip obtained above) on Gradescope Programming Part Submission Link. Ensure that the autograded parts runs and give you correct scores.

Before submitting the programming part on Gradescope, we strongly recommend ensuring that the code runs correctly locally.