EE274 (Fall 23): Homework-1 Solution
- Focus area: Prefix-free codes, Basic Information theory
- Due Date: Oct 18, midnight (11:59 PM)
- Weightage: 15%
- Total Points: 135
0. Python, SCL setup (10 points)
- Output of Command in 1.2: Your output to the command `python --version` should look something like `Python 3.9.18`.
- Output of Command in 1.5: Your output to the command `find scl -name "*.py" -exec py.test -s -v {} +` should look something like:

```
========= test session starts =========
platform darwin -- Python 3.9.18, pytest-7.4.2, pluggy-1.3.0 -- /Users/pulkittandon/opt/anaconda3/envs/ee274_env_HW1/bin/python
rootdir: /Users/pulkittandon/Downloads/stanford_compression_library
collected 48 items

scl/core/encoded_stream.py::test_padder PASSED
scl/core/encoded_stream.py::test_header PASSED
scl/core/encoded_stream.py::test_encoded_block_reader_writer PASSED
scl/core/prob_dist.py::ProbabilityDistTest::test_creation_entropy PASSED
scl/core/prob_dist.py::ProbabilityDistTest::test_prob_creation_and_validation PASSED
scl/core/prob_dist.py::ProbabilityDistTest::test_sorted_prob_dist PASSED
scl/core/prob_dist.py::ProbabilityDistTest::test_validation_failure XFAIL
scl/core/data_stream.py::test_list_data_stream PASSED
scl/core/data_stream.py::test_file_data_stream PASSED
scl/core/data_stream.py::test_uint8_file_data_stream PASSED
scl/core/data_block.py::test_data_block_basic_ops PASSED
scl/utils/bitarray_utils.py::test_basic_bitarray_operations PASSED
scl/utils/bitarray_utils.py::test_get_bit_width PASSED
scl/utils/bitarray_utils.py::test_bitarray_to_int PASSED
scl/utils/bitarray_utils.py::test_float_to_bitarrays PASSED
scl/compressors/golomb_coder.py::test_golomb_encode_decode PASSED
scl/compressors/shannon_coder.py::test_shannon_coding PASSED
scl/compressors/typical_set_coder.py::test_is_typical PASSED
scl/compressors/typical_set_coder.py::test_typical_set_coder_roundtrip PASSED
scl/compressors/arithmetic_coding.py::test_bitarray_for_specific_input PASSED
scl/compressors/arithmetic_coding.py::test_arithmetic_coding avg_log_prob=1.473, avg_codelen: 1.512 avg_log_prob=1.447, avg_codelen: 1.482 avg_log_prob=1.429, avg_codelen: 1.442 avg_log_prob=2.585, avg_codelen: 2.606 PASSED
scl/compressors/arithmetic_coding.py::test_adaptive_arithmetic_coding avg_log_prob=1.473, avg_codelen: 1.512 avg_log_prob=1.447, avg_codelen: 1.492 avg_log_prob=1.429, avg_codelen: 1.462 avg_log_prob=2.585, avg_codelen: 2.612 PASSED
scl/compressors/arithmetic_coding.py::test_adaptive_order_k_arithmetic_coding k: 0, expected_bitrate=1.585, avg_codelen: 1.589 k: 1, expected_bitrate=1.585, avg_codelen: 1.591 k: 2, expected_bitrate=1.000, avg_codelen: 1.016 k: 3, expected_bitrate=1.000, avg_codelen: 1.025 PASSED
scl/compressors/range_coder.py::test_range_coding avg_log_prob=1.499, avg_codelen: 1.506 avg_log_prob=1.484, avg_codelen: 1.490 avg_log_prob=1.430, avg_codelen: 1.436 avg_log_prob=0.000, avg_codelen: 0.006 PASSED
scl/compressors/prefix_free_compressors.py::test_build_prefix_free_tree_from_code PASSED
scl/compressors/universal_uint_coder.py::test_universal_uint_encode_decode PASSED
scl/compressors/universal_uint_coder.py::test_universal_uint_encode PASSED
scl/compressors/shannon_fano_elias_coder.py::test_shannon_fano_elias_coding PASSED
scl/compressors/lz77.py::test_lz77_encode_decode PASSED
scl/compressors/lz77.py::test_lz77_sequence_generation PASSED
scl/compressors/lz77.py::test_lz77_multiblock_file_encode_decode PASSED
scl/compressors/tANS.py::test_generated_lookup_tables PASSED
scl/compressors/tANS.py::test_check_encoded_bitarray PASSED
scl/compressors/tANS.py::test_tANS_coding tANS coding: avg_log_prob=1.499, tANS codelen: 1.502 tANS coding: avg_log_prob=0.815, tANS codelen: 0.819 tANS coding: avg_log_prob=1.422, tANS codelen: 1.427 PASSED
scl/compressors/fano_coder.py::test_fano_coding PASSED
scl/compressors/elias_delta_uint_coder.py::test_elias_delta_uint_encode_decode PASSED
scl/compressors/elias_delta_uint_coder.py::test_elias_delta_uint_encode PASSED
scl/compressors/rANS.py::test_check_encoded_bitarray PASSED
scl/compressors/rANS.py::test_rANS_coding rANS coding: avg_log_prob=1.499, rANS codelen: 1.504 rANS coding: avg_log_prob=1.484, rANS codelen: 1.489 rANS coding: avg_log_prob=1.430, rANS codelen: 1.435 rANS coding: avg_log_prob=2.585, rANS codelen: 2.590 rANS coding: avg_log_prob=0.815, rANS codelen: 0.819 PASSED
scl/compressors/huffman_coder.py::test_huffman_coding_dyadic Avg Bits: 1.0, optimal codelen: 1.0, Entropy: 1.0 Avg Bits: 1.527, optimal codelen: 1.527, Entropy: 1.5 Avg Bits: 1.794, optimal codelen: 1.794, Entropy: 1.75 PASSED
scl/compressors/fixed_bitwidth_compressor.py::test_alphabet_encode_decode PASSED
scl/compressors/fixed_bitwidth_compressor.py::test_fixed_bitwidth_encode_decode PASSED
scl/compressors/fixed_bitwidth_compressor.py::test_text_fixed_bitwidth_file_encode_decode PASSED
scl/external_compressors/zlib_external.py::test_zlib_encode_decode PASSED
scl/external_compressors/zlib_external.py::test_zlib_file_encode_decode PASSED
scl/external_compressors/pickle_external.py::test_pickle_data_compressor PASSED
scl/external_compressors/zstd_external.py::test_zstd_encode_decode PASSED
scl/external_compressors/zstd_external.py::test_zstd_file_encode_decode PASSED

===== 47 passed, 1 xfailed in 13.29s =====
```
1. Probability and Coin Toss (20 points)
Kedar is a big cricket fan. In cricket (like many other sports), there is a coin toss at the beginning to determine which team gets the first chance to bat. However, Kedar thinks that the coin being used is biased and lands as heads (H) much more often than tails (T). Let the probability of the coin landing on H be $p$.

Shubham suggests to Kedar that this problem can be solved (i.e., we can get a fair coin) by tossing the coin twice! If we get an H,T sequence in the two tosses, then we mark it as heads, while if we get a T,H sequence we mark it as tails. If we get an H,H or T,T sequence, then we discard the result and start over!
- [5 points] Do you think Shubham's solution is right? Please justify.
Solution

Shubham's solution is right. The scheme only terminates when an H,T or a T,H sequence is received, and these two sequences have the same probability $p(1-p)$ because of the IID (independent and identically distributed) nature of our coin tosses. The iterative scheme can be thought of as a geometric random variable: after every two tosses, the scheme terminates with probability $2p(1-p)$ (getting either an H,T or a T,H sequence), and we are back to the initial state with probability $p^2 + (1-p)^2$ (getting an H,H or a T,T sequence). No matter after how many rounds of two tosses the scheme terminates, it declares heads or tails with the same probability $\frac{p(1-p)}{2p(1-p)} = \frac{1}{2}$, so the output is indeed a fair coin.
- [5 points] What is the expected number of coin flips needed to get one H or T outcome using Shubham's method?

Solution

Each round of Shubham's scheme uses two coin tosses and terminates with probability $2p(1-p)$, so the number of rounds is a geometric random variable with mean $\frac{1}{2p(1-p)}$. The expected number of coin flips needed to get one H or T outcome using Shubham's method is therefore $2 \cdot \frac{1}{2p(1-p)} = \frac{1}{p(1-p)}$. For example, with $p = 0.8$ this is $\frac{1}{0.8 \times 0.2} = 6.25$ flips on average.
- [5 points] Kedar is still not convinced and thinks that we should really implement this idea to see if it works. Please help out Shubham by implementing the function `shubhams_fair_coin_generator()` in the file `HWs/HW1/hw1_p1.py`. NOTE: please run all of this in the `(ee274_env)` environment which you created, and make sure that you are on the correct branch of the SCL repo (`EE274_Fall23/HWs`). Ensure that the tests, except those in the `HWs` folder, pass before you start implementing the function.

Solution
```python
def shubhams_fair_coin_generator():
    """
    TODO: use the biased_coin_generator() function above to implement
    the fair coin generator

    Outputs:
        toss (str): H, T
        num_biased_tosses (int): number of times the biased_coin_generator()
            was called to generate the output
    """
    toss = None
    num_biased_tosses = 0

    ##############################
    # ADD CODE HERE
    # raise NotImplementedError
    while True:
        toss1, toss2 = biased_coin_generator(), biased_coin_generator()
        num_biased_tosses += 2
        if toss1 != toss2:  # H,T -> declare heads; T,H -> declare tails
            toss = toss1
            break
    ##############################

    return toss, num_biased_tosses
```
- [5 points] It is always a good practice to write tests to ensure the implementation is correct (even if you think it is theoretically alright :)). Please implement the `test_shubhams_fair_coin_generator()` function in `HWs/HW1/hw1_p1.py`. You can test your code by running: `py.test -s -v HWs/HW1/hw1_p1.py`
Solution
```python
def test_shubhams_fair_coin_generator():
    """
    TODO: write a test to check whether the shubhams_fair_coin_generator() is
    really generating fair coin tosses.
    Also, check if the num_biased_tosses matches the answer which you computed
    in part 2.
    """
    # perform the experiment
    # feel free to reduce this when you are testing
    num_trials = 10000
    tosses = []
    num_biased_tosses_list = []
    for _ in range(num_trials):
        toss, num_biased_tosses = shubhams_fair_coin_generator()

        # check if the output is indeed in H,T
        assert toss in ["H", "T"]
        tosses.append(toss)
        num_biased_tosses_list.append(num_biased_tosses)

    # NOTE: We are using the DataBlock object from SCL.
    # Take a look at `core/data_block.py` to understand more about the class.
    # The get_empirical_distribution() outputs a ProbDist class. Take a look at
    # `core/prob_dist.py` to understand this object better.
    tosses_db = DataBlock(tosses)
    empirical_dist_tosses = tosses_db.get_empirical_distribution()

    #############################################
    # ADD CODE HERE
    # 1. add test here to check if empirical_dist_tosses is close to being fair
    # 2. add test to check if avg value of num_biased_tosses matches what you expect
    #    (both within a reasonable error range)
    # You can use np.testing.assert_almost_equal() to check if two numbers are
    # close to some error margin
    # raise NotImplementedError
    np.testing.assert_almost_equal(
        empirical_dist_tosses.probability("H"),
        0.5,
        decimal=1,
        err_msg="Probability of H is not 0.5",
    )
    np.testing.assert_almost_equal(
        empirical_dist_tosses.probability("T"),
        0.5,
        decimal=1,
        err_msg="Probability of T is not 0.5",
    )
    expected_num_biased_tosses = sum(num_biased_tosses_list) / len(num_biased_tosses_list)
    np.testing.assert_almost_equal(
        expected_num_biased_tosses,
        1 / (0.8 * 0.2),
        decimal=1,
        err_msg="Expected number of biased tosses is not 1/(p*(1-p))",
    )
    #############################################
```
- (NOT GRADED, THINK FOR FUN!): Do you think Shubham's solution is optimal? For example, Shubham is using at least 2 biased coin tosses to generate 1 fair coin toss. This seems wasteful in some cases (for example, if the coin is almost unbiased). Can you improve upon the average number of biased coin tosses needed to generate one fair coin toss?

Solution

No, Shubham's solution is not optimal. We can improve upon the average number of biased coin tosses by carefully utilizing the H,H and T,T sequences instead of throwing them away. For those of you who are aware of entropy, this problem has a close relationship to entropy! Here is a detailed analysis of this problem by Michael Mitzenmacher.
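One concrete improvement is sketched below (a minimal sketch, not the optimal extractor: it adds just one level of the multi-level von Neumann scheme, and `biased_coin` is a hypothetical stand-in for the `biased_coin_generator()` of `hw1_p1.py`). The idea: a discarded H,H round and a discarded T,T round are themselves two outcomes of a second, more biased coin, so von Neumann's trick can be applied to pairs of discarded rounds too.

```python
import random

def biased_coin(p=0.8):
    # hypothetical stand-in for biased_coin_generator() in hw1_p1.py
    return "H" if random.random() < p else "T"

def improved_fair_coin_generator():
    """Level 1: the usual von Neumann trick on (toss1, toss2).
    Level 2: discarded HH/TT rounds act as tosses of a second biased coin
    ("H" for HH, "T" for TT); apply the same trick to pairs of them."""
    num_tosses = 0
    discards = []
    while True:
        t1, t2 = biased_coin(), biased_coin()
        num_tosses += 2
        if t1 != t2:
            return t1, num_tosses  # H,T -> "H"; T,H -> "T", as before
        discards.append(t1)  # record the discarded round: HH -> "H", TT -> "T"
        if len(discards) == 2:
            d1, d2 = discards
            discards = []
            if d1 != d2:  # (HH,TT) and (TT,HH) are equally likely -> fair
                return d1, num_tosses
```

Iterating this idea on all the leftover randomness (as in Peres' scheme) approaches the fundamental limit of $H_2(p)$ fair bits per biased toss, where $H_2$ is the binary entropy function.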
Q2: Basic Information theory (20 points)
- [5 points] Let $X$ be a random variable over positive integers with the distribution
$$P(X = i) = 2^{-i}, \quad i = 1, 2, 3, \ldots$$
Compute the entropy $H(X)$ of random variable $X$. What are the optimal code-lengths for a prefix-free code designed for distribution $P$?

Solution

$$H(X) = \sum_{i=1}^{\infty} P(X=i) \log_2 \frac{1}{P(X=i)} = \sum_{i=1}^{\infty} i \cdot 2^{-i} = 2 \text{ bits},$$
where the last equality comes from summing an arithmetico-geometric progression. The optimal code-lengths for a prefix-free code designed for distribution $P$ are $l_i = \log_2 \frac{1}{P(X=i)} = i$ for $i = 1, 2, 3, \ldots$, by using the $l_i = \lceil \log_2 \frac{1}{P(X=i)} \rceil$ rule (no rounding is needed since the distribution is dyadic).
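A quick numeric sanity check of this computation (a standalone sketch, independent of SCL): truncate the infinite sum at a large index, since the tail vanishes geometrically.

```python
import math

# H(X) = sum_i 2^{-i} * log2(1 / 2^{-i}) = sum_i i * 2^{-i}
H = sum((2 ** -i) * math.log2(2 ** i) for i in range(1, 60))
print(H)  # ~2.0 bits
```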
- [5 points] Provide an optimal prefix-free code design for the distribution $P(X=i) = 2^{-i}$.

Solution

One optimal prefix-free code design for the distribution is the unary code, where symbol $i$ is encoded as $(i-1)$ ones followed by a single zero (note you can swap 0 with 1 WLOG):
  - $1 \to 0$
  - $2 \to 10$
  - $3 \to 110$
  - $4 \to 1110$
  - and so on, giving codeword lengths $l_i = i$, matching the optimal lengths from part 1.
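A minimal sketch of this code in Python (plain 0/1 strings; the function names are ours):

```python
def unary_encode(i: int) -> str:
    # symbol i -> (i - 1) ones followed by a zero, so l_i = i
    return "1" * (i - 1) + "0"

def unary_decode(bits: str) -> int:
    # the codeword ends at the first 0; its position determines the symbol
    return bits.index("0") + 1

assert [unary_encode(i) for i in [1, 2, 3, 4]] == ["0", "10", "110", "1110"]
assert unary_decode("1110") == 4
```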
- [10 points] Let $Y$ be a random variable over positive integers such that $E[Y] \leq 2$. Then show that:
$$H(Y) \leq 2.$$
For what distribution of the random variable $Y$ does the equality hold, i.e. $H(Y) = 2$? HINT: KL-Divergence and its properties will be helpful to show this.

Solution

Let's take the KL divergence between the distribution $P_Y$ and the distribution $P_X(i) = 2^{-i}$ defined in Q2.1:
$$D(P_Y \| P_X) = \sum_{i=1}^{\infty} P_Y(i) \log_2 \frac{P_Y(i)}{2^{-i}} = \sum_{i=1}^{\infty} P_Y(i) \log_2 P_Y(i) + \sum_{i=1}^{\infty} i \cdot P_Y(i) = -H(Y) + E[Y],$$
where the last step uses the definitions of expectation and entropy.

Since we know that $D(P_Y \| P_X)$ is non-negative and $E[Y] \leq 2$, we have $H(Y) \leq E[Y] \leq 2$, showing the asked inequality.

We also know that equality in KL divergence holds when $P_Y = P_X$, and hence the equality $H(Y) = 2$ holds for the geometric distribution described in Q2.1, i.e. $P_Y(i) = 2^{-i}$.
Q3. Uniquely decodable codes (25 points)

We mainly explored prefix-free codes in class. But there is a much broader class of codes, the uniquely decodable codes. A code is uniquely decodable if no two sequences (of length 1 or more) of input symbols (say $x_1, \ldots, x_n$ and $y_1, \ldots, y_m$) can produce the same encoding, and hence one can uniquely determine the input sequence from the encoding. In this question, we will show that even with uniquely decodable codes, we cannot do better than the entropy.
- [5 points] Consider the code C1 below. Is C1 uniquely decodable? Implement a decoder for the code below. Briefly justify your decoding procedure.

```
A -> 10
B -> 00
C -> 11
D -> 110
```
Solution

The code C1 is uniquely decodable. The decoding procedure is as follows:

- read the first two bits
- if the read bits are 10, then the decoded symbol is A
- if the read bits are 00, then the decoded symbol is B
- if the read bits are 11, then
  - count the number of 0s in the remaining encoded bitstream until you reach the first 1 (or the end of the stream); say this number is $n$
  - if $n$ is odd, then output D followed by $(n-1)/2$ B's
  - if $n$ is even, then output C followed by $n/2$ B's
- repeat the above procedure till you have decoded all the symbols.
This procedure works because C and D are the only codewords sharing a prefix, and they differ by a single 0 bit; the only other codeword that can contribute 0s after the prefix 11 is B = 00, which consumes 0s in pairs. Hence the parity of the run of 0s following 11 unambiguously determines whether it began with C or D, and the decoding procedure is unambiguous.
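Here is a runnable sketch of the decoder described above (plain Python strings instead of SCL BitArrays; the function name `decode_C1` is ours):

```python
def decode_C1(bits: str) -> str:
    decoded, i = [], 0
    while i < len(bits):
        first_two = bits[i : i + 2]
        i += 2
        if first_two == "10":
            decoded.append("A")
        elif first_two == "00":
            decoded.append("B")
        else:  # "11": either C, or the prefix of D = 110
            n = 0  # count the 0s that follow, up to the next 1 or end of stream
            while i + n < len(bits) and bits[i + n] == "0":
                n += 1
            i += n
            if n % 2 == 1:  # odd run: one 0 completes D, the rest are B = 00 pairs
                decoded.append("D")
                decoded.extend(["B"] * ((n - 1) // 2))
            else:  # even run: it was C, and all the 0s are B = 00 pairs
                decoded.append("C")
                decoded.extend(["B"] * (n // 2))
    return "".join(decoded)

assert decode_C1("10" + "110" + "00") == "ADB"  # zero-run of 3 -> D, then one B
assert decode_C1("11" + "00" + "00") == "CBB"   # zero-run of 4 -> C, then two Bs
```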
Consider an alphabet $\mathcal{X}$ and a uniquely decodable code with code lengths for $x \in \mathcal{X}$ denoted by $l(x)$. Also, we denote the codelength of the n-tuple $x^n = (x_1, \ldots, x_n)$ as $l(x^n) = \sum_{i=1}^{n} l(x_i)$.

- [5 points] Show that
$$\left( \sum_{x \in \mathcal{X}} 2^{-l(x)} \right)^n = \sum_{x^n \in \mathcal{X}^n} 2^{-l(x^n)}.$$

Solution

$$\left( \sum_{x \in \mathcal{X}} 2^{-l(x)} \right)^n = \left( \sum_{x_1 \in \mathcal{X}} 2^{-l(x_1)} \right) \cdots \left( \sum_{x_n \in \mathcal{X}} 2^{-l(x_n)} \right) = \sum_{x_1 \in \mathcal{X}} \cdots \sum_{x_n \in \mathcal{X}} 2^{-l(x_1)} \cdots 2^{-l(x_n)} = \sum_{x_1 \in \mathcal{X}} \cdots \sum_{x_n \in \mathcal{X}} 2^{-\left( l(x_1) + \cdots + l(x_n) \right)} = \sum_{x^n \in \mathcal{X}^n} 2^{-l(x^n)},$$

where

- the first equality just rewrites the $n$th power of the Kraft sum as a product of $n$ distinct sums by renaming the summation variable.
- the second equality pulls the sums out since the summation variables are all different.
- the third equality moves the product into the exponent, where it becomes a sum.
- the fourth equality just uses the fact that the code length of $x^n$ is simply the sum of the code lengths of the symbols $x_1, \ldots, x_n$.
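The identity is easy to verify numerically for a small code; below is a standalone sketch using the code lengths of C1 from part 1 (not part of the HW code):

```python
import itertools

l = {"A": 2, "B": 2, "C": 2, "D": 3}  # code lengths of C1
n = 3

lhs = sum(2 ** -lx for lx in l.values()) ** n
rhs = sum(2 ** -sum(l[x] for x in tup) for tup in itertools.product(l, repeat=n))
assert abs(lhs - rhs) < 1e-12  # both equal 0.875**3 = 0.669921875
```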
- [5 points] Let $l_{\max} = \max_{x \in \mathcal{X}} l(x)$. Then we can rewrite the summation as:
$$\sum_{x^n \in \mathcal{X}^n} 2^{-l(x^n)} = \sum_{k=1}^{n \, l_{\max}} \left| \left\{ x^n : l(x^n) = k \right\} \right| \, 2^{-k}.$$
NOTE: Rewriting a summation is a common proof trick, and is a useful one to watch out for! Using Q3.2 and the identity above, show that:
$$\left( \sum_{x \in \mathcal{X}} 2^{-l(x)} \right)^n \leq n \, l_{\max}.$$

Solution

Let the maximum codelength be $l_{\max}$. Then note that
$$\sum_{x^n \in \mathcal{X}^n} 2^{-l(x^n)} = \sum_{k=1}^{n \, l_{\max}} \left| \left\{ x^n : l(x^n) = k \right\} \right| \, 2^{-k}.$$
Here we have clubbed the sum according to the values of $l(x^n)$. Using the result from Q3.2, this implies that
$$\left( \sum_{x \in \mathcal{X}} 2^{-l(x)} \right)^n = \sum_{k=1}^{n \, l_{\max}} \left| \left\{ x^n : l(x^n) = k \right\} \right| \, 2^{-k} \leq \sum_{k=1}^{n \, l_{\max}} 2^k \, 2^{-k} = n \, l_{\max},$$
where the inequality uses the fact that the code is uniquely decodable, and for such a code the set of $n$-tuples producing a length-$k$ bit sequence can have at most $2^k$ elements (otherwise two distinct $n$-tuples would share the same $k$-bit encoding).
- [5 points] Using Q3.3, show that uniquely decodable codes satisfy Kraft's inequality, i.e.
$$\sum_{x \in \mathcal{X}} 2^{-l(x)} \leq 1.$$

Solution

Using Q3.3, we have that $\left( \sum_{x \in \mathcal{X}} 2^{-l(x)} \right)^n \leq n \, l_{\max}$ for every $n$, for a uniquely decodable code! Therefore, taking the $n$-th root on both sides and letting $n$ go to infinity, we get
$$\sum_{x \in \mathcal{X}} 2^{-l(x)} \leq \lim_{n \to \infty} \left( n \, l_{\max} \right)^{1/n} = 1,$$
therefore showing that Kraft's inequality holds for uniquely decodable codes. (As a sanity check, the code C1 from part 1 has lengths $\{2, 2, 2, 3\}$ and Kraft sum $3 \cdot 2^{-2} + 2^{-3} = 7/8 \leq 1$.)
- [5 points] Now show that for any uniquely decodable code for $X \sim P$ with codeword lengths $\{ l(x) \}_{x \in \mathcal{X}}$,
$$E[l(X)] \geq H(X).$$
Also argue that for any uniquely decodable code, it is possible to construct a prefix-free code with the same codeword lengths. This means that uniquely decodable codes generally do not provide any benefits over prefix-free codes and instead have a more complicated decoding procedure!

NOTE: Feel free to refer to the corresponding proof for prefix-free codes.

Solution

Consider
$$E[l(X)] - H(X) = \sum_{x} P(x) l(x) - \sum_{x} P(x) \log_2 \frac{1}{P(x)} = \sum_{x} P(x) \log_2 \frac{P(x)}{2^{-l(x)}} = D(P \| Q) - \log_2 \left( \sum_{x'} 2^{-l(x')} \right) \geq 0,$$
where $Q(x) = 2^{-l(x)} / \sum_{x'} 2^{-l(x')}$ is a valid probability distribution. The last inequality holds because the KL divergence is non-negative and, by Kraft's inequality for uniquely decodable codes (Q3.4), $\sum_{x'} 2^{-l(x')} \leq 1$, so $-\log_2 \left( \sum_{x'} 2^{-l(x')} \right) \geq 0$.

The scheme to get a prefix-free code from a uniquely decodable code would then be to extract the codelengths from the uniquely decodable code (which, as shown above, satisfy Kraft's inequality) and then use the same code-lengths to construct a prefix-free code by placing codewords on a binary tree, similar to the Huffman/Shannon tree constructions.
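A sketch of that scheme (the function name is ours; it uses the same state-based codeword assignment as the Shannon tree construction in Q4, and works whenever the input lengths satisfy Kraft's inequality, which Q3.4 guarantees):

```python
def prefix_free_code_from_lengths(lengths: dict) -> dict:
    code, state, cur_len = {}, 0, 0
    # assign codewords in order of increasing length
    for s, l in sorted(lengths.items(), key=lambda kv: kv[1]):
        state <<= l - cur_len  # descend to depth l in the binary code tree
        cur_len = l
        code[s] = format(state, f"0{l}b")
        state += 1
    return code

# the lengths of the uniquely decodable code C1 yield an equivalent prefix-free code
print(prefix_free_code_from_lengths({"A": 2, "B": 2, "C": 2, "D": 3}))
# -> {'A': '00', 'B': '01', 'C': '10', 'D': '110'}
```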
Q4: Shannon Code(s) (25 points)

In class, we saw one version of the Shannon codes: the tree-based construction.

- [5 points] Let's call it the `ShannonTreeEncoder` and the `ShannonTreeDecoder`. Manually calculate what the codewords should be for the following distributions:

```python
ProbabilityDist({"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25})
ProbabilityDist({"A": 0.5, "B": 0.25, "C": 0.12, "D": 0.13})
ProbabilityDist({"A": 0.9, "B": 0.1})
```
Solution

Recall that the Shannon codelength for a symbol with probability $p$ is $\lceil \log_2(1/p) \rceil$, and the tree-based construction assigns codewords in order of decreasing probability.

- For the first distribution, all symbols have codelength $\lceil \log_2 4 \rceil = 2$, so the codewords should be `{"A": BitArray("00"), "B": BitArray("01"), "C": BitArray("10"), "D": BitArray("11")}`.
- For the second distribution, the codelengths are A: 1, B: 2, D: 3, C: 4 (in order of decreasing probability), so the codewords should be `{"A": BitArray("0"), "B": BitArray("10"), "C": BitArray("1110"), "D": BitArray("110")}`.
- For the third distribution, the codelengths are A: $\lceil \log_2(1/0.9) \rceil = 1$ and B: $\lceil \log_2(1/0.1) \rceil = 4$, so the codewords should be `{"A": BitArray("0"), "B": BitArray("1000")}`.
- [10 points] Complete the code for the `ShannonTreeEncoder` in `hw1_p4.py`. Also, complete the `test_shannon_tree_coding_specific_case()` to check the correctness of your codewords in part 1 against your implemented code. If you encounter errors, you should identify the failing test case and fix any issues. Note: we will be using automated tests to grade your code, so it is important that you do not change the function signatures. We will have more test cases for grading, so feel free to add more test cases of your own when working on the problem.

Solution
```python
def generate_shannon_tree_codebook(cls, prob_dist):
    # sort the probability distribution in decreasing probability
    sorted_prob_dist = ProbabilityDist.get_sorted_prob_dist(
        prob_dist.prob_dict, descending=True
    )
    codebook = {}

    ############################################################
    # ADD CODE HERE
    # raise NotImplementedError
    import math
    from utils.bitarray_utils import uint_to_bitarray

    cur_state = 0  # tracks the next unused node at the current tree depth
    cur_codelen = 1
    for s in sorted_prob_dist.prob_dict:
        codelen = math.ceil(sorted_prob_dist.neg_log_probability(s))
        # move down to depth `codelen`: each extra level doubles the free nodes
        cur_state = cur_state << (codelen - cur_codelen)
        cur_codelen = codelen
        codebook[s] = uint_to_bitarray(cur_state, bit_width=codelen)
        cur_state += 1
    ############################################################
    return codebook
```
```python
def test_shannon_tree_coding_specific_case():
    # NOTE -> this test must succeed with your implementation
    ############################################################
    # Add the computed expected codewords for distributions presented in part 1
    # to these lists to improve the test
    # raise NotImplementedError
    distributions = [
        ProbabilityDist({"A": 0.5, "B": 0.5}),
        ProbabilityDist({"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}),
        ProbabilityDist({"A": 0.5, "B": 0.25, "C": 0.12, "D": 0.13}),
        ProbabilityDist({"A": 0.9, "B": 0.1}),
    ]
    expected_codewords = [
        {"A": BitArray("0"), "B": BitArray("1")},
        {"A": BitArray("00"), "B": BitArray("01"), "C": BitArray("10"), "D": BitArray("11")},
        {"A": BitArray("0"), "B": BitArray("10"), "C": BitArray("1110"), "D": BitArray("110")},
        {"A": BitArray("0"), "B": BitArray("1000")},
    ]
    ############################################################

    def test_encoded_symbol(prob_dist, expected_codeword_dict):
        """
        test if the encoded symbol is as expected
        """
        encoder = ShannonTreeEncoder(prob_dist)
        for s in prob_dist.prob_dict.keys():
            assert encoder.encode_symbol(s) == expected_codeword_dict[s]

    for i, prob_dist in enumerate(distributions):
        test_encoded_symbol(prob_dist, expected_codeword_dict=expected_codewords[i])
```
We have also studied the tree-parsing based decoding for Shannon codes (or prefix-free codes in general). This decoding has already been implemented in the `ShannonTreeDecoder`. The core tree-parsing logic can be found here. The tree-parsing based decoder works well; however, it includes many if/else conditions, which leads to branching -- this is pretty bad for modern hardware. (Here is some additional information about this inefficiency for those interested.)

In this next subproblem, we will implement another decoder, which we call the `ShannonTableDecoder`. The `ShannonTableDecoder` works by creating a decoding lookup table and then using it for decoding symbols without branching.
As an example, let's consider the following simple code:

```
encoding_table = {"A": 0, "B": 10, "C": 110, "D": 111}
```
In this case, the `ShannonTableDecoder` keeps track of data using the following tables:

```
decoding_table = {
    000: "A",
    001: "A",
    010: "A",
    011: "A",
    100: "B",
    101: "B",
    110: "C",
    111: "D",
}
codelen_table = {
    "A": 1,
    "B": 2,
    "C": 3,
    "D": 3,
}
max_codelen = 3
```

The tables can then be used for decoding as per the pseudo-code below.

```
def decode_symbol_table(encoded_bitarray):
    state = encoded_bitarray[:max_codelen]
    decoded_symbol = decoding_table[state]
    num_bits_consumed = codelen_table[decoded_symbol]
    return decoded_symbol, num_bits_consumed
```
- [5 points] Based on the pseudo-code above, explain in a few sentences how the table based decoding works. Why do you need to output both the decoded symbol and the number of bits consumed?

Solution

`decode_symbol_table(encoded_bitarray)` takes the first `max_codelen` bits from the `encoded_bitarray` and uses them as a key into the `decoding_table` to get the decoded symbol. It then looks up the decoded symbol in the `codelen_table` to get the number of bits actually consumed. Both outputs are needed because a symbol's codeword can be shorter than the `max_codelen` bits that were peeked at: the number of bits consumed tells the caller where the next codeword starts in the bitstream.
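As a concrete illustration, here is the pseudo-code fleshed out on the example tables above (a sketch using bit-strings as keys; SCL's actual implementation works with BitArray objects):

```python
decoding_table = {
    "000": "A", "001": "A", "010": "A", "011": "A",
    "100": "B", "101": "B", "110": "C", "111": "D",
}
codelen_table = {"A": 1, "B": 2, "C": 3, "D": 3}
max_codelen = 3

def decode_symbol_table(encoded_bits: str):
    # pad with 0s in case fewer than max_codelen bits remain at the stream's end
    state = encoded_bits[:max_codelen].ljust(max_codelen, "0")
    decoded_symbol = decoding_table[state]
    num_bits_consumed = codelen_table[decoded_symbol]
    return decoded_symbol, num_bits_consumed

encoded = "0" + "10" + "110" + "111"  # A, B, C, D
pos, decoded = 0, []
while pos < len(encoded):
    symbol, consumed = decode_symbol_table(encoded[pos:])
    decoded.append(symbol)
    pos += consumed  # this is exactly why num_bits_consumed must be returned
assert "".join(decoded) == "ABCD"
```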
- [5 points] Complete the `ShannonTableDecoder` in `hw1_p4.py` by implementing the `create_decoding_table` function. You can verify if your implementation is correct by using the `test_shannon_table_coding_end_to_end` function.

Solution
```python
def create_decoding_table(prob_dist: ProbabilityDist):
    """
    :param prob_dist: ProbabilityDist object
    :return:
        decoding_table: dictionary mapping bitarrays to symbols
        codelen_table: dictionary mapping symbols to code-length
        max_codelen: maximum code-length of any symbol in the codebook
    """
    # create the encoding table
    encoding_table = ShannonTreeEncoder.generate_shannon_tree_codebook(prob_dist)

    ############################################################
    # ADD CODE HERE
    # NOTE:
    # - The utility function ProbabilityDist.neg_log_probability,
    # - scl.utils.bitarray_utils.uint_to_bitarray and
    #   scl.utils.bitarray_utils.bitarray_to_uint might be useful
    # raise NotImplementedError
    ############################################################

    # we now create the decoding table based on the encoding table
    # first get the maximum length
    max_codelen = max(len(code) for code in encoding_table.values())

    # create an empty table of size 2^{max_codelen}
    decoding_table = {}
    codelen_table = {}

    # let's fill the decoding table: each codeword of length l "owns" all
    # 2^{max_codelen - l} table entries that start with it
    for s, code in encoding_table.items():
        codelen = len(code)
        start_index = bitarray_to_uint(code + "0" * (max_codelen - codelen))
        num_indices = 1 << (max_codelen - codelen)
        for ind in range(start_index, start_index + num_indices):
            decoding_table[str(uint_to_bitarray(ind, bit_width=max_codelen))] = s
        codelen_table[s] = codelen

    return decoding_table, codelen_table, max_codelen
```
Q5. Huffman coding on real data (20 points)

In class, we have studied compression for sources where we knew the distribution of the data beforehand. In most real-life scenarios, this is not the case. Oftentimes, we use the empirical distribution of the data, design a code for this distribution, and then encode the data with this code. For the decoding to work, we also need to transmit the distribution or the code in some way. This problem focuses on compression of a provided file with Huffman coding.

Before we start with the code, let's look at how important it is to transmit the distribution faithfully to achieve lossless compression.
- [10 points] Consider the probability distribution represented as a mapping from a symbol to its probability: `{A: 0.11, B: 0.09, C: 0.09, D: 0.71}`.

a. What are the codeword lengths for the symbols in the Huffman code?
Solution

Building the Huffman tree,

```mermaid
graph TD
  N1["B<br/>(p=0.09)"]:::endnode
  N2("C<br/>(p=0.09)"):::endnode
  N3("A<br/>(p=0.11)"):::endnode
  N4("D<br/>(p=0.71)"):::endnode
  N5("N1<br/>(p=0.18)") --> N1
  N5 --> N2
  N6("N2<br/>(p=0.29)") --> N3
  N6 --> N5
  N7("*<br/>(p=1.0)") --> N4
  N7 --> N6
  style N5 fill:#dddddd
  style N6 fill:#dddddd
  style N7 fill:#dddddd
```
the codeword lengths are `{A: 2, B: 3, C: 3, D: 1}`.
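These lengths can be double-checked with a few lines of standalone Python (a sketch using `heapq`, independent of the SCL implementation):

```python
import heapq

def huffman_codelens(prob_dict):
    # heap entries: (probability, tie-breaking counter, {symbol: depth})
    heap = [(p, i, {s: 0}) for i, (s, p) in enumerate(prob_dict.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p1, _, d1 = heapq.heappop(heap)
        p2, _, d2 = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}  # merging adds one level
        heapq.heappush(heap, (p1 + p2, count, merged))
        count += 1
    return heap[0][2]

print(huffman_codelens({"A": 0.11, "B": 0.09, "C": 0.09, "D": 0.71}))
# -> {'A': 2, 'B': 3, 'C': 3, 'D': 1}
```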
After a harrowing cab experience in the US, Pulkit wants to send the message `BADCAB` to Shubham to describe his frustration. Shubham and Pulkit had agreed to use Huffman encoding for such communications in the past. He encodes the message `BADCAB` using the original distribution `{A: 0.11, B: 0.09, C: 0.09, D: 0.71}`, and sends both the encoded message and this distribution to Shubham (so that Shubham is able to decode it on his end by building a Huffman tree on his own).
b. Unfortunately, Shubham decodes the message to be `CADBAC` instead of `BADCAB`, leaving him confused. What might have gone wrong during this communication between the two?

HINT: notice the specific encoded and decoded symbols in the message.
Solution

When Shubham built his Huffman tree, he swapped the codewords for `B` and `C`, since they have the same codelength (and the same probability), and hence decoded the message incorrectly by switching the two letters. Recall that Huffman codes are not unique, so the codeword lengths alone are not enough to reconstruct the Huffman tree; the order of the symbols in the distribution also matters for reconstructing the same tree.
Pulkit and Shubham realized this issue and fixed it.

c. In the spirit of frugality, Pulkit decides to transmit a rounded distribution with just one significant decimal (`{A: 0.1, B: 0.1, C: 0.1, D: 0.7}`). What are the codeword lengths for the symbols in the Huffman code for this distribution? Are there multiple possible Huffman code lengths? Why?
Solution

There are multiple possible Huffman code lengths since the Huffman tree is no longer unique. For example, `{A: 2, B: 3, C: 3, D: 1}` and `{A: 3, B: 3, C: 2, D: 1}` are both valid sets of Huffman codeword lengths, depending on how ties are broken when merging the three symbols of probability 0.1.
These issues suggest that it is important to transmit the distribution faithfully to achieve lossless compression. In addition, the encoder and the decoder must share the same deterministic algorithm to construct the Huffman tree from the distribution.
With this context, look at the provided starter code in `hw1_p5.py`, which does the following:

1. Loop over the file in blocks of size 50 KB.
2. For each block, compute the empirical distribution from the data. This simply means we get the counts of each possible byte value in the block and normalize by the block length to get an approximate empirical probability distribution.
3. Apply Huffman encoding on the block assuming this empirical distribution.
4. Encode the empirical probability distribution in bits so that the decoder knows what probability distribution to use for decoding.
5. The final encoding is the encoding from step 4 concatenated with the encoding from step 3.

To understand the overall process, you should go through the provided code, specifically the `encode_block` and `decode_block` functions.
- [10 points] Implement the encoding and decoding of the probability distribution (step 4 above) in the functions `encode_prob_dist` & `decode_prob_dist` respectively. There are multiple ways to encode this, and we are not looking for the most optimized implementation here. Note that we are working with 50 KB blocks here, while the alphabet size is just 256. So as long as your implementation is not absurdly bad, the overhead will be small. You might find the following pointers useful as you implement the function:
  - The input `bitarray` to `decode_prob_dist` can include more than the encoding of the probability distribution itself. Thus, it should only read as much as it needs and return the number of bits read, so that the Huffman decoding can start from the next bit.
  - Python dictionaries preserve insertion order (like OrderedDicts). If you are using dictionaries to represent the probability distribution, then it is critical to maintain this ordering while creating Huffman trees during encoding/decoding. See the Pulkit-Shubham communication fiasco above for an example of what can go wrong if the order is not preserved.
  - Python's `float` type is equivalent to the `double` type in C (see this for a refresher). As long as you provide the same exact distribution to the encoder and decoder, all is well!
  - Also, you will hear about Canonical Huffman codes in class, which provide an alternative way to solve some of these issues in practice. You do not need them for this question.
Verify that your implementation is correct by running

`py.test -s -v HWs/HW1/hw1_p5.py`

If you encounter errors, you should identify the failing test case and fix any issues. Note: we will be using automated tests to grade your code, so it is important that you do not change the function signatures and that your code passes these tests. We will have more test cases for grading, so feel free to add more test cases of your own when working on the problem. But these should be sufficient to get you started.
Solution

One way to encode and decode the probability distribution is by using pickle. Students came up with various ways to serialize and deserialize the probability table, and we have accepted all of them as valid solutions as long as they pass the end-to-end encoding-decoding test.
```python
def encode_prob_dist(prob_dist: ProbabilityDist) -> BitArray:
    """Encode a probability distribution as a bit array

    Args:
        prob_dist (ProbabilityDist): probability distribution over 0, 1, 2, ..., 255
            (note that some probabilities might be missing if they are 0).

    Returns:
        BitArray: encoded bit array
    """
    #########################
    # ADD CODE HERE
    # bits = BitArray(), bits.frombytes(byte_array), uint_to_bitarray might be
    # useful to implement this
    # raise NotImplementedError("You need to implement encode_prob_dist")

    # pickle prob dist and convert to bytes
    pickled_bits = BitArray()
    pickled_bits.frombytes(pickle.dumps(prob_dist))
    len_pickled = len(pickled_bits)
    # encode length of pickle
    length_bitwidth = 32
    length_encoding = uint_to_bitarray(len_pickled, bit_width=length_bitwidth)

    encoded_probdist_bitarray = length_encoding + pickled_bits
    #########################

    return encoded_probdist_bitarray


def decode_prob_dist(bitarray: BitArray) -> Tuple[ProbabilityDist, int]:
    """Decode a probability distribution from a bit array

    Args:
        bitarray (BitArray): bitarray encoding probability dist followed by arbitrary data

    Returns:
        prob_dist (ProbabilityDist): the decoded probability distribution
        num_bits_read (int): the number of bits read from bitarray to decode probability distribution
    """
    #########################
    # ADD CODE HERE
    # bitarray.tobytes() and bitarray_to_uint() will be useful to implement this
    # raise NotImplementedError("You need to implement decode_prob_dist")

    # first read 32 bits from start to get the length of the pickled sequence
    length_bitwidth = 32
    length_encoding = bitarray[:length_bitwidth]
    len_pickled = bitarray_to_uint(length_encoding)
    # bits to bytes
    pickled_bytes = bitarray[length_bitwidth: length_bitwidth + len_pickled].tobytes()

    prob_dist = pickle.loads(pickled_bytes)
    num_bits_read = length_bitwidth + len_pickled
    #########################

    return prob_dist, num_bits_read
```
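For reference, here is one pickle-free way to serialize the table (a sketch, not the graded solution; the function names with the `_alt` suffix are ours, and it assumes the same imports as `hw1_p5.py`: `BitArray`, `uint_to_bitarray`, `bitarray_to_uint`, `ProbabilityDist`, plus the standard `struct` module). It writes the alphabet size, then one fixed-width record per symbol, preserving dictionary order.

```python
import struct

def encode_prob_dist_alt(prob_dist: ProbabilityDist) -> BitArray:
    # 9 bits for the alphabet size (up to 256 symbols), then per symbol:
    # 8 bits for the byte value + 64 bits for the probability as an IEEE double
    encoded = uint_to_bitarray(len(prob_dist.prob_dict), bit_width=9)
    for symbol, prob in prob_dist.prob_dict.items():
        encoded += uint_to_bitarray(symbol, bit_width=8)
        prob_bits = BitArray()
        prob_bits.frombytes(struct.pack(">d", prob))  # exact double, no rounding
        encoded += prob_bits
    return encoded

def decode_prob_dist_alt(bitarr: BitArray):
    num_symbols = bitarray_to_uint(bitarr[:9])
    pos, prob_dict = 9, {}
    for _ in range(num_symbols):
        symbol = bitarray_to_uint(bitarr[pos : pos + 8])
        prob = struct.unpack(">d", bitarr[pos + 8 : pos + 72].tobytes())[0]
        prob_dict[symbol] = prob  # insertion order matches the encoder's
        pos += 72
    return ProbabilityDist(prob_dict), pos
```

This costs 9 bits plus 72 bits per symbol that actually appears in the block (at most about 2.3 KB for a full 256-symbol alphabet), while reproducing the exact floats and symbol order on the decoder side.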
- [5 points] Now download the Sherlock Holmes novel "The Hound of the Baskervilles by Arthur Conan Doyle" using the following command:

`curl -o HWs/HW1/sherlock.txt https://www.gutenberg.org/files/2852/2852-0.txt`

Run the following commands to compress the file using your newly developed compressor, decompress it back, and verify that the decompression succeeds.

```
python HWs/HW1/hw1_p5.py -i HWs/HW1/sherlock.txt -o HWs/HW1/sherlock.txt.huffz
python HWs/HW1/hw1_p5.py -d -i HWs/HW1/sherlock.txt.huffz -o HWs/HW1/sherlock.txt.decompressed
cmp HWs/HW1/sherlock.txt HWs/HW1/sherlock.txt.decompressed
```

Nothing should be printed on the terminal as a result of the last command if the files match.
  - Report the size of your compressed file. You can run `wc -c sherlock.txt.huffz` to print the size.
  - How much is the overhead due to storing the probability distribution? Note: one easy way to check this is to replace `encode_prob_dist` with a version that just returns `BitArray()`. Obviously the decoding won't work with this change!
  - Print the Huffman tree obtained on sherlock.txt by using the `print_huffman_tree` function. What do you observe? Does the printed Huffman tree make sense? Why or why not?
Solution

  - The size of the compressed file using the implementation given in the solution above is `221804` bytes.
  - The size of the compressed file without the probability table is `214409` bytes. So the overhead using our approach is `7395` bytes (~3.3% of the compressed size).
  - We get a huge Huffman tree, so we are skipping reproducing it here in the solution manual. But the observed Huffman tree makes sense, with frequently occurring vowels and punctuation getting shorter codewords compared to rarely occurring characters like `X` or unassigned ASCII codes.
- [5 points] Now run gzip on the file (command: `gzip < HWs/HW1/sherlock.txt > HWs/HW1/sherlock.txt.gz`) and report the size of the gzip file (`sherlock.txt.gz`). What happens if you run our Huffman compressor on the gzipped file above? Do you see any further reduction in size? Why/why not?

Solution

  - The size of the compressed file using gzip is `134718` bytes.
  - The size of the compressed file using our Huffman compressor on the gzipped file is `143409` bytes. So there is no further reduction in size; instead, the size increases! This is because the Huffman compressor is not able to compress the file further: the file is already compressed using gzip, which uses Huffman coding as part of its compression procedure (gzip also uses LZ77, as described in the note below).
You will observe that gzip does significantly better than the Huffman coding, even if we ignore the overhead from part 3. One of the reasons for this is that gzip compresses data in blocks instead of at the level of single symbols like we implemented in this problem. Another reason is that gzip uses a more sophisticated algorithm called LZ77. LZ77 is a lossless compression algorithm that is based on the observation that many files contain repeated patterns. It uses a sliding window to find repeated patterns and replaces them with a pointer to the location of the pattern. This is a very powerful technique that is used in many compression algorithms. We will study LZ77 in more detail in a future class and in the next homework.