Huffman Compression
Introduction:
This project is intended to introduce you to compression algorithms. Most people have used these programs extensively, without realizing it or understanding how compression works. One classic example of a program that employs a compression algorithm is WinZip.
In this project you will be asked to write assembly code that will decompress a file that has been compressed using the Huffman Algorithm. This algorithm is perhaps the simplest compression algorithm. Although this algorithm is not difficult, thorough understanding is required in order for you to be able to effectively decompress the file. Therefore, please read the following description of the Huffman algorithm very carefully and ask any questions that arise before programming this project.
The first thing you should be curious to find out is how compression works in general. There are currently two types of coding mechanisms for characters, ASCII and UNICODE. ASCII is used in most programming languages, such as C, while UNICODE is used in some recently-developed languages such as Java. UNICODE provides more characters so that languages with larger alphabets can be accommodated. Unfortunately, this set also takes up more memory, so in this project we shall focus on the ASCII set. The ASCII set contains 256 characters. In this set, each character is given a numerical value. Therefore eight bits are required for each character in order to be able to store each of the possible values (0 - 255). Say we have the following string: “What is the point of this?” In ASCII, the necessary characters to output this string would be stored in the following way:
space: 0010 0000
? : 0011 1111
W: 0101 0111
a : 0110 0001
e : 0110 0101
f : 0110 0110
h : 0110 1000
i : 0110 1001
n : 0110 1110
o : 0110 1111
p : 0111 0000
s : 0111 0011
t : 0111 0100
So the string itself would be stored in memory in the following manner:
01010111 01101000 01100001 01110100 00100000 01101001 01110011 00100000
01110100 01101000 01100101 00100000 01110000 01101111 01101001 01101110
01110100 00100000 01101111 01100110 00100000 01110100 01101000 01101001
01110011 00111111
So as you can see, this string would take up 208 bits in memory. But, what if we were to create our own character set based upon only the characters that occur in our string? Since we have exactly 13 characters, we would need 4 bits for each character. We could then define the same characters in the following way:
space: 0000
? : 0001
W: 0010
a : 0011
e : 0100
f : 0101
h : 0110
i : 0111
n : 1000
o : 1001
p : 1010
s : 1011
t : 1100
Then the string would be stored in memory in the following manner:
0010 0110 0011 1100 0000 0111 1011 0000 1100 0110 0100 0000 1010 1001 0111 1000 1100 0000 1001 0101 0000 1100 0110 0111 1011 0001
So as you can see, this string would take up 104 bits in memory. Although this difference is miniscule, notice that a compressed string takes up half the memory that an uncompressed one does. Therefore compressing large documents may be very useful.
Now that you have a general idea about how compression works, you should be wondering about the specifics of how Huffman Compression works. The program first traverses the file and creates a priority queue based upon which characters occur most often. Then, this priority queue is used to aid the creation of a binary tree. Lastly, the tree is used to obtain the values of each of the characters.
If you do not understand a priority queue, then you should look it up. This is a standard data structure and will/was discussed in one of the 15B classes and other classes. As whe will/have already covered trees in previous our classes, I will not take the time to define a tree and all of the terminology associated with this data structure here. I strongly suggest you brush up on your knowledge of trees before continuing. This binary tree is very unique. It has to be created in a way such that only the leaves of the tree will define characters. As a result, it is very easy to tell where one character ends and another begins. Now let’s discuss the rules for creating this tree. You are to assume that each character in a priority queue is a node and that its priority is its weight. You are then to take the two nodes with the smallest weights and give them a common root. This root will now have the combined weight of the two nodes. For example: character ‘a’ occurred three times and character ‘b’ occurred two times. Then the root would have a weight of five. Then take the root node and compare it with the other nodes of the priority queue. This process continues until all of the nodes are together in one tree. Since this process is difficult to describe, take a look at the following example:
Say you create the following priority queue for “What is the point of this?”:
Character: Weight:
space 5
t 4
h 3
i 3
s 2
o 2
W 1
a 1
e 1
p 1
n 1
f 1
? 1
We then pick the two nodes with the smallest weighting and give them a common root:

We continue until the tree is created:

Now if we assume that every left child of a node is designated by a zero, and every right child is designated by a one, we can obtain the values for each of the characters:
space : 00
t
: 111
h : 100
i
: 011
s : 1011
o : 1010
W: 0100
a : 01011
e : 01010
p : 11001
n : 11000
f :11011
? : 11010
Therefore the string will be stored in memory in the following way:
0100 100 01011 111 00 011 1011 00 111 100 01010 00 11001 1010 011 11000 111 00 1010 11011 00 111 100 011 1011 11010
As you can see, by using the Huffman Algorithm, we were able to store the string into only 90 bits.
Overview:
Performing both the compression and decompression in assembly is a daunting project. Therefore, you are only responsible for the decompression side. Decompression involves three steps.
The first one is to turn the header of the file, which contains the information necessary to decode the file, into a priority queue. Since the priority queue has to be made in exactly the same way as the one that encoded the file, this step has been taken care of for you.
The second step is to use the priority queue to create a binary tree. The function you will write to perform this operation will be called tcreate. This function will use the provided dequeue function to obtain the two elements that have the lowest priority. These elements will be the children of a root that tcreate will allocate and enqueue. This will continue until every element in the queue is connected to a common root with every other element. Once the tree has been created, it is your function’s responsibility to store its root into the global variable, troot. Since the process of creating the tree in decompression has to be very similar to the method used in compression, the first element obtained from the dequeue function will always be the left child and the second one will always be the right child. Pause for a second and try to figure out the last few sentences. They may seem very vague and hard to understand at first. If after reading the sentences over a few more times you still do not understand what is going on, get help from either Professor Kastner or a TA. It will be impossible for you to successfully complete this project without thoroughly understanding these directions.
The third step is to write a function decompress that will decompress the data with the help of the previously created binary tree. This function will be provided with one argument: a pointer to the compressed data. The function must decompress and output the data to the screen.
Technical Stuff:
The first question you should be having is how to perform memory allocation in spim. Take a look at the system calls table (syscall) in your textbook, more specifically the sbrk entry…
The second question that should be on your mind is how the nodes of the binary tree and the priority queue are to look. You should be especially curious of this because the same nodes are used to create both a priority queue and a binary tree. Each of the nodes will have a byte for the character value, a byte for the priority (rate of occurrence), two bytes of padding so that the memory addresses are word aligned, a pointer to the previous node, a pointer to the next node, a pointer to the left child, and a pointer to the right child. Remember, it is your job to ignore the character values unless the pointers to the children are null (thereby designating the node to be a leaf).
Every node will look like the following:

You should also be wondering how to find out what the end of the compressed data is. That is very simple. Encoded along with all of the other characters is the End-of-File character. This character is defined as -1 (or 255 when it is decoded). When you decode this character, you know that you have reached the end of the compressed data.
Hints:
Assume that all of the arguments and global variables are valid. Do not waste your time checking them.
You have been provided the code in c for creating the binary tree. Finding this code and understanding it may help you immensely in designing your tcreate function.
Output each character to the string as soon as you have decoded it, do not complicate the project by attempting to store all of the characters to the stack first.
If you are still uncomfortable with trees or priority queues, now would be a good time to brush up on them. Remember, assembly will do nothing for you, so you have to have a very good idea of how to create and traverse a tree or deal with a queue before you attempt to do it in assembly.
Do not just jump into this project without thinking about it first. This is not a project that can be done quickly in one sitting. It will require time and patience. Also, if you think through your project before creating it and you are still unsuccessful, at least you will be able to provide an in-depth read-me file that will probably provide you with some extra points.
Do not begin writing the decompress function until you have thoroughly tested the tcreate function and are sure that it works. There is no way that the decompress function will work if the tcreate function is not working properly.
This project took very much programming to create. It would be very beneficial for you to go through all of the code and attempt to understand it. This project really is what you make of it and spending some extra time on it could not possibly hurt.
Testing:
You will be provided two test-case files. These files are for your own use and are named testcase1.s and testcase2.s. At the top of each one of the test case files will be a comment that will inform you of the result your functions should output. Remember all you have to do to these files is add your functions to the designated location. If your function works appropriately, you should obtain its output during run-time.
C Code:
The Huffman compression code that was used as a prototype for this project is provided for your convenience: compress.h, compression.c, queue.c, tree.c. You do not need to understand this code in order to complete the project. If you understand everything explained above, then you can do the project. Also, this code implements an entire compression, decompression and all the required data structures. You are only responsible for decompression.
Turn-in:
When you are satisfied with your functions, add them to the turn-in file, which is named turnin.s. Add them exactly the same way that you would add it to the test cases and don’t worry. If they worked for all of the test cases, you should be fine. After you have appropriately modified turnin.s, add a README section that contains a verbal description of how your functions work as well as any problems you had and were unable to resolve. The site to submit turnin.s can be found at: http://www.ece.ucsb.edu/~kastner/ece15b/project3.php.
Good Luck!!