CSci 3501 UMM - Lab 13

CSci 3501 Algorithms and Computability - Lab 13.

Due Wednesday, December 7 at 11:59pm

What to submit and when:

All submissions are electronic: by e-mail to elenam at morris.umn.edu and CC to all lab partners. Please do not delete your e-mail from "Sent mail" or your mailbox until the end of the semester.
When working on the lab, please comment your work so that it is clear what contributions of each person are.
At the end of the lab each group should send me the results of their in-class work. Please indicate if this is your final submission.
If your submission at the end of the lab time was not final, please send me(CC to the lab partner(s)) a final copy before the due time. Please use the subject "3501 Lab N", where N is the lab number.

Lab assignment

Work in pairs

Lab overview and goals

The goal of the lab is to explore Huffman encoding as a greedy algorithm.

Implementing Huffman encoding (30 points)

Huffman encoding is an approach to data encoding that minimizes storage by assigning shorter codes to more frequent letters and longer codes to more rare ones. The encoding is known as prefix-free which means that no letter code is a prefix of any other one. For instance, if the most frequent letter is encoded as 0 then no other letter code starts with 0. Thus decoding is unambiguous at every point. Huffman codes are described in section 16.3 (p. 428) in CLRS and on the wikipedia page https://en.wikipedia.org/wiki/Huffman_coding Huffman encoding is an optimal encoding (i.e. it produces the shortest encoding of 0s and 1s) for a text that follows the specified letter frequencies.

As an example, consider an alphabet of 5 letters with frequencies given below:

Letter:	a	b	c	d	e
Frequency, %:	51	22	15	7	5

The algorithms for finding Huffman code forms a binary tree in which nodes represent either a single letter or a combination of letters. Initially all letters are placed in a min-priority queue, where the key is the frequency. The algorithms proceeds by combining two nodes with the least frequencies into one that has the combined frequency, and adding it back into the queue. The process continues until there is only one combined node which is the root of the trree. Then each left branch in the tree is denoted by 0 and the right branch by 1. The codes for each letter are the paths to them in the tree.
The textbook or the compression section of the wikipedia page provide more details.

The encoding for the example above is:

Letter:	a	b	c	d	e
Encoding:	1	00	011	0101	0100

As an exercise, encode the strings ace and add according to the above encoding. Then decode the strings 101011 and 0010101. Note how prefix-free encoding makes decoding easy.

Tasks for the lab

Your goal is to write a program that reads in an alphabet with letter frequencies and produces Huffman codes for this letters. It then allows you to encode and decode texts in this alpahbet using the Huffman encoding you have constructed. More specifically:

Your program must read data in the format (on one line):
```
      a 51 b 22 c 15 d 7 e 5
    
```
The frequences are assumed to be in decreasing order and add up to 100%. New line signals end of input.
It should output encodings for each letter.
It should then take inputs in the format:
```
      e add
    
```
Where the first letter is "e" for "encode" and "d" for "decode", and the following word is what you would like to encode or decode. Spaces and other symbols outside the alphabet should be ignored.
Entering letter x on the line by itself terminates the program.

When submitting your program, please explain what its efficiency is for finding Huffman codes (where n is the number of letters in the alphabet). Don't forget to include the efficiency of data structures that you are using, if any.

What to submit

Submit your .java files as attachments, CC your group. Include all your test cases.

CSci 3501 course web site.