Skip to page content. |Motivate home.  
Google



Back to : Text, Bytes and Videotape Main Page

Display maths using:

fonts images
Need help?

Text, Bytes and Videotape

Task 2: Answers

Using A = 1, B = 011, C = 010, D = 001, E = 000 (which is another Huffman coding for the example in Olly's talk) to work out what 01101011000011 means (starting from the left), we might start by assuming that the words come in blocks of three as far as possible, since the only word with just one bit in it is A (1), giving:
011 / 010 / 11000011
At this point we hit a snag, because there isn't a word 110 in our code, so the first 1 at least must be an A. The second 1 must also be an A as we don't have 100 in our code either. So altogether we have:

011 010 1    1    000 011
B C A A E B


Even if we hadn't started looking for codewords of length 3, we couldn't have done anything else, since 0 on its own isn't a codeword. This is what is meant by a code being decipherable - that when you decode it the message is unambiguous.

The possible codings for the probabilities p(A) = 1/2, p(B) = 1/8, p(C) = 1/8, p(D) = 1/8, p(E) = 1/8, are as follows:

A = 0 B, C, D, E one each of {100, 101, 110, 111}
A = 1 B, C, D, E one each of {000, 001, 010, 011}

In each case the average word length is the same, ie. 2.

Using the same algorithm to find a Huffman coding for these probabilities:
p(a) = 0.34; p(b) = 0.12; p(c) = 0.05; p(d) = 0.17; p(e) = 0.10; p(f) = 0.16; p(g) = 0.06,
one way to start is to put the letters into order according to their probability, with the least probable on the left, then to put the first two together. On the next row, reorder the letters, with the new unit put into its correct place, according to its probability. Repeat this until all the letters are accounted for. At each stage, the letters which have been bracketed together are shown in bold type.

c

g

e

b

f

d

a

0.05

0.06

0.10

0.12

0.16

0.17

0.34

e
cg
b
f
d
a
0.10
0.11
0.12
0.16
0.17
0.34
b
f
d
ecg
a
 
0.12
0.16
0.17
0.21
0.34
d
ecg
bf
a
0.17
0.21
0.28
0.34
bf
a
decg
       
0.28
0.34
0.38
       
decg
bfa
         
0.38
0.62
         
decgbfa
           
1
           

This gives the following Huffman tree:

Huffman tree for the preceding table of probabilities

From this we can read off the codeword for each letter:

a b c d e f g
11 100 0110 00 010 101 0111
0.34 0.12 0.05 0.17 0.10 0.16 0.06

The bottom row of this table gives the probability of each letter occurring, for comparison. We can see that the Huffman coding gives the letters with the highest probability of occurring the shortest codewords (a and d have length 2), and the letters with the smallest probability of occurring have the longest codewords (c and g have length 4). The average length for a codeword in this coding is:

\begin{displaymath}0.34 x 2 + 0.12 x 3 + 0.05 x 4 +
0.17 x 2 + 0.10 x 3 + 0.16 x 2 + 0.06 x 4 = 2.6 \end{displaymath}

Calculating the mean length shows why it is important to have the shortest codewords associated with the letters which occur most frequently, and vice versa.

Here is an alternative coding made up to see if it is better than the Huffman coding:

a b c d e f g
0 01 100 1 10 00 11
0.34 0.12 0.05 0.17 0.10 0.16 0.06

The lengths of the codewords were chosen to minimise the mean length of a letter, which works out at 1.2, much less than that of the Huffman coding. This would be a better coding than the Huffman coding, if it is decipherable.

This is a random selection of letters from {a, b, c, d, e, f, g} (generated using the random integer function of a graphic calculator) coded using both codes:

Alternative coding:     00011000110010100110011110110010000100101

Huffman coding          1011000110100001010011011000011000011101000011001010111011000100

The Huffman coding is clearly longer, which coresponds to the average word length being longer, so on that basis the Alternative coding would be preferable. What happens when we try to decipher it?

a b c d e f g
0 01 100 1 10 00 11

0 / 0 / 0 / 1 / 1 / 0 / 0 / 0 / 110010100110011110110010000100101 = a a a d d a a a ...
or should that be
00 / 01 / 100 / 01 / 100 / 10100110011110110010000100101 = f b c b c ...
or even
00 / 0 / 11 / 0 / 0 / 01 / 10010100110011110110010000100101 = f a g a a b ...
and there are many other ways to decode this. Clearly although this code has the benefit of giving short codewords, it has the disadvantage of not being unambiguously decipherable.

Compare this with the Huffman coding:

a b c d e f g
11 100 0110 00 010 101 0111

101 / 100 / 0110 / 100 / 00 / 101 / 00 / 11 / 0110 / 00 / 0110 / 00 / 0111 / 010 / 00 / 0110 / 010 / 101 / 11 / 0110 / 00 / 100 = f b c b d f d a c d c d g e d c e f a c d b

This is precisely the original message, and at no point was there the option of decoding it differently (try it out for yourself).

 

Return to Task 2
Return to Olly's talk

 

 top of page

Back to : Text, Bytes and Videotape Main Page
contact | accessibility
© 2002 Millennium Mathematics Project, University of Cambridge