Python, understanding Huffman code

Question 1

I am trying to understand a Huffman code written in python by 'Rosetta code'. Following is small part from the code.

def encode(symb2freq):
 heap = [[wt, [sym, ""]] for sym, wt in symb2freq.items()] #What does this do?

I'm assuming variable heap is a list. But what is wt and sym?

Question 2

sym looks like the symbols (the keys of the dict coming as parameter) and wt their weights (values associated)

Question 3

Link to the code, by the way.

Question 4

as a heads up, it's poor etiquette to pose a question and not respond to any of the answers. Either comment or click a check mark to close this issue out.

Question 5

sym2freq is a dictionary with keys some symbols and their values the symbol's frequency. For example, if you have the string 'aaabacba', the dictionary will look like this

sym2freq = {'a': 5, 'b': 2, 'c': 1}

That's because we have 5 times the letter a, 2 times the letter b and 1 time the letter c.

Dictionaries have the method items(), which will return a tuple of each key with its respective value. In our case, it would return

>>> sym2freq.items()
(('a', 5), ('b', 2), ('c', 1))

The for sym, wt in symb2freq.items() part of the comprehension list is just unpacking. Every time we fetch one of our tuples from above, we assign the first object to the variable sym and the second to the variable wt. The names sym and wt were chosen purely to be reflective of the values the represent, i.e. symbol and weight (frequency).

>>> sym, wt = ('a', 5)
>>> print sym
'a'
>>> print wt
5

And since the list comprehension will create lists of the structure [wt, [sym, ""]], you'll end up with the list

>>> encode(sym2freq)
[[5, ['a', '']], [2, ['b', '']], [1, ['c', '']]]

The reason why we go from a dictionary of symbol frequencies to a structure like the list heap is so we can sort out symbols in terms of their frequencies, which, as you are learning, is part of constructing the Huffman tree.

Question 6

Thank you @Reti43 very much, brilliantly explained!

Question 7

Yes. That's correct. (author).

Question 8

This is a list comprehension. This is saying

Get the items of symb2freq and start to loop over them.
Get the first item in symb2freq and unpack them into variables sym and wt.
Now add [wt, [sym, ""]] to the list
Do this for each item
Now put the list in the variable heap

For example, [bar(x) for x in foo] makes a list that applies bar(x) to each value in the list.

Question 9

This is an interesting problem. Imagine a dictionary with five letters where the key is the letter and the value is the frequency:

a = {'a':1, 'b':2, 'c':3, 'd':4, 'e':5}

then mapping it so that it's a list of lists with frequency, and a list starting with the letter into the heap variable like this:

[[1, ['a', '']], [2, ['b', '']], [3, ['c', '']], [4, ['d', '']], [5, ['e', '']]]

then when you heapify it, you get:

[[1, ['a', '']],
 [3, ['c', '']],
 [2, ['b', '']],
 [5, ['e', '']],
 [4, ['d', '']]]

Just work things out step by step with a valid dictionary to encode like the one given on the wikipedia page and you'll be able to see what each step does. Or use a trivial example to start like the one I gave and slowly grow the number of elements until you're working with a real document.

What the rosetta code is doing is iterating through the heap and adding digits to the blank string you see after the letter based on frequency, until each letter is mapped to a binary interpretation. So in the end you'd having something like this:

 [['e', '101'], 
 ['d', '010'], 
 ['c', '1001'], 
 ['b', '1100'], 
 ['a', '1101']]

where the most frequent letters require the fewest bits.

Question 10

sym is symbol, and wt is weight.

https://en.wikipedia.org/wiki/Huffman_coding

Reti43 9,8273 gold badges30 silver badges48 bronze badges · Accepted Answer · 2015-12-16 23:09:46Z

sym2freq is a dictionary with keys some symbols and their values the symbol's frequency. For example, if you have the string 'aaabacba', the dictionary will look like this

sym2freq = {'a': 5, 'b': 2, 'c': 1}

That's because we have 5 times the letter a, 2 times the letter b and 1 time the letter c.

Dictionaries have the method items(), which will return a tuple of each key with its respective value. In our case, it would return

>>> sym2freq.items()
(('a', 5), ('b', 2), ('c', 1))

The for sym, wt in symb2freq.items() part of the comprehension list is just unpacking. Every time we fetch one of our tuples from above, we assign the first object to the variable sym and the second to the variable wt. The names sym and wt were chosen purely to be reflective of the values the represent, i.e. symbol and weight (frequency).

>>> sym, wt = ('a', 5)
>>> print sym
'a'
>>> print wt
5

And since the list comprehension will create lists of the structure [wt, [sym, ""]], you'll end up with the list

>>> encode(sym2freq)
[[5, ['a', '']], [2, ['b', '']], [1, ['c', '']]]

The reason why we go from a dictionary of symbol frequencies to a structure like the list heap is so we can sort out symbols in terms of their frequencies, which, as you are learning, is part of constructing the Huffman tree.

CollectivesTM on Stack Overflow

Python, understanding Huffman code

4 Answers 4

2 Comments

Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

CollectivesTM on Stack Overflow

4 Answers 4

2 Comments

Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related