I am trying to understand a Huffman code written in python by 'Rosetta code'. Following is small part from the code.
def encode(symb2freq):
heap = [[wt, [sym, ""]] for sym, wt in symb2freq.items()] #What does this do?
I'm assuming variable heap is a list. But what is wt and sym?
-
sym looks like the symbols (the keys of the dict coming as parameter) and wt their weights (values associated)Gerard Rozsavolgyi– Gerard Rozsavolgyi2015年12月16日 22:45:07 +00:00Commented Dec 16, 2015 at 22:45
-
1Link to the code, by the way.Reti43– Reti432015年12月16日 23:13:17 +00:00Commented Dec 16, 2015 at 23:13
-
as a heads up, it's poor etiquette to pose a question and not respond to any of the answers. Either comment or click a check mark to close this issue out.Rob– Rob2015年12月17日 01:33:28 +00:00Commented Dec 17, 2015 at 1:33
4 Answers 4
sym2freq is a dictionary with keys some symbols and their values the symbol's frequency. For example, if you have the string 'aaabacba', the dictionary will look like this
sym2freq = {'a': 5, 'b': 2, 'c': 1}
That's because we have 5 times the letter a, 2 times the letter b and 1 time the letter c.
Dictionaries have the method items(), which will return a tuple of each key with its respective value. In our case, it would return
>>> sym2freq.items()
(('a', 5), ('b', 2), ('c', 1))
The for sym, wt in symb2freq.items() part of the comprehension list is just unpacking. Every time we fetch one of our tuples from above, we assign the first object to the variable sym and the second to the variable wt. The names sym and wt were chosen purely to be reflective of the values the represent, i.e. symbol and weight (frequency).
>>> sym, wt = ('a', 5)
>>> print sym
'a'
>>> print wt
5
And since the list comprehension will create lists of the structure [wt, [sym, ""]], you'll end up with the list
>>> encode(sym2freq)
[[5, ['a', '']], [2, ['b', '']], [1, ['c', '']]]
The reason why we go from a dictionary of symbol frequencies to a structure like the list heap is so we can sort out symbols in terms of their frequencies, which, as you are learning, is part of constructing the Huffman tree.
This is a list comprehension. This is saying
- Get the items of
symb2freqand start to loop over them. - Get the first item in
symb2freqand unpack them into variablessymandwt. - Now add
[wt, [sym, ""]]to the list - Do this for each item
- Now put the list in the variable
heap
For example, [bar(x) for x in foo] makes a list that applies bar(x) to each value in the list.
Comments
This is an interesting problem. Imagine a dictionary with five letters where the key is the letter and the value is the frequency:
a = {'a':1, 'b':2, 'c':3, 'd':4, 'e':5}
then mapping it so that it's a list of lists with frequency, and a list starting with the letter into the heap variable like this:
[[1, ['a', '']], [2, ['b', '']], [3, ['c', '']], [4, ['d', '']], [5, ['e', '']]]
then when you heapify it, you get:
[[1, ['a', '']],
[3, ['c', '']],
[2, ['b', '']],
[5, ['e', '']],
[4, ['d', '']]]
Just work things out step by step with a valid dictionary to encode like the one given on the wikipedia page and you'll be able to see what each step does. Or use a trivial example to start like the one I gave and slowly grow the number of elements until you're working with a real document.
What the rosetta code is doing is iterating through the heap and adding digits to the blank string you see after the letter based on frequency, until each letter is mapped to a binary interpretation. So in the end you'd having something like this:
[['e', '101'],
['d', '010'],
['c', '1001'],
['b', '1100'],
['a', '1101']]
where the most frequent letters require the fewest bits.
Comments
sym is symbol, and wt is weight.