Huffman encoding issue

Question 1

As an exercise I'm trying to encode some symbols using Huffman trees, but using my own class instead of the built in data types with Python.

Here is my node class:

class Node(object):
 left = None
 right = None
 weight = None
 data = None
 code = ''
 length = len(code)
 def __init__(self, d, w, c):
 self.data = d
 self.weight = w
 self.code = c
 def set_children(self, ln, rn):
 self.left = ln
 self.right = rn
 def __repr__(self):
 return "[%s,%s,(%s),(%s)]" %(self.data,self.code,self.left,self.right)
 def __cmp__(self, a):
 return cmp(self.code, a.code)
 def __getitem__(self):
 return self.code

and here is the encoding function:

def encode(symbfreq):
 tree = [Node(sym,wt,'') for sym, wt in symbfreq]
 heapify(tree)
 while len(tree)>1:
 lo, hi = sorted([heappop(tree), heappop(tree)])
 lo.code = '0'+lo.code
 hi.code = '1'+hi.code
 n = Node(lo.data+hi.data,lo.weight+hi.weight,lo.code+hi.code)
 n.set_children(lo, hi)
 heappush(tree, n)
 return tree[0]

(Note, that the data field will eventually contain a set() of all the items in the children of a node. It just contains a sum for the moment whilst I get the encoding correct).

Here is the previous function I had for encoding the tree:

def encode(symbfreq):
 tree = [[wt, [sym, ""]] for sym, wt in symbfreq]
 heapq.heapify(tree)
 while len(tree)>1:
 lo, hi = sorted([heapq.heappop(tree), heapq.heappop(tree)], key=len)
 for pair in lo[1:]:
 pair[1] = '0' + pair[1]
 for pair in hi[1:]:
 pair[1] = '1' + pair[1]
 heapq.heappush(tree, [lo[0] + hi[0]] + lo[1:] + hi[1:])
 return sorted(heapq.heappop(tree)[1:], key=lambda p: (len(p[-1]), p))

However I've noticed that my new procedure is incorrect: it gives the top nodes the longest codewords instead of the final leaves, and doesn't produce the same tree for permutations of input symbols i.e. the following don't produce the same tree (when run with new encoding function):

input1 = [(1,0.25),(0,0.25),(0,0.25),(0,0.125),(0,0.125)]
input2 = [(0,0.25),(0,0.25),(0,0.25),(1,0.125),(0,0.125)]

I'm finding I'm really bad at avoiding this kind of off-by-one/ordering bugs - how might I go about sorting this out in the future?

Question 2

Can you post some of the bad output? Especially for the last part.

Question 3

Also at a glance I'd suspect that your error is here lo, hi = sorted([heappop(tree), heappop(tree)]). Both of the elements here are Nodes, and you've defined __cmp__ for Nodes as cmp(self.code, a.code) but you haven't set the codes yet so it's always comparing two empty strings. I'm not sure if a heap is really something you need here, aren't you just treating the elements as a list, sorted from smallest to largest? I forget how Huffman encoding works so I'm not sure if that's the right approach.

Question 4

@PatrickCollins I used a heap in my original 'class free' code as it always pops the lowest weight items first (that's how a Huffman Tree is created). The issue was going from the first code to a class, where I made a mistake I couldn't correct - because I did't understand cmp.

Question 5

Again, if the only reason you're using the data structure is to make sure its sorted, I don't see what the advantage over using a sorted list is. The point of the tree data structure in the original algorithm is to give you faster decoding, not encoding.

Question 6

@PatrickCollins I'm using it as the data structure for a statistics algorithm - which won't work for a sorted list.

Question 7

There's more than one oddity ;-) in this code, but I think your primary problem is this:

def __cmp__(self, a):
 return cmp(self.code, a.code)

Heap operations use the comparison method to order the heap, but for some reason you're telling it to order Nodes by the current length of their codes. You almost certainly want the heap to order them by their weights instead, right? That's how Huffman encoding works.

def __cmp__(self, a):
 return cmp(self.weight, a.weight)

For the rest, it's difficult to follow because 4 of your 5 symbols are the same (four 0 and one 1). How can you possibly tell whether it's working or not?

Inside the loop, this is strained:

lo, hi = sorted([heappop(tree), heappop(tree)])

Given the repair to __cmp__, that's easier as:

lo = heappop(tree)
hi = heappop(tree)

Sorting is pointless - the currently smallest element is always popped. So pop twice, and lo <= hi must be true.

I'd say more ;-), but at this point I'm confused about what you're trying to accomplish in the end. If you agree __cmp__ should be repaired, make that change and edit the question to give both some inputs and the exact output you're hoping to get.

More

About:

it gives the top nodes the longest codewords instead of the final leaves,

This isn't an "off by 1" thing, it's more of a "backwards" thing ;-) Huffman coding looks at nodes with the smallest weights first. The later a node is popped from the heap, the higher the weight, and the shorter its code should be. But you're making codes longer & longer as the process goes on. They should be getting shorter & shorter as the process goes on.

You can't do this while building the tree. In fact the codes aren't knowable until the tree-building process has finished.

So, rather than guess at intents, etc, I'll give some working code you can modify to taste. And I'll include a sample input and its output:

from heapq import heappush, heappop, heapify
class Node(object):
 def __init__(self, weight, left, right):
 self.weight = weight
 self.left = left
 self.right = right
 self.code = None
 def __cmp__(self, a):
 return cmp(self.weight, a.weight)
class Symbol(object):
 def __init__(self, name, weight):
 self.name = name
 self.weight = weight
 self.code = None
 def __cmp__(self, a):
 return cmp(self.weight, a.weight)
def encode(symbfreq):
 # return pair (sym2object, tree), where
 # sym2object is a dict mapping a symbol name to its Symbol object,
 # and tree is the root of the Huffman tree
 sym2object = {sym: Symbol(sym, w) for sym, w in symbfreq}
 tree = sym2object.values()
 heapify(tree)
 while len(tree) > 1:
 lo = heappop(tree)
 hi = heappop(tree)
 heappush(tree, Node(lo.weight + hi.weight, lo, hi))
 tree = tree[0]
 def assigncode(node, code):
 node.code = code
 if isinstance(node, Node):
 assigncode(node.left, code + "0")
 assigncode(node.right, code + "1")
 assigncode(tree, "")
 return sym2object, tree
i = [('a', 1), ('b', 2), ('c', 3), ('d', 4), ('e', 5)]
s2o, t = encode(i)
for v in s2o.values():
 print v.name, v.code

That prints:

So, as hoped, the symbols with the highest weights have the shortest codes.

Question 8

thanks! This is an algorithm for coin weighing problems, where the counterfeit is given the value 1. Could I modify the dict to be a tuple as I have repeated values.

Question 9

Sure! No essential use of the dict is made; a dict mapping symbols to Symbol objects would simply be convenient for most applications. You could, for example, change sym2object = ... to symbols = tuple(Symbol(sym, w) for sym, w in symbfreq), and then replace the next line with tree = list(symbols). If you edit your question to give exact examples of inputs and their desired outputs, I can try to give you exact code to get that.

Question 10

thanks for the offer, but I actually managed to get what I needed working myself today.

Question 11

Sorry to bother you again but the sorting was supposed to make sure that the relative position of the original array wasn't changed by the encoding procedure (i.e. the data in the 0th bin is placed in the leftmost leaf). How would I accomplish this? I've edited the question to reflect what I need.

Question 12

Please open a new question - this one is already long and confused ;-) And rather than try to describe what you want, give examples of the exact inputs and outputs you want. I don't understand your description.

Question 13

I suspect the problem is in segment: -

lo.code = '0'+lo.code
hi.code = '1'+hi.code

both hi & low can be intermediate nodes where as you need to update the codes at the leaves where the actual symbols are. I think you should not maintain any codes at construction of the huffman tree but get the codes of individual symbols after the construction of huffman tree by just traversing.

Heres my pseudo code for encode: -

 tree = ConstructHuffman(); // Same as your current but without code field
 def encode(tree,symbol): 
 if tree.data == symbol : 
 return None
 else if tree.left.data.contains(symbol) :
 return '0' + encode(tree.left,symbol)
 else : 
 return '1' + encode(tree.right,symbol)

Calculate codes for all symbol using above encode method and then u can use them for encoding.

Note: Change comparision function from comparing codes to comparing weights.

Tim Peters 71.4k14 gold badges133 silver badges140 bronze badges · Accepted Answer · 2013-11-17 01:53:18Z

There's more than one oddity ;-) in this code, but I think your primary problem is this:

def __cmp__(self, a):
 return cmp(self.code, a.code)

Heap operations use the comparison method to order the heap, but for some reason you're telling it to order Nodes by the current length of their codes. You almost certainly want the heap to order them by their weights instead, right? That's how Huffman encoding works.

def __cmp__(self, a):
 return cmp(self.weight, a.weight)

For the rest, it's difficult to follow because 4 of your 5 symbols are the same (four 0 and one 1). How can you possibly tell whether it's working or not?

Inside the loop, this is strained:

lo, hi = sorted([heappop(tree), heappop(tree)])

Given the repair to __cmp__, that's easier as:

lo = heappop(tree)
hi = heappop(tree)

Sorting is pointless - the currently smallest element is always popped. So pop twice, and lo <= hi must be true.

I'd say more ;-), but at this point I'm confused about what you're trying to accomplish in the end. If you agree __cmp__ should be repaired, make that change and edit the question to give both some inputs and the exact output you're hoping to get.

More

About:

it gives the top nodes the longest codewords instead of the final leaves,

This isn't an "off by 1" thing, it's more of a "backwards" thing ;-) Huffman coding looks at nodes with the smallest weights first. The later a node is popped from the heap, the higher the weight, and the shorter its code should be. But you're making codes longer & longer as the process goes on. They should be getting shorter & shorter as the process goes on.

You can't do this while building the tree. In fact the codes aren't knowable until the tree-building process has finished.

So, rather than guess at intents, etc, I'll give some working code you can modify to taste. And I'll include a sample input and its output:

from heapq import heappush, heappop, heapify
class Node(object):
 def __init__(self, weight, left, right):
 self.weight = weight
 self.left = left
 self.right = right
 self.code = None
 def __cmp__(self, a):
 return cmp(self.weight, a.weight)
class Symbol(object):
 def __init__(self, name, weight):
 self.name = name
 self.weight = weight
 self.code = None
 def __cmp__(self, a):
 return cmp(self.weight, a.weight)
def encode(symbfreq):
 # return pair (sym2object, tree), where
 # sym2object is a dict mapping a symbol name to its Symbol object,
 # and tree is the root of the Huffman tree
 sym2object = {sym: Symbol(sym, w) for sym, w in symbfreq}
 tree = sym2object.values()
 heapify(tree)
 while len(tree) > 1:
 lo = heappop(tree)
 hi = heappop(tree)
 heappush(tree, Node(lo.weight + hi.weight, lo, hi))
 tree = tree[0]
 def assigncode(node, code):
 node.code = code
 if isinstance(node, Node):
 assigncode(node.left, code + "0")
 assigncode(node.right, code + "1")
 assigncode(tree, "")
 return sym2object, tree
i = [('a', 1), ('b', 2), ('c', 3), ('d', 4), ('e', 5)]
s2o, t = encode(i)
for v in s2o.values():
 print v.name, v.code

That prints:

So, as hoped, the symbols with the highest weights have the shortest codes.

thanks! This is an algorithm for coin weighing problems, where the counterfeit is given the value 1. Could I modify the dict to be a tuple as I have repeated values.
Sure! No essential use of the dict is made; a dict mapping symbols to Symbol objects would simply be convenient for most applications. You could, for example, change sym2object = ... to symbols = tuple(Symbol(sym, w) for sym, w in symbfreq), and then replace the next line with tree = list(symbols). If you edit your question to give exact examples of inputs and their desired outputs, I can try to give you exact code to get that.
thanks for the offer, but I actually managed to get what I needed working myself today.
Sorry to bother you again but the sorting was supposed to make sure that the relative position of the original array wasn't changed by the encoding procedure (i.e. the data in the 0th bin is placed in the leftmost leaf). How would I accomplish this? I've edited the question to reflect what I need.
Please open a new question - this one is already long and confused ;-) And rather than try to describe what you want, give examples of the exact inputs and outputs you want. I don't understand your description.

CollectivesTM on Stack Overflow

Huffman encoding issue

2 Answers 2

More

7 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

CollectivesTM on Stack Overflow

2 Answers 2

More

7 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related