I need to search some data encoded by a Huffman tree, but not by simply walking the tree: I need test combinations of the data for a property and continue searching based on whether the test is positive or not. To this end, I've also returned the intermediate steps of the Huffman algorithm>
Here is the code I use to generate the extended tree:
import heapq
def encode(symbfreq):
tree = [[wt, [sym, ""]] for sym, wt in symbfreq]
heapq.heapify(tree)
while len(tree)>1:
lo, hi = sorted([heapq.heappop(tree), heapq.heappop(tree)], key=len)
for pair in lo[1:]:
pair[1] = '0' + pair[1]
for pair in hi[1:]:
pair[1] = '1' + pair[1]
heapq.heappush(tree, [lo[0] + hi[0]] + lo[1:] + hi[1:])
return sorted(heapq.heappop(tree)[1:], key=lambda p: (len(p[-1]), p))
def next_power_of_two(n):
return int(2**( ceil(log(n,2))))
def full_encode(tree):
huffman_tree = encode(tree)
complete_tree = huffman_tree
get_intermediate_node = lambda val, arr : ''.join( [ char for char,binary in itertools.ifilter( lambda node : node[1].startswith(val),arr)] )
for val in range( next_power_of_two( len(huffman_tree) ) ):
bvalue = bin(val)[2:]
node = [ get_intermediate_node( bvalue , huffman_tree) , bvalue ]
if node not in complete_tree:
complete_tree.append( node)
complete_tree =[y for y in complete_tree if y[0]!='']
complete_tree = sorted( complete_tree , key=lambda p: (len(p[-1]), p) )
return complete_tree
So for example this input:
tree = [('0',0.25),('0',0.25),('0',0.25),('0',0.125),('1',0.125)]
produces this output:
tree = [['00', '0'], ['0', '00'], ['0', '01'], ['0', '10'], ['0', '110'], ['1', '111']]
Once I've done that, I need to search the tree. Since I've got access to all the intermediate stages, I start by checking whether the data in the first leaf contains a '1' (this leaf is encoded by '0'): if this is true then I check the next leaf which has a '0' in the second position of it's code. If false, I check the leaf whose begins with '10'. I keep on doing this until I've found the leaf (and the encoding) which has only a '1' in the data. The code is below:
#searching an extended huffman list 0=>left branch 1=>right branch
def search_huff_list(complete_tree, max_depth):
defective = 0
loops = 0
stage = 0
code = ['0']
while defective == 0:
loops += 1
current = complete_tree[stage]
print(stage)
if current[0] == '1':
defective = complete_tree[stage]
return defective,loops
if len(current[1]) == max_depth:
if current[0]=='1':
defective = complete_tree[stage]
return defective, loops
else:
defective = complete_tree[stage+1]
return defective, loops
if not '1' in current[0]:
code[-1] = '1'
code.append('0')
partial_code = ''.join(code)
print(partial_code)
stage = complete_tree.index(next(x for x in complete_tree if x[1].startswith(partial_code) ) )
else:
code.append('0')
partial_code = ''.join(code)
stage = complete_tree.index(next(x for x in complete_tree if x[1].startswith(partial_code) ) )
return 0
For the input above, this algorithm finds the '1' (it's easier to debug, if the labels are 'a','b','c','d','e'). I'm building up a partial code and searching the original tree (the extended list) for the first code that begins with that series of bits.
The main problem I have is that this is that this algorithm is complicated enough already: yet I've got to get it to find a 1 in a random tree next. There's already enough corner-case if-else catching going on (for example if I get right down to the end, and the test on the left branch comes back negative, then I know the '1' is in the right branch and I don't need to do another test).
What could I do to make the code more readable/easier to debug? I guess what I'm actually doing isn't that efficient, but I'm struggling to think of another way to do it.
-
\$\begingroup\$ I don't understand what you are trying to achieve here. Can you explain the problem you are trying to solve? \$\endgroup\$Gareth Rees– Gareth Rees2013年11月06日 11:31:34 +00:00Commented Nov 6, 2013 at 11:31
-
\$\begingroup\$ @GarethRees Hi Gareth, I've modified the data structure so that the groups are encoded in the Huffman tree. The new code can be found here: stackoverflow.com/questions/19817863/sorting-huffman-leaves \$\endgroup\$Tom Kealy– Tom Kealy2013年11月06日 21:53:57 +00:00Commented Nov 6, 2013 at 21:53
-
\$\begingroup\$ @GarethRees I'm trying to solve a variant of the 12-coin problem, but where you believe some coins are more likely to be counterfeit than others - I'm encoding that info in a Huffman tree, and then looking for a sorting/searching algorithm to find the counterfeit. en.wikipedia.org/wiki/Counterfeit_coin_problem \$\endgroup\$Tom Kealy– Tom Kealy2013年11月06日 21:55:33 +00:00Commented Nov 6, 2013 at 21:55
1 Answer 1
I've change my data structure, so that the Huffman encoding procedure defines the combinations I need to test:
class Node(object):
left = None
right = None
weight = None
data = None
code = ''
length = len(code)
def __init__(self, d, w, c):
self.data = d
self.weight = w
self.code = c
def set_children(self, ln, rn):
self.left = ln
self.right = rn
def __repr__(self):
return "[%s,%s,(%s),(%s)]" %(self.data,self.code,self.left,self.right)
def __cmp__(self, a):
return cmp(self.code, a.code)
def __getitem__(self):
return self.code
def encode(symbfreq):
tree = [Node(sym,wt,'') for sym, wt in symbfreq]
heapify(tree)
while len(tree)>1:
lo, hi = sorted([heappop(tree), heappop(tree)])
lo.code = '0'+lo.code
hi.code = '1'+hi.code
n = Node(lo.data+hi.data,lo.weight+hi.weight,lo.code+hi.code)
n.set_children(lo, hi)
heappush(tree, n)
return tree[0]
Then I search the groups using a function like this:
def search(tree):
current = tree.right
loops = 0
#defective = 0
while current is not None:
#print(current)
loops+=1
previous = current
if current.data == 1:
current = current.left
else:
current = current.right
return loops,previous
There are still some changes I need to make: the data field needs to be a set (not the sum that's there currently), and I need to change the encoding strategy so that the left branch of each node always has a '0' appended to its code regardless of permutations of the labels of the input distribution.