Decoding a Huffman code with a dictionary

Question 1

I need to decode a Huffman code I coded with my program using a file containing the translation beetween ASCII and Huffman bits. I have already a dictionary in the progam from "codes" to ASCII like this one:

{'01110': '!', '01111': 'B', '10100': 'l', '10110': 'q', '10111': 'y'}

I created the function:

def huffmanDecode (dictionary, text) :

That needs the dictionary and the code. I have tried searching the text for key in the dictionary and using both the replace method form string and the sub from re but neither of them decodes the message properly. for example if the code is:

011111011101110

It should be straightforward to decode it to:

By!

But I haven't been able to do it by iterating over the code and searching matches in the dictionary!

How can I decode the code using the keys and their values in the dictionary by finding the key inside the text and replacing it by its value?

Any help is greatly appreciated.

Question 2

Show your decoding code then. Manually: 011111011101110 -> 01111 = B, 10111 = y, and 01110 = !, all correct according to your own table.

Question 3

using the bitarray module you get huffman en-/de-coding for free and probably more efficient than anything else:

from bitarray import bitarray
huffman_dict = {
 '!': bitarray('01110'), 'B': bitarray('01111'),
 'l': bitarray('10100'), 'q': bitarray('10110'),
 'y': bitarray('10111')
}
a = bitarray()
a.encode(huffman_dict, 'Bylly!')
print(a)
dec = bitarray('011111011101110').decode(huffman_dict)
print(dec)
print(''.join(dec))
# # output:
# bitarray('011111011110100101001011101110')
# ['B', 'y', '!']
# By!

if you do not want to install the module, read the part below.

here is a variant using the huffman tree to decode - the program works but there may be better variants to represent a binary tree (i chose a tuple).

this version may be better suited when your codewords are not all of the same length. the other nice thing about the binary tree is that here it is obvious that the code is prefix-free.

your code in tree-form looks like this (overly indented in order to make the tree-structure visible):

huffman_tree = \
 ( # 0
 ( # 00
 None,
 # 01
 ( # 010
 None,
 # 011
 ( # 0110
 None,
 # 0111
 ( # 01110
 '!',
 # 01111
 'B')))),
 # 1
 ( # 10
 ( # 100
 None,
 # 101
 ( # 1010
 ( # 10100
 'l',
 # 10101
 None
 ),
 # 1011
 ( # 10110
 'q',
 # 10111
 'y'))),
 # 11
 None))

using that you can then decode with:

def huffman_decode(strg):
 ret = ''
 cur_node = huffman_tree
 for char in strg:
 cur_node = cur_node[int(char)]
 if cur_node is None:
 raise ValueError
 elif isinstance(cur_node, str):
 ret += cur_node
 cur_node = huffman_tree
 return ret
print(huffman_decode('011111011101110'))

if the decoding hits None some error occurred and ValueError is raised. as soon as the decoding hits a string, the current node cur_node is reset to the 'root node' and the game begins from the beginning of the tree.

and just because i can: here is a visual display of your (incomplete) huffman tree (this may help understand what the algorithm does: whenever you encounter a 0: go right+down; whenever you encounter a 1: go right+up); if you hit an end-node: return the character at that node and restart at the root node. binary huffman tree

Question 4

Nice, learned something new. But why do you thing the re.match/startswith approach does not work for variable length bit sequences? Seems to work fine for me. Isn't the idea behind huffman codes that no code is a prefix of another?

Question 5

@tobias_k sure, the prefixes are all different. but if you have transmission errors (and a huffman code that allows re-syncing) i found the tree approach more suited. but you are right: deleted the 'regex not suited' part from my answer... did not consider efficiency though. did you?

Question 6

Not sure what you tried, but re.sub or replace probably did not work because they do not necessarily replace from the beginning of the string. You have to see what code the string starts with, then replace that code, and proceed with the rest of the string.

For example, like this:

def huffmanDecode (dictionary, text):
 res = ""
 while text:
 for k in dictionary:
 if text.startswith(k):
 res += dictionary[k]
 text = text[len(k):]
 return res

Or recursively:

def huffmanDecode (dictionary, text):
 if text:
 k = next(k for k in dictionary if text.startswith(k))
 return dictionary[k] + huffmanDecode(dictionary, text[len(k):])
 return ""

You could also make a regex from your codes and use re.match to find the next one:

import re
def huffmanDecode (dictionary, text):
 p = "|".join(dictionary) # join keys to regex
 res = ""
 while text:
 m = re.match(p, text)
 res += dictionary[m.group()]
 text = text[len(m.group()):]
 return res

Note: Neither of those have proper error handling and will fail or loop forever if the codes do not match the message, but this should get you started.

Question 7

THANKKKKK YOUUUUU!!!! You're a life saver man! I had been working in this since Friday and could not figure it out! You were right my logic with replace was not appropriate, the key was searching from the beginning.

hiro protagonist 47.4k17 gold badges93 silver badges119 bronze badges · Accepted Answer · 2015-10-12 21:25:23Z

using the bitarray module you get huffman en-/de-coding for free and probably more efficient than anything else:

from bitarray import bitarray
huffman_dict = {
 '!': bitarray('01110'), 'B': bitarray('01111'),
 'l': bitarray('10100'), 'q': bitarray('10110'),
 'y': bitarray('10111')
}
a = bitarray()
a.encode(huffman_dict, 'Bylly!')
print(a)
dec = bitarray('011111011101110').decode(huffman_dict)
print(dec)
print(''.join(dec))
# # output:
# bitarray('011111011110100101001011101110')
# ['B', 'y', '!']
# By!

if you do not want to install the module, read the part below.

here is a variant using the huffman tree to decode - the program works but there may be better variants to represent a binary tree (i chose a tuple).

this version may be better suited when your codewords are not all of the same length. the other nice thing about the binary tree is that here it is obvious that the code is prefix-free.

your code in tree-form looks like this (overly indented in order to make the tree-structure visible):

huffman_tree = \
 ( # 0
 ( # 00
 None,
 # 01
 ( # 010
 None,
 # 011
 ( # 0110
 None,
 # 0111
 ( # 01110
 '!',
 # 01111
 'B')))),
 # 1
 ( # 10
 ( # 100
 None,
 # 101
 ( # 1010
 ( # 10100
 'l',
 # 10101
 None
 ),
 # 1011
 ( # 10110
 'q',
 # 10111
 'y'))),
 # 11
 None))

using that you can then decode with:

def huffman_decode(strg):
 ret = ''
 cur_node = huffman_tree
 for char in strg:
 cur_node = cur_node[int(char)]
 if cur_node is None:
 raise ValueError
 elif isinstance(cur_node, str):
 ret += cur_node
 cur_node = huffman_tree
 return ret
print(huffman_decode('011111011101110'))

if the decoding hits None some error occurred and ValueError is raised. as soon as the decoding hits a string, the current node cur_node is reset to the 'root node' and the game begins from the beginning of the tree.

and just because i can: here is a visual display of your (incomplete) huffman tree (this may help understand what the algorithm does: whenever you encounter a 0: go right+down; whenever you encounter a 1: go right+up); if you hit an end-node: return the character at that node and restart at the root node. binary huffman tree

Nice, learned something new. But why do you thing the re.match/startswith approach does not work for variable length bit sequences? Seems to work fine for me. Isn't the idea behind huffman codes that no code is a prefix of another?
@tobias_k sure, the prefixes are all different. but if you have transmission errors (and a huffman code that allows re-syncing) i found the tree approach more suited. but you are right: deleted the 'regex not suited' part from my answer... did not consider efficiency though. did you?

CollectivesTM on Stack Overflow

Decoding a Huffman code with a dictionary

2 Answers 2

2 Comments

1 Comment

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

CollectivesTM on Stack Overflow

2 Answers 2

2 Comments

1 Comment

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related