1

I'm writing a python program that implements the Huffman Compression. However, it seems that I can only read / write to bin file byte by byte instead of bit by bit. Is there any workaround for this problem? Wouldn't processing byte by byte defeat the purpose of compression since extraneous padding would be needed. Also, it'd be great if someone can enlighten me about the application of Huffman Compression with regards to this byte-by-byte problem. w

asked Mar 29, 2018 at 3:22
4
  • This is kind of confusing, do you want to solve this by reading bit-by-bit or by modifying your Huffman decoding so it takes bytes? Commented Mar 29, 2018 at 3:30
  • I'm not sure what to do, because Huffman coding takes bit-by-bit. Computer programs, however, handle memory byte-by-byte. Commented Mar 29, 2018 at 3:41
  • Well not necessarily, there are plenty of techniques that let a Huffman decoding routine work with whole bytes (but they aren't tree-walking and that's likely what you had in mind?) Commented Mar 29, 2018 at 3:41
  • I'm new to Huffman, so any link for more information on that byte-routine would be appreciated! Commented Mar 29, 2018 at 3:44

2 Answers 2

2

A potential way to only have to read bytes is by buffering directly in the decoding routine. This combines well with table-based decoding, and does not have the overhead of ever doing bit-by-bit IO (hiding that with layers of abstraction doesn't make it go away, just wipes it under the carpet).

In the simplest case, table based decoding needs a "window" of the bit stream that is as large as1 the largest possible code (incidentally this sort of thing is a large part of the reason why many formats that use Huffman compression specify a maximum code length that isn't super long2), which can be created by shifting a buffer to the right until it has the correct size:

window = buffer >> (maxCodeLen - bitsInBuffer)

Since this gets rid of excess bits anyway, it is safe to append more bits than strictly necessary to the buffer when there are not enough:

while bitsInBuffer < maxCodeLen:
 buffer = (buffer << 8) | readByte()
 bitsInBuffer += 8

Thus byte-IO is sufficient. Actually you could read slightly bigger blocks (eg two bytes at the time) if you wanted. By the way there is a slight complication here: if all bytes of a file have been read and the buffer does not have enough bits in it (which is a legitimate condition that can happen for valid bitstreams) you just have to fill with "padding" (basically shift left without ORing in new bits).

Decoding itself could look like this:

# this line does the actual decoding
(symbol, length) = table[window]
# remove that code from the buffer
bitsInBuffer -= length
buffer = buffer & ((1 << bitsInBuffer) - 1)
# use decoded symbol

This is all very easy, the hard part is constructing the table. One way to do it (not a great way, but a simple way) is to take every integer from 0 up to and including (1 << maxCodeLen) - 1 and decoding the first symbol in it using bit-by-bit tree-walking the way you're used to. A faster way is taking every symbol/code pair and using it to fill the right entries of the table:

# for each symbol/code do this:
bottomSize = maxCodeLen - codeLen
topBits = code << bottomSize
for bottom in range(0, (1 << bottomSize) - 1):
 table[topBits | bottom] = (symbol, codeLen)

By the way none of this code has been tested, it's just to show roughly how it might be done. It also assumes a particular way of packing the bitstream into bytes, with the first bit in the top of the byte.


1: some multi-stage decoding strategies are able to use a smaller window, which may be required if there is no bound on the code length.

2: eg 15 bits max for Deflate

answered Mar 29, 2018 at 4:12
Sign up to request clarification or add additional context in comments.

Comments

0

Layer your code. Have a bottom io layer that does all file reads and writes either entire file at once or with buffering. Have a layer above that which processes the Huffman code bitstream by bits.

answered Mar 29, 2018 at 3:40

Comments

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.