Reading bit by bit for Huffman Compression

Question 1

I'm writing a Python program that implements Huffman compression. However, it seems that I can only read from and write to a binary file byte by byte, not bit by bit. Is there a workaround for this? Wouldn't processing byte by byte defeat the purpose of compression, since extraneous padding would be needed? It would also be great if someone could explain how Huffman compression is applied in practice given this byte-by-byte limitation.

Question 2

This is a bit confusing. Do you want to solve this by reading bit by bit, or by modifying your Huffman decoding so that it takes bytes?

Question 3

I'm not sure what to do, because Huffman coding works bit by bit, whereas computer programs handle memory byte by byte.

Question 4

Not necessarily. There are plenty of techniques that let a Huffman decoding routine work with whole bytes (but they aren't tree-walking, which is likely what you had in mind).

Question 5

I'm new to Huffman coding, so any link with more information on that byte-based routine would be appreciated!

Question 6

A potential way to only have to read bytes is to buffer directly in the decoding routine. This combines well with table-based decoding and avoids the overhead of ever doing bit-by-bit I/O (hiding that behind layers of abstraction doesn't make it go away, it just sweeps it under the carpet).

In the simplest case, table-based decoding needs a "window" of the bit stream that is as large as¹ the longest possible code (incidentally, this is a large part of the reason why many formats that use Huffman compression specify a maximum code length that isn't very long²). The window can be created by shifting the buffer to the right until it has the correct size:

window = buffer >> (bitsInBuffer - maxCodeLen)

Since this gets rid of excess bits anyway, it is safe to append more bits than strictly necessary to the buffer when there are not enough:

while bitsInBuffer < maxCodeLen:
 buffer = (buffer << 8) | readByte()
 bitsInBuffer += 8

Thus byte I/O is sufficient. You could also read slightly bigger blocks (e.g. two bytes at a time) if you wanted. There is a slight complication: if all bytes of the file have been read and the buffer does not hold enough bits (a legitimate condition that can happen for valid bitstreams), you just have to fill with "padding" (basically shift left without ORing in new bits).

Decoding itself could look like this:

# this line does the actual decoding
(symbol, length) = table[window]
# remove that code from the buffer
bitsInBuffer -= length
buffer = buffer & ((1 << bitsInBuffer) - 1)
# use decoded symbol
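Putting the refill, window and lookup steps together, a minimal sketch of a complete decode loop might look like the following. The function name `decode`, the `symbol_count` parameter and the tiny three-symbol code are assumptions made for illustration; in particular, storing the number of symbols separately is just one possible way to tell real data from padding.

```python
def decode(data, table, max_code_len, symbol_count):
    """Decode symbol_count symbols from data using a flat lookup table.

    table has 1 << max_code_len entries of (symbol, code_length).
    symbol_count is an assumption of this sketch: the decoder needs some
    way to know where the real data ends and the padding begins.
    """
    buffer = 0          # unread bits; the oldest bit is the highest valid bit
    bits_in_buffer = 0  # number of valid low bits in buffer
    pos = 0             # index of the next unread byte in data
    out = []
    for _ in range(symbol_count):
        # Refill with whole bytes; once data runs out, shift in zero padding.
        while bits_in_buffer < max_code_len:
            next_byte = data[pos] if pos < len(data) else 0
            pos += 1
            buffer = (buffer << 8) | next_byte
            bits_in_buffer += 8
        # The window is the top max_code_len valid bits of the buffer.
        window = buffer >> (bits_in_buffer - max_code_len)
        symbol, length = table[window]
        # Remove the bits of the code that was just decoded.
        bits_in_buffer -= length
        buffer &= (1 << bits_in_buffer) - 1
        out.append(symbol)
    return out


# Hypothetical code: a = 0, b = 10, c = 11, so max_code_len is 2.
TABLE = [("a", 1), ("a", 1), ("b", 2), ("c", 2)]
# "abca" is 0 10 11 0, i.e. 010110 followed by two padding bits.
print("".join(decode(bytes([0b01011000]), TABLE, 2, 4)))  # prints "abca"
```

Only whole bytes are ever taken from `data`, so the same loop works unchanged on the result of a plain binary file read.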

This is all very easy; the hard part is constructing the table. One way to do it (not a great way, but a simple one) is to take every integer from 0 up to and including (1 << maxCodeLen) - 1 and decode the first symbol in it using bit-by-bit tree-walking, the way you're used to. A faster way is to take every symbol/code pair and use it to fill the right entries of the table:

# for each symbol/code do this:
bottomSize = maxCodeLen - codeLen
topBits = code << bottomSize
for bottom in range(0, 1 << bottomSize):
 table[topBits | bottom] = (symbol, codeLen)
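As a self-contained sketch of the faster construction, the loop can be wrapped in a function. The name `build_table` and the input format (a dict mapping each symbol to a `(code, code_len)` pair, with the code held as an integer) are assumptions made for this example.

```python
def build_table(codes, max_code_len):
    """Build a flat decoding table with 1 << max_code_len entries.

    codes maps symbol -> (code, code_len), where code is an integer whose
    low code_len bits are the codeword, first bit most significant.
    """
    table = [None] * (1 << max_code_len)
    for symbol, (code, code_len) in codes.items():
        bottom_size = max_code_len - code_len
        top_bits = code << bottom_size
        # Every window that starts with this code decodes to this symbol,
        # whatever the remaining bottom_size bits are.
        for bottom in range(1 << bottom_size):
            table[top_bits | bottom] = (symbol, code_len)
    return table


# Hypothetical code: a = 0, b = 10, c = 11.
codes = {"a": (0b0, 1), "b": (0b10, 2), "c": (0b11, 2)}
table = build_table(codes, 2)
print(table)  # [('a', 1), ('a', 1), ('b', 2), ('c', 2)]
```

Each code of length `code_len` owns `1 << (max_code_len - code_len)` consecutive entries, so for a complete prefix code no entry is left as `None`.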

By the way none of this code has been tested, it's just to show roughly how it might be done. It also assumes a particular way of packing the bitstream into bytes, with the first bit in the top of the byte.
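For the encoding side, a sketch of that packing convention (first bit in the top of the byte) could look like this. The function name `pack_codes` and its input format are assumptions; the point it illustrates is that padding only appears once, at the end of the stream, and costs at most 7 bits.

```python
def pack_codes(code_words):
    """Pack (code, code_len) pairs into bytes, first bit in the top of each byte.

    The final byte is padded with zero bits if the stream does not end on a
    byte boundary, so the overhead is at most 7 bits for the whole stream.
    """
    out = bytearray()
    acc = 0       # bits that have not been written out yet
    acc_bits = 0  # number of valid low bits in acc
    for code, code_len in code_words:
        acc = (acc << code_len) | code
        acc_bits += code_len
        # Emit every complete byte, oldest bits first.
        while acc_bits >= 8:
            acc_bits -= 8
            out.append((acc >> acc_bits) & 0xFF)
        acc &= (1 << acc_bits) - 1
    if acc_bits:
        # Left-align the leftover bits and pad the rest of the byte with zeros.
        out.append((acc << (8 - acc_bits)) & 0xFF)
    return bytes(out)


# Hypothetical code a = 0, b = 10, c = 11: "abca" becomes 010110 plus 2 padding bits.
packed = pack_codes([(0b0, 1), (0b10, 2), (0b11, 2), (0b0, 1)])
print(packed)  # b'X', i.e. the single byte 0b01011000
```

Because the padding bits could themselves look like a valid code, the decoder needs some extra information, for example a stored symbol count or an end-of-stream symbol, to know where to stop.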

1: some multi-stage decoding strategies are able to use a smaller window, which may be required if there is no bound on the code length.

2: e.g. 15 bits max for Deflate

Question 7

Layer your code. Have a bottom I/O layer that does all file reads and writes, either the entire file at once or with buffering. Have a layer above that which processes the Huffman code bitstream bit by bit.
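A minimal sketch of this layering, assuming the whole file has already been read into a `bytes` object by the I/O layer: a generator exposes the bytes as a stream of bits, and a tree-walking decoder consumes that stream. The names `iter_bits` and `decode_bits`, the nested-tuple tree and the `symbol_count` parameter are all assumptions made for illustration.

```python
def iter_bits(data):
    """Bit layer: yield the bits of data, most significant bit of each byte first."""
    for byte in data:
        for shift in range(7, -1, -1):
            yield (byte >> shift) & 1


def decode_bits(bits, tree, symbol_count):
    """Huffman layer: walk the tree one bit at a time.

    tree is a nested tuple (zero_branch, one_branch); anything that is not
    a tuple is a leaf symbol. Truncated input surfaces as StopIteration.
    """
    bits = iter(bits)
    out = []
    while len(out) < symbol_count:
        node = tree
        while isinstance(node, tuple):
            node = node[next(bits)]
        out.append(node)
    return out


# Hypothetical code: a = 0, b = 10, c = 11.
TREE = ("a", ("b", "c"))
data = bytes([0b01011000])  # in practice: the result of f.read() on a file opened with "rb"
print("".join(decode_bits(iter_bits(data), TREE, 4)))  # prints "abca"
```

The file is still read in whole bytes; only the layer above sees individual bits.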

Accepted answer (the text under Question 6 above) by user555045 · 2018-03-29 04:12:24Z
