# Header size

256*4 bytes is very big for a header. The size could be reduced substantially by using one or several of these common techniques:

  • Store code lengths instead of symbol frequencies. Lengths definitely won't need 32 bits each; 8 would already be a lot, and you can pack them into 4 bits each if you set a length limit of 15. Storing only lengths is not ambiguous, because you can use canonical Huffman codes: there is an easy algorithm to generate them from the table of code lengths, discarding the codes your tree produced (see the sketch after this list).
  • Compress the header with delta encoding: store the length difference between subsequent codes, using a variable-length encoding. Small differences tend to be more common. (Seen in e.g. DEFLATE.)
  • Remove most zero lengths from the header by first storing a sparse bitmap that indicates which symbols occur in the file. (Seen in e.g. bzip2.)
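
For the first point, here is a minimal sketch of the canonical-code construction (the same scheme RFC 1951 describes), assuming a `len[]` array of code lengths where 0 marks an unused symbol; the names `canonical_codes`, `len`, and `code` are illustrative, not taken from your code:

```c
#include <stdint.h>

#define MAX_LEN 15   /* length limit, as suggested above */
#define NSYMS   256

/* Assign canonical codes from a table of code lengths (0 = unused symbol).
 * Codes of the same length are consecutive, in symbol order. */
void canonical_codes(const uint8_t len[NSYMS], uint32_t code[NSYMS])
{
    uint32_t count[MAX_LEN + 1] = {0};
    uint32_t next[MAX_LEN + 1];

    for (int s = 0; s < NSYMS; s++)
        count[len[s]]++;
    count[0] = 0;    /* zero-length entries get no code */

    /* First code of each length: the standard canonical construction. */
    uint32_t c = 0;
    for (int l = 1; l <= MAX_LEN; l++) {
        c = (c + count[l - 1]) << 1;
        next[l] = c;
    }

    for (int s = 0; s < NSYMS; s++)
        if (len[s] != 0)
            code[s] = next[len[s]]++;
}
```

Since the codes are a pure function of the lengths, the decoder can rebuild exactly the same codes from a 4-bit-per-symbol header.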

# Encoding process

Walking up the tree for every byte of the file is needlessly inefficient. You could precompute an array of codes and lengths once, in advance, and then use that array during encoding. Each code can be represented as a single unsigned integer, no array necessary (it won't be that long, and you will want to limit the code lengths anyway for decoding and header reasons). It can be prepended to `buf` with a couple of simple bitwise operations, similar to how you currently add a single bit, except `nbuf++` becomes `nbuf += codelength`. Together this lets encoding a symbol take a constant number of operations instead of scaling linearly with the code length, as sketched below.
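
A rough illustration (not your exact code), assuming MSB-first bit order, the 15-bit length limit from above, and precomputed `code[]`/`len[]` tables; `buf` and `nbuf` are borrowed from your code, but their types here are guesses:

```c
#include <stdint.h>
#include <stdio.h>

/* MSB-first bit buffer, mirroring the buf/nbuf pair in the original code. */
static uint64_t buf  = 0;
static int      nbuf = 0;

void emit_symbol(uint8_t sym, const uint32_t code[256], const uint8_t len[256],
                 FILE *out)
{
    /* Append the whole code in one step instead of one bit per tree level. */
    buf  = (buf << len[sym]) | code[sym];
    nbuf += len[sym];

    /* Flush completed bytes; with a 15-bit length limit, buf cannot overflow. */
    while (nbuf >= 8) {
        nbuf -= 8;
        fputc((int)((buf >> nbuf) & 0xFF), out);
    }
}
```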

# Decoding process

Currently your code implements bit-by-bit tree walking, which is (as one source puts it) dead slow. The alternative is to decode several bits at a time with an array lookup. There are a lot of subtly different ways to do that, but the basis of all of them is that part of the buffer is used to index into a table. Limiting the maximum code length is very useful here: with a limited length you can guarantee that decoding is a single-step process, resolving one symbol from the buffer in a constant number of operations, with no looping. A minimal version is sketched below.
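
A minimal sketch of such a single-step decoder, assuming the canonical `code[]`/`len[]` tables from above and the 15-bit length limit; `peek15()` and `consume()` stand in for whatever bit reader your decoder uses and are hypothetical:

```c
#include <stdint.h>

#define MAX_LEN  15
#define TBL_SIZE (1u << MAX_LEN)

typedef struct { uint8_t sym; uint8_t len; } Entry;

/* Fill the table: every MAX_LEN-bit window whose top bits match a code
 * maps to that code's symbol, so one lookup resolves the next symbol. */
void build_table(Entry tbl[TBL_SIZE],
                 const uint32_t code[256], const uint8_t len[256])
{
    for (int s = 0; s < 256; s++) {
        if (len[s] == 0)
            continue;
        uint32_t first = code[s] << (MAX_LEN - len[s]);
        uint32_t span  = 1u << (MAX_LEN - len[s]);
        for (uint32_t i = 0; i < span; i++)
            tbl[first + i] = (Entry){ (uint8_t)s, len[s] };
    }
}

/* Hypothetical bit-reader helpers over the input buffer. */
extern uint32_t peek15(void);      /* next 15 bits, MSB-first, zero-padded */
extern void     consume(int bits); /* advance the read position */

int decode_symbol(const Entry tbl[TBL_SIZE])
{
    Entry e = tbl[peek15()];  /* single array lookup, no tree walk */
    consume(e.len);
    return e.sym;
}
```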

A relevant source for these techniques is the one quoted in the previous paragraph.
