# Header size

256*4 bytes is very big for a header. The size could be reduced substantially by using one or several of these common techniques:

  • Store code lengths instead of symbol frequencies. Lengths definitely won't need 32 bits each; 8 would already be a lot, and you can pack them into 4 bits each if you set a length limit of 15. Storing only lengths is not ambiguous, because you can use canonical Huffman codes: there is an easy algorithm to generate them from the table of code lengths, discarding the codes your tree produced (see the sketch after this list).
  • Compress the header with delta encoding: store the length difference between subsequent codes, using a variable-length encoding. Small differences tend to be more common. (Seen in e.g. DEFLATE.)
  • Remove most zero lengths from the header by first storing a sparse bitmap that indicates which symbols occur in the file. (Seen in e.g. bzip2.)
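
For the first point, here is a minimal sketch of the canonical-code construction (the same scheme RFC 1951 describes), assuming a `len[]` array of code lengths where 0 marks an unused symbol; the names `canonical_codes`, `len`, and `code` are illustrative, not taken from your code:

```c
#include <stdint.h>

#define MAX_LEN 15   /* length limit, as suggested above */
#define NSYMS   256

/* Assign canonical codes from a table of code lengths (0 = unused symbol).
 * Codes of the same length are consecutive, in symbol order. */
void canonical_codes(const uint8_t len[NSYMS], uint32_t code[NSYMS])
{
    uint32_t count[MAX_LEN + 1] = {0};
    uint32_t next[MAX_LEN + 1];

    for (int s = 0; s < NSYMS; s++)
        count[len[s]]++;
    count[0] = 0;    /* zero-length entries get no code */

    /* First code of each length: the standard canonical construction. */
    uint32_t c = 0;
    for (int l = 1; l <= MAX_LEN; l++) {
        c = (c + count[l - 1]) << 1;
        next[l] = c;
    }

    for (int s = 0; s < NSYMS; s++)
        if (len[s] != 0)
            code[s] = next[len[s]]++;
}
```

Since the codes are a pure function of the lengths, the decoder can rebuild exactly the same codes from a 4-bit-per-symbol header.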

# Encoding process

Walking up the tree for every byte of the file is needlessly inefficient. You could precompute an array of codes and lengths once, in advance, and then use that array during encoding. Each code can be represented as a single unsigned integer, no array necessary (it won't be that long, and you will want to limit the code lengths anyway for decoding and header reasons). It can be prepended to `buf` with a couple of simple bitwise operations, similar to how you currently add a single bit, except `nbuf++` becomes `nbuf += codelength`. Together this lets encoding a symbol take a constant number of operations instead of scaling linearly with the code length, as sketched below.
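
A rough illustration (not your exact code), assuming MSB-first bit order, the 15-bit length limit from above, and precomputed `code[]`/`len[]` tables; `buf` and `nbuf` are borrowed from your code, but their types here are guesses:

```c
#include <stdint.h>
#include <stdio.h>

/* MSB-first bit buffer, mirroring the buf/nbuf pair in the original code. */
static uint64_t buf  = 0;
static int      nbuf = 0;

void emit_symbol(uint8_t sym, const uint32_t code[256], const uint8_t len[256],
                 FILE *out)
{
    /* Append the whole code in one step instead of one bit per tree level. */
    buf  = (buf << len[sym]) | code[sym];
    nbuf += len[sym];

    /* Flush completed bytes; with a 15-bit length limit, buf cannot overflow. */
    while (nbuf >= 8) {
        nbuf -= 8;
        fputc((int)((buf >> nbuf) & 0xFF), out);
    }
}
```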

# Decoding process

Currently your code implements bit-by-bit tree walking, which is (as one source puts it) dead slow. The alternative is to decode several bits at a time with an array lookup. There are a lot of subtly different ways to do that, but the basis of all of them is that part of the buffer is used to index into a table. Limiting the maximum code length is very useful here: with a limited length you can guarantee that decoding is a single-step process, resolving one symbol from the buffer in a constant number of operations, with no looping. A minimal version is sketched below.
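
A minimal sketch of such a single-step decoder, assuming the canonical `code[]`/`len[]` tables from above and the 15-bit length limit; `peek15()` and `consume()` stand in for whatever bit reader your decoder uses and are hypothetical:

```c
#include <stdint.h>

#define MAX_LEN  15
#define TBL_SIZE (1u << MAX_LEN)

typedef struct { uint8_t sym; uint8_t len; } Entry;

/* Fill the table: every MAX_LEN-bit window whose top bits match a code
 * maps to that code's symbol, so one lookup resolves the next symbol. */
void build_table(Entry tbl[TBL_SIZE],
                 const uint32_t code[256], const uint8_t len[256])
{
    for (int s = 0; s < 256; s++) {
        if (len[s] == 0)
            continue;
        uint32_t first = code[s] << (MAX_LEN - len[s]);
        uint32_t span  = 1u << (MAX_LEN - len[s]);
        for (uint32_t i = 0; i < span; i++)
            tbl[first + i] = (Entry){ (uint8_t)s, len[s] };
    }
}

/* Hypothetical bit-reader helpers over the input buffer. */
extern uint32_t peek15(void);      /* next 15 bits, MSB-first, zero-padded */
extern void     consume(int bits); /* advance the read position */

int decode_symbol(const Entry tbl[TBL_SIZE])
{
    Entry e = tbl[peek15()];  /* single array lookup, no tree walk */
    consume(e.len);
    return e.sym;
}
```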

A relevant source for these techniques is the one quoted in the previous paragraph.
