Friday, February 14, 2014
FSE decoding : wrap up
With all the previous articles written on FSE, it seems we now have all the elements to build a complete FSE decoder. Let's wrap up all this learning into a single piece.
To start with, we assume that every symbol is attributed a weight of at least 1, and that the weights are normalized so that their sum is a power of 2 (8, 16, 32, etc.).
The size of the decoding table is this sum.
Each symbol will appear in the table exactly as many times as its weight.
We have seen that, to maximize the table's capability to capture fractional bits, we have to spread each symbol over the table, at roughly regular distances (the more regular, the better the compression performance).
For example, reusing the earlier case of a symbol A with weight 5/16 :
We have seen that destination sub-ranges must sum up to cover the complete table, with each sub-range having a size which is a power of 2, defining an exact number of bits to use. We have seen that the smaller ranges must be placed at the beginning.
We now must associate each symbol in the table with a destination range.
To do that, we will number the ranges, beginning with the larger ones.
So the final assembling becomes :
If you build a decoding table using the above construction method, you get a simple decoding algorithm as a result. The hot loop is a mere 4 lines, as follows :
// Persistent variables across the loops : state and bitStream
BYTE symbol = decodeTable[state].symbol;            // symbol to output for the current state
int nbBits  = decodeTable[state].nbBits;            // number of fresh bits to read
int rest    = getBits(bitStream, nbBits);           // read them from the bitStream
state = decodeTable[state].newStateBaseline + rest; // compute the next state
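Wrapped into a complete (if simplified) decoding function, the same hot loop could look like the sketch below. The table-entry layout simply mirrors the three fields used above; the bit-reader type and helpers (bitStream_t, getBits) are hypothetical names, and the actual code on Github differs in details such as error handling and bitStream management.

typedef struct {
    unsigned short newStateBaseline;   // start of the destination sub-range
    BYTE symbol;                       // symbol decoded for this state
    BYTE nbBits;                       // number of fresh bits to read
} FSE_decode_t;

void FSE_decodeSketch(BYTE* dst, size_t dstSize,
                      const FSE_decode_t* decodeTable, unsigned tableLog,
                      bitStream_t* bitStream)
{
    unsigned state = getBits(bitStream, tableLog);   // initial state is read directly from the stream
    for (size_t i = 0; i < dstSize; i++) {
        dst[i] = decodeTable[state].symbol;
        int nbBits = decodeTable[state].nbBits;
        int rest   = getBits(bitStream, nbBits);
        state = decodeTable[state].newStateBaseline + rest;
    }
}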
This is all it takes to build your own FSE decoder.
You can always use the code at Github for further guidance.
Next part will be a study of the compression algorithm part of FSE, which is a bit more complex...
Sunday, February 9, 2014
FSE : distributing symbol values
Following the last article on defining optimal sub-ranges for FSE, and attributing them according to symbol ranking, we were left with the question of how to position symbol values into the state table.
We already said that an FSE decoding table following ANS theory is equivalent to a Huffman low-bit-order decoding table : a symbol with a probability of 50% will be positioned once every 2 cells. With a probability of 25%, it will be positioned once every 4 cells. And so on.
Obviously, the next question is : what if the probability of a symbol is not a power of 2 ?
A first obvious idea is to follow the same logic : if, for example, a symbol has a probability of 33.333%, it should be positioned once every 3 cells.
OK, but even 33% is an easy target, because it's such a convenient fraction. What if we have, for example, a probability of 7.27%, resulting in an average distance between symbols of 13.755... cells ?
Well, an intuitive idea is that we will try to approach this distribution, typically by spacing symbols sometimes by 13 cells, sometimes by 14 cells. And indeed, it works.
OK, it may look clear and simple for a single symbol. But what about all other symbols together, trying to get their own regularly spaced slots ? Won't they be fighting for the same slots ?
Let's take an example, to illustrate the situation. Let's imagine that, on a 16-states decoding table, we have the following distribution :
A : 5 ; B : 5 ; C : 3 ; D : 3
In this case, we have a first issue, because 16/5=3.2, which obviously is not an integer. So we will have to space A by a distance of 3.2, which we can only approximate, using 3 most of the time, and 4 sometimes.
The intuitive optimal distribution for A is reached by applying this algorithm :
Position 0 : at P/2 (=1.6, so rounded down to 1 for example)
Position n = position n-1 + 3.2
Which gives : 1 (1.6), 4 (4.8), 8 (8.0), 11 (11.2), 14 (14.4)
(Edit : Charles Bloom calls this method "bias 0.5", and also notes that "bias 1.0" actually gives better results, which feels strange, as "bias 0.5" is supposed to be "optimal". This result is likely related to the fact that positions in the state table are not equal in terms of contribution to probability, with the lower states being worth more than the higher ones. See the follow-up investigation.)
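In code form, this "bias 0.5" placement for a single symbol can be sketched as follows; it's a toy illustration of the method above, ignoring collisions with other symbols.

#include <stdio.h>

// Ideal "bias 0.5" positions for one symbol of weight 'count' in a table of size 'tableSize'
void idealPositions(unsigned count, unsigned tableSize)
{
    double step = (double)tableSize / count;   // e.g. 16/5 = 3.2
    double pos  = step / 2;                    // start at half a step
    for (unsigned i=0; i<count; i++) {
        printf("%u (%.1f) ", (unsigned)pos, pos);   // prints : 1 (1.6) 4 (4.8) 8 (8.0) 11 (11.2) 14 (14.4)
        pos += step;
    }
}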
Yeah, this would be fine enough, except that, A is not alone.
What about B ? It has the same probability, so it should occupy the same positions ?
Of course not, only one symbol can occupy a state, so let's say that B is kind enough to move its own positions up by 1, leading to 2, 5, 9, 12, 15.
OK, it looks fine, so now, what about C ?
C distribution is 3, so Pos0(C) = (16/3) / 2 = 5.33/2 = 2.66 => 2.
Humm, that's a problem, 2 is already used by B.
OK, let's say B has higher priority because its ideal position was 1.6 to begin with, which is lower than 2.66. So we keep 2 for B, and give 3 to C.
And so on. Add D to the mix for even more fun. D basically takes whatever cells remain for its own use, wherever they are.
As you can see, it seems to be a mess. And it is : with more and more symbols filling the table, space is necessarily becoming scarce for later symbols. Notice how D ends up receiving 2 positions clustered together, instead of being nicely spread.
Thankfully, a "perfect" distribution algorithm is possible : for each position, compare all fractional state values of all symbols, pick the lowest one, increase its value by step=1/probability, and start again the same process for next state. This will produce a lot of ties, but one could use symbol order to solve such cases.
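As an illustration, here is a minimal sketch of that "perfect" distribution, assuming normalized counts (their sum equals tableSize) and at most 256 symbols; ties are broken by symbol order, and no attempt is made at being fast.

#include <float.h>

// For each cell, pick the symbol whose fractional position is lowest,
// then advance that position by step = tableSize/count (i.e. 1/probability, in cells).
void perfectSpread(unsigned char* table, unsigned tableSize,
                   const unsigned* count, unsigned nbSymbols)
{
    double next[256];
    for (unsigned s=0; s<nbSymbols; s++)
        next[s] = count[s] ? (double)tableSize / count[s] / 2 : DBL_MAX;   // "bias 0.5" start

    for (unsigned cell=0; cell<tableSize; cell++) {
        unsigned best = 0;
        for (unsigned s=1; s<nbSymbols; s++)
            if (next[s] < next[best]) best = s;      // lots of comparisons : this is the slow part
        table[cell] = (unsigned char)best;
        next[best] += (double)tableSize / count[best];
    }
}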
It works.
It's just damn slow, due to the huge number of comparisons required.
It can be sped up by using a bucket sorting strategy. An interesting method is to use more buckets than states, for easier tie management. It works fine too, and drastically reduces the number of comparisons involved, at the cost of a bit of accuracy. But that's okay : efficiency remains good enough, "close to the Shannon limit", which means that precise initialization is actually not required.
Let's dwell on that sentence : precise initialization is not required.
Indeed, the only point is to reduce the amount of "waste" in fractional bit estimation. Where does that waste come from ?
Well, as said in a previous article, a fractional bit is translated into a "decrease" of the state value.
For this to work correctly, i.e. to minimize the fractional bit cost, we need to have our target symbol approximately at the intended destination state value. If not, then we will "waste" more fractional bits in order to find the intended symbol.
With this concept in mind, the worst thing that can happen is to "cluster" the values together at one end of the range : it maximizes the probability that the state value will have to be decreased more than necessary to find the next symbol.
On the other hand, if a symbol with a probability of 50% is distributed evenly, once every 2 cells, finding it is highly simplified, and even guaranteed at a distance of at most 1.
The above explanation is a bit simplified, but it works well at explaining how the fractional bit mechanism works.
So, it's not very important to be precise, what matters is to be well "spread" across the spectrum of state values.
With this idea in mind, we can concentrate on the next objective : speed.
Rather than trying to find the best distribution, we'll try to find one which is "good enough" at spreading symbols across the table, and most importantly fast to execute.
With some experiment, I ended up selecting this simple formula :
nextState = (currentState + (5/8) * range + 3) % range
which features a pretty good spreading power, and is way faster to run than all previous proposals, thanks to the absence of comparisons and branches, and to the guarantee of reaching all state values in a single scan. Using the above example, this solution produces the following distribution :
in a word, good enough.
Note that, in a real implementation, the result looks even better than in this example, due to the use of larger tables. We are limited here, for visibility, to 16 states, but the heuristic has been tuned for larger sizes, 512 states and beyond.
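To make the heuristic concrete, here is a minimal sketch of such a spreading function, assuming normalized counts summing to tableSize (a power of 2); the actual FSE source may differ in details.

// Spread symbols over the table using step = (5/8)*tableSize + 3.
// The step is odd, hence coprime with the power-of-2 table size :
// a single scan is guaranteed to visit every cell exactly once.
void spreadSymbols(unsigned char* table, unsigned tableSize,
                   const unsigned* count, unsigned nbSymbols)
{
    unsigned step = (tableSize * 5 / 8) + 3;
    unsigned pos  = 0;
    for (unsigned s=0; s<nbSymbols; s++)
        for (unsigned i=0; i<count[s]; i++) {
            table[pos] = (unsigned char)s;
            pos = (pos + step) & (tableSize - 1);    // same as % tableSize
        }
}

With the 16-state example above (A:5, B:5, C:3, D:3), the step is 13, and this sketch spreads A over cells {0,4,7,10,13}, B over {1,5,8,11,14}, C over {2,12,15} and D over {3,6,9}.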
In Charles Bloom's article, a different distribution method is recommended, closer to optimal and backed by experimental results. There's no contradiction : this blog is called "fast compression" for a reason, and that's why, in this recommendation, speed takes precedence.
Wednesday, February 5, 2014
FSE : Defining optimal subranges
Note : Charles Bloom started an independent in-depth analysis of ANS on his blog, which is definitely worth the read.
As stated in earlier FSE comparison with Arithmetic Coding, FSE deals with fractional bits by "decreasing" its state value. Now how does that work precisely ?
Let's use again the example of a symbol whose probability of appearance is 5/4096. The Shannon limit for this symbol is 9.68 bits per symbol (if we do better, then some other symbols will suffer, and the overall will compress less). As stated earlier, we deal with the fractional bit issue by outputting sometimes 9 bits, sometimes 10 bits.
The distribution of 9-bits & 10-bits sub-ranges must ensure that all possible values remain accessible after the current symbol. It's a condition for lossless compression. The result is the following layout :
OK, so why are the 9-bits subranges positioned at the beginning of the range ?
Well, in the previous article, we said that an additional bit is read when the state is decreased so much that it goes below the lower limit. In which case, it just "wraps around", starting again from the top value, at the cost of an additional bit read.
So it's not so much that the 9-bit ranges are at the bottom. Rather, with the above explanation, it is more intuitive to say that the 10-bit ranges are at the top.
Maybe it would be easier to notice with another example, a symbol with 7/4096 probability of appearance.
Note that these subranges are "destination ranges". Each of the 7 symbol_cells is mapped to one and only one of these ranges.
So which symbol is associated with each range ?
As stated earlier, since the additional bit is "paid for" when crossing the low limit, the symbol_cell most likely to trigger it is the lowest one.
So we give an index to each of the 7 symbol_cells, in their order of appearance, from the smallest to the largest.
The largest, 10-bit range is attributed to symbol_cell n°1.
The other ranges simply follow, in their natural order :
This is coherent with an optimal cost associated with a 7/4096 distribution : 9.19 bits.
It means that we are only consuming 0.19 fractional bits each time the symbol is decoded. So we can decode it about 5 times at 9-bits before requiring 10-bits.
This is in contrast with a probability of 5/4096 :
Here, the optimal cost is 9.68 bits, so the fractional bit cost is much higher (0.68 bits). We are likely to cross the low limit much more often. Hence the first 3 symbol_cells trigger the additional bit read, and require 10 bits.
This rule is valid for any probability; so now, whether it is 5/4096, 134/4096, or even 3812/4096, you know how to define destination sub-ranges, and can build the associated decoding table.
(That last example is interesting, since some sub-ranges will be 0-bit large, which means they map directly to another state value : 3812/4096 => 284 1-bit sub-ranges + 3528 0-bit sub-ranges.)
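These counts can be computed mechanically. Here is a small helper sketch (my own illustration, not the FSE source) which reproduces the figures above.

#include <stdio.h>

// For a symbol of normalized frequency 'freq' within a table of size 2^tableLog,
// compute how many "large" and "small" destination sub-ranges it receives.
void subRanges(unsigned freq, unsigned tableLog)
{
    unsigned hb = 0;                                 // highest bit set in freq
    for (unsigned v=freq; v>1; v>>=1) hb++;
    unsigned maxBits = tableLog - hb;                // size (in bits) of the larger sub-ranges
    unsigned nbLarge = (2u << hb) - freq;            // how many sub-ranges use maxBits
    unsigned nbSmall = freq - nbLarge;               // the rest use (maxBits - 1) bits
    printf("%u/%u : %u sub-ranges of %u bits + %u sub-ranges of %u bits\n",
           freq, 1u<<tableLog, nbLarge, maxBits, nbSmall, maxBits-1);
}

// subRanges(5, 12)    -> 3 sub-ranges of 10 bits + 2 sub-ranges of 9 bits
// subRanges(7, 12)    -> 1 sub-range of 10 bits + 6 sub-ranges of 9 bits
// subRanges(3812, 12) -> 284 sub-ranges of 1 bit + 3528 sub-ranges of 0 bits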
What remains to be understood is how to distribute the symbol states. A naive approach would be to simply cluster them together. For example, a symbol with probability of 7/4096 would occupy state values from 0 to 6. It would work, but compression effectiveness would be very poor.
In another blog post, we'll see why, and how to correct that.
Monday, February 3, 2014
A comparison of Arithmetic Encoding with FSE
Arithmetic encoding was invented in the 70's by Jorma Rissanen. After a few decades during which the technique almost stagnated, partly as a side-effect of an abusive patent strategy, it started to take off in the early 2000's, and is now clearly established as the "current generation" entropy coding stage, thanks to its much better compression effectiveness than Huffman.
Since FSE claims to achieve equivalent compression performance but without the speed penalty, it's interesting to compare them, and see what makes them similar, or different.
The basic concept behind Arithmetic Encoding is the simple fact that shifting a bit is equivalent to a division by 2.
So now, it becomes clear that Huffman is just a special case of Arithmetic Encoding : since Huffman only shifts bits, it's equivalent to divisions by powers of 2.
By fully acknowledging this equivalence, it becomes obvious that it's also possible to divide by any other number, not limited by the power-of-2 requirement. This is how the fractional bit is achieved. You can divide by 1.11, and you get the equivalent of a ~0.15 bit cost.
The illustration above, taken from Wikipedia, is relatively clear, although it may be even simpler to look at a Huffman one. Let's use again this simple distribution :
Each symbol has an offset (or start position) and a range.
A : start 0, range 4 => factor 2
B : start 4, range 2 => factor 4
C : start 6, range 1 => factor 8
D : start 7, range 1 => factor 8
The "factor" is simply the total divided by the range.
In order to encode a symbol, arithmetic encoding always considers that it's selecting a sub-range between 0 and 1, then rescaled to match a distribution table. In this case, the table has a size of 8, so we have to output a value between 0 and 7.
To encode symbol B, we restrict our new range between 4 and 6 (excluded). So we first settle with 4, because it is the start value of B.
Then we look at the divisor : we only have 2 positions remaining, so in order to recover the full range of possibilities, we have to "upscale" this range by the factor. So we multiply by 4 (i.e. shift left by 2 bits), and find again a range of 8 values to encode.
We can start again, but now, we have 2 bits which are already determined, occupying the higher bit positions ('10').
This example is very easy because we only have powers of 2. But the exact same mechanism works with any distribution. You could even have a total which is a prime number, and any kind of distribution into it (such as 9,7,1,2 for a total of 19). It would work the same way : scale the value to 19 slots, add the start value, rescale by the factor, and so on.
Except that you will have to face rounding errors. In a world of unlimited precision, this would not matter, but alas, that's not how computers work. You have to scale arithmetic precision to a bounded accuracy. It's not necessarily a big issue, since what truly matters is to make the same rounding error on both encoding and decoding (which is a significant problem in itself, especially in a cross-platform environment). Besides, it's equivalent to a small compression loss, and therefore it's important to limit that loss as much as possible.
A common way to do that is to "spread" the segment to a virtual size as large as possible. So, in previous example, we will not have a value between 0 and 18, but rather between 0 and 2^31-1 for example. We accept a small rounding error in doing it, but it's very acceptable.
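For reference, one encoding step of such a (simplified) range coder can be sketched as below. This is the generic textbook form, not FSE code; renormalization (flushing the top bytes of 'low' when 'range' becomes small) is omitted.

#include <stdint.h>

typedef struct { uint32_t low; uint32_t range; } RangeEncoder;   // 'range' starts at 0xFFFFFFFF

void encodeSymbol(RangeEncoder* rc, uint32_t start, uint32_t freq, uint32_t total)
{
    rc->range /= total;               // size of one slot on the virtual scale
    rc->low   += start * rc->range;   // select the symbol's sub-range : [start, start+freq)
    rc->range *= freq;
    // renormalization step omitted
}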
Now that we have described encoding, let's look at how decoding works.
It's almost the same operation. To simplify decoding, we have a precomputed decoding table, of known size (8 in our first example). We read our first number, using virtual scale, and then scale it back to distribution table size, decode the symbol (B if it's a 4 or 5), get the new range, rescale, read more bits, and so on.
All this works well, but as can be guessed, it costs quite a few multiplications and divisions in the process. With several clever tricks (typically ensuring that the distribution total is a power of 2), it's possible to reduce these operations to a minimum, but alas, one division remains in the decoder, and several multiplications are also there.
Now let's have a look at the ANS/FSE decoder.
We start with basically one identical concept : a bit read is equivalent to a division by 2.
A first difference, though, is that a bit read is not a bit shift.
In contrast with Huffman or Arithmetic, the "state" value is not read directly from the bitstream; it is a virtual value which is maintained alongside the bitstream.
That being said, it can be almost the same : a symbol with a 50% probability will basically divide state by 2, then discover that one bit must be added to keep state within range, then shift it (equivalent to arithmetic's rescale), and read one bit from the bitStream.
Another difference is that a fractional bit is obtained by reading a variable number of bits. For example, if a symbol should be written using 3.3 bits (equivalent to 10%), the decoder will sometimes read 3 bits, and sometimes read 4 bits.
So we never "rescale by 10" as an arithmetic decoder would, but rather sometimes rescale by 8, and sometimes by 16 (which, incidentally, is a better fit for binary representation, reducing rounding errors).
Okay, but how is this decided ?
Well, this is the magic of the decoding table, which will be described in another post, but to give you an idea of the bigger principle, here is how I would describe it in simple terms :
the fractional bit is translated into a "decrease" in state value.
When state value is high, state is just decreased, but it doesn't trigger the "additional bit". So, following our 10% example, it means we just read 3 more bits.
When state value is low, it tends to "cross" the lower limit. As a consequence, one more bit is read, and state is regenerated at its full value.
OK, that's roughly how the decoder behaves, from an external perspective.
But it's not enough to fully describe the exact behavior. You cannot simply subtract a fixed amount from state in order to get your new state value. Unfortunately, it matters whether the next state value is exactly 4 or 5, because it will influence some future symbol. So the exact behavior is more complex to describe. And that will be explained in a later post.
Tuesday, January 14, 2014
Huffman, a comparison with FSE
In this series explaining the behavior of Finite State Entropy, I find it interesting to make a quick comparison with Huffman entropy. It's a direct result of the description of its decoding principles; nonetheless, I believe it's useful to clearly establish the equivalence, since it will be illustrative for future articles.
To better grasp the similarities, let's take a simple example. We'll use the following probabilities, suitable for a Huffman tree :
A : 50% ; B : 25% ; C : 12.5% ; D : 12.5%
This tree only needs 8 values to be properly represented, and gives the following linear representation :
This representation uses the "high bit first" methodology, which means we are starting our node travel from the highest bit.
If it's 0, then we have an A. We stop there, shift the register left by one bit and load a new one at the bottom.
If the highest bit is 1, we know we are not done, and need to read the second bit. If this second bit is 0, we know we have a B. If not, we continue reading. And so on.
The decision tree looks as follows :
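In code form, the same walk can be sketched like this (readBit being a hypothetical helper returning the next bit of the stream, highest bit first) :

// Decision tree for { A:50% ; B:25% ; C:12.5% ; D:12.5% }, high bit first
if (readBit(bitStream) == 0)      symbol = 'A';
else if (readBit(bitStream) == 0) symbol = 'B';
else if (readBit(bitStream) == 0) symbol = 'C';
else                              symbol = 'D';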
But we could have done differently : we could have used "low bit first". In this case, the linear representation of the tree looks like this :
Which seems a lot messier, but it's not : we simply reversed the bit order of the decision tree :
The 2 decision trees are therefore completely equivalent.
Now, look again at the "low bit first" linear representation. It's important to state that this is what FSE would produce, but it also embeds a few additional properties :
- The "newStateBaseline" does not need to be calculated and stored into the decoding table : it can be automatically determined from the state itself : newStateBaseline = currentState & ~mask(nbBits);
- nbBits is stable for a given character : for A, it's always 1, for B, it's always 2, etc. It simplifies the decoding table, which does not need to store this information.
So basically, a Huffman-low_bit_first table looks the same as an FSE table, with the additional properties that the number of bits per character is stable and that newStateBaseline can be deduced automatically from the remaining bits.
There is, however, a small but important difference.
A Huffman-low_bit_first decoder would shift the current state right, and load the next bit(s) high. In contrast, the FSE decoder keeps the remaining bits, and only reloads the low bits. This creates a dependency on future state : basically, once you've reached state 2 with FSE, for example, you can already guarantee that you won't see a D before you see a C. Which raises the question : how can the encoder select the right state, since it depends on future characters yet to be decoded ?
The answer : the encoder will work in backward direction, encoding characters from last to first, solving dependencies along the way.
To be continued...
Thursday, January 9, 2014
FSE decoding : how it works
After announcing the release of FSE and its results, it's time to have a look at its internals.
We'll spend some time first on the decoding loop, since it's both the easiest one to understand and the most important one. Encoding is much more convoluted, and that's where ANS theory comes into play (for some more detailed explanations on how ANS works and its related properties, I invite you to read Jarek's latest paper). But for the time being, we don't need that yet. The decoding loop is generic enough to handle diverse kinds of entropy algorithms, including Huffman and Arithmetic, as we'll see later on.
The decoding code can be freely consulted on GitHub. The "hot loop" is pretty short, and can be pseudo-coded like this :
// One persistent variable across all loops : int state
outputSymbol(decodeTable[state].symbol);
int nbBits = decodeTable[state].nbBits;
int rest = getBits(bitStream, nbBits);
update(bitStream, nbBits);
state = decodeTable[state].newState + rest;

There is a single value to track, called state.
state value can range from 0 to a maximum of total_number_of_states-1, which is conveniently a power of 2. For example, in the FSE default implementation, total_number_of_states is 4096, representing 12 bits.
The initial state value is read directly from the bitStream. Then its evolution is a bit more complex.
In contrast with Huffman, which merely shifts current bits and replaces them with new ones from the bitStream, FSE completely rewrites the state value after each step.
The first thing to keep in mind is that a state value is enough to immediately decode a symbol, and provide the next state value. Both pieces of information are stored directly into the decoding table.
If we had the chance to handle a large enough state number, with its associated decoding table, the decoding loop would be trivial :
outputSymbol(decodeTable[state].symbol);
state = decodeTable[state].newState;

Obviously, we are not that lucky : a large number representing a complete file or segment would result in a way too large decoding table to be practical. So we have to keep this size reasonable.
In practice, we keep as many bits as necessary to represent probabilities with enough accuracy, for example 12 bits proposed by default in FSE for a 256 alphabet.
Now we have to modify the decoding loop, by introducing a few new bits between each step, keeping the value of state within accessible range. The number of bits to load is also embedded directly into the decoding table.
outputSymbol(decodeTable[state].symbol);
int nbBits = decodeTable[state].nbBits;
state = decodeTable[state].newStateBaseline + getSomeBits(bitStream, nbBits);

A legitimate question at this stage is : why on earth does this construction compress anything ?
Well, we are getting close to it. Consider that the number of bits to read after each symbol is decoded is the cost of compression.
This number of bits is in no way random : it entirely depends on the (normalized) frequency of the decoded character. And here is why.
Let's suppose our range of possible values is 4096 (12 bits). And let's suppose that the normalized frequency of our decoded symbol is 4. That means it should be present about 4 times every 4096 symbols, which is equivalent to once every 1024 symbols, which is equivalent to 10 bits.
That conclusion would have been the same for Huffman : after reading the character, we need 10 new bits to read the next one.
This example is simple because the symbol frequency is a power of 2.
So now, let's imagine that its frequency is 5. That means it's present 5 times in the table of 4096 cells.
According to Shannon theory, we are supposed to use -log2(5/4096) = 9.68 bits to represent this symbol. How can we achieve this ?
Well, we are going to divide the range of accessible values into 5 sub-ranges. Since we are loading full bits, we need each sub-range to have a size which is a power of 2. And since the total remains 4096, we need several different sub-range sizes.
A simple way to achieve this is to use a combination of 9 bits and 10 bits ranges (sometimes called phase-in codes). For example :
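For instance, with the two 9-bit ranges placed first (the choice discussed in the post on defining optimal sub-ranges), the layout could be :

[   0 -  511] : read  9 bits
[ 512 - 1023] : read  9 bits
[1024 - 2047] : read 10 bits
[2048 - 3071] : read 10 bits
[3072 - 4095] : read 10 bits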
Now, we have 5 sub-ranges, and collectively they cover the full spectrum of possible values.
Each of the 5 symbol occurrences gets assigned one of these ranges, which translates into an offset (0, 512, 1024, etc.), and a number of bits (9 or 10). Hence :
state=decodeTable[state].newStateBaseline + getSomeBits(bitStream, nbBits);
At this stage, we have not yet settled any specific rule regarding these ranges nor their assignment. We could have the 9-bit ranges at the end, or mix them (for ex. 10-9-10-9-10). Finally, the 5 occurrences of the symbol could be anywhere between 0 and 4095, and they could be assigned these ranges in no particular order.
The sequence of states is always "going down", generating a bit read each time it crosses the lower limit.
Fractional bits from a "high state" are consumed by moving to a "low state".
Fractional bits from a "low state" are consumed by reading a full bit, and then reaching a "high state".
What could be the efficiency of such a representation method ?
We are targeting 9.68 bits. Do we get close to it ?
Well, if we take the naive assumption that all values in the state range are equi-probable, then we can calculate the likelihood of getting 9 or 10 bits to read.
Average_bitCost = (2 x 9 x 512 + 3 x 10 x 1024) / 4096 = 9.75 bits
That's not bad, we are getting fairly close to 9.68 bits, but not there yet.
The trick is, these values are not equi-probable. In the ANS construction, states at the beginning of the range are slightly more likely to be generated than states at the end of the range, due to the way 'newStateBaseline' values are generated. So the lower range [0-511] is more likely to be generated than the same-size higher range [3584-4095], and therefore weighs more.
Calculating the average is therefore not straightforward. But bear with me : if you don't have the patience to read the mathematical proof from Jarek Duda, the ANS construction converges towards the optimal cost of the Shannon limit (we were already pretty close to it to begin with).
This result, however, depends on the way symbols are spread across the table.
Which is more complex, and will be detailed in another blog post.
Monday, December 23, 2013
Zhuff v0.9, or a first FSE application
As a quick follow-up to last week's release of FSE, I wanted to produce a quick demo to show the kind of gain to expect with this new entropy algorithm.
And a simple example shall be Zhuff.
Zhuff is, basically, LZ4 and Huff0 stuffed together. It's very fast in spite of this 2-pass process. Its major limitation so far is a 64KB window size, an inheritance from LZ4, but that's something we'll talk about again in a later post.
For the time being, my only intention is to swap huff0 with FSE, and look at the results.
And the results are positive.
Here are some benchmark numbers, on a Core i5-3340M @2.7GHz, Windows Seven 64-bits, using a single thread and the fast algorithm :
Filename   Compressor   Ratio   Decoding speed
silesia    zhuff v0.8   2.783   275 MB/s
silesia    zhuff v0.9   2.805   311 MB/s
enwik8     zhuff v0.8   2.440   200 MB/s
enwik8     zhuff v0.9   2.460   230 MB/s
calgary    zhuff v0.8   2.672   220 MB/s
calgary    zhuff v0.9   2.693   250 MB/s
(Note : I've not mentioned improvements in compression speed; they are measurable but unrelated to FSE, so they are left out of this report.)
So, in a nutshell, we have a moderate (>10%) increase in decoding speed, and a small improvement in compression ratio. Nothing earth-shattering, but still worth grabbing.
The situation would have been even better for FSE if higher probabilities had to be compressed, but Zhuff is a simple fast LZ algorithm, so it doesn't exactly qualify as generating "high probabilities".
The purpose of this demo was just to show that FSE can easily replace and improve results compared to Huffman. As a side-effect, it makes Zhuff an interesting playground to test future FSE improvements.
With this part now settled, the next articles shall describe FSE principles, especially how and why it works.
Monday, December 16, 2013
Finite State Entropy - A new breed of entropy coder
In compression theory, the entropy encoding stage is typically the last stage of a compression algorithm, the one where the gains from the model are realized.
The purpose of the entropy stage is to reduce a set of flags/symbols to their optimal space given their probability. As a simple example, if a flag has a 50% chance of being set, you want to encode it using 1 bit. If a symbol has a 25% probability of having value X, you want to encode it using 2 bits, and so on.
The optimal size to encode a probability is proven, and known as the Shannon limit. You can't beat that limit, you can only get close to it.
A solution to this problem has been worked on for decades, starting with Claude Shannon's own work, which was efficient but not optimal. The optimal solution was ultimately found by one of Shannon's own pupils, David A. Huffman, almost by chance. His version became immensely popular, not least because he could prove, a few years later, that his construction method was optimal : there was no way to build a better distribution.
Or so it was thought.
There was still a problem with Huffman encoding (and all previous ones) : a hidden assumption is that a symbol must be encoded using an integer number of bits. To put it simply, you can't go lower than 1 bit.
It seems reasonable, but that's not even close to Shannon's limit. An event which has a 90% probability of happening, for example, should be encoded using 0.15 bits. You can't do that using Huffman prefix codes.
A solution to this problem was found almost 30 years later, by Jorma Rissanen, under the name of Arithmetic coding. Explaining how it works is outside the scope of this blog post, since it's complex and would require a few chapters; I invite you to read the Wikipedia page if you want to learn more about it. For the purpose of this presentation, it's enough to say that Arithmetic encoding, and its little brother Range encoding, solved the fractional bit issue of Huffman, and got much closer to the Shannon limit, with only some minimal losses due to rounding to complain about. So close in fact that entropy encoding has, since then, been considered a "solved problem".
Which is terrible because it gives the feeling that nothing better can be invented.
Well, there is more to this story. Of course, there is still a little problem with arithmetic encoders : they require arithmetic operations, such as multiplications and divisions, with strictly defined rounding errors.
This is a serious requirement for a CPU, especially in the 80's. Moreover, some lawyer-happy companies such as IBM grabbed this opportunity to flood the field with multiple dubious patents on minor implementation details, making clear that anyone trying to use the method would face expensive litigation. Considering this environment, the method was barely used for the next few decades, Huffman remaining the algorithm of choice for the entropy stage.
Even today, with most of the patent issues cleared, modern CPUs still take a hit due to the divisions. Optimized versions can sometimes avoid the division during the encoding stage, but not the decoding stage (with the exception of Binary arithmetic coding, which is however limited to 0/1 symbols).
As a consequence, arithmetic encoders are quite slower than Huffman ones. For low-end or even "retro" CPU, it's simply out of range.
It's been a long-time objective of mine to bring arithmetic-level compression performance to vintage (or retro) CPUs. Consider it a challenge. I've tested several variants, for example a mix of Huffman and Binary Arithmetic, which was free of divisions, but alas still needed multiplications, and required more registers to operate, which was overkill for weak CPUs.
So I've been reading with a keen eye the ANS theory, from Jarek Duda, which I felt was heading in the same direction. If you are able to fully understand his paper, you are better than me, because quite frankly, most of the wording used in his document is way out of my reach. (Note : Jarek pointed to an updated version of his paper, which should be easier to understand.) Fortunately, it nonetheless resonated, because I was working on something very similar, and Jarek's document provided the last elements required to make it work.
And here is the result today : the Finite State Entropy coder, which is proposed as a BSD-licensed package at Github.
In a nutshell, this coder provides the same level of performance as Arithmetic coder, but only requires additions, masks, and shifts.
The speed of this implementation is fairly good, and even on modern high-end CPU, it can prove a valuable replacement to standard Huffman implementations.
Compared to zlib's Huffman entropy coder, it manages to outperform its compression ratio while besting it on speed, especially decoding speed.
Benchmark platform : Core i5-3340M (2.7GHz), Windows Seven 64-bits
Benchmarked file : win98-lz4-run
Algorithm Ratio Compression Decompression
FSE 2.688 290 MS/s 415 MS/s
zlib 2.660 200 MS/s 220 MS/s
Benchmarked file : proba70.bin
Algorithm Ratio Compression Decompression
FSE 6.316 300 MS/s 420 MS/s
zlib 5.575 250 MS/s 270 MS/s
Benchmarked file : proba90.bin
Algorithm Ratio Compression Decompression
FSE 15.21 300 MS/s 420 MS/s
zlib 7.175 250 MS/s 285 MS/s
As could be guessed, the higher the compression ratio, the more efficient FSE becomes compared to Huffman, since Huffman can't break the "1 bit per symbol" limit.
FSE speed is also very stable, under all probabilities.
I'm quite pleased with the result, especially considering that, since the invention of arithmetic coding in the 70's, little new has been brought to this field.
-----------------------------------------------------
A little bit of History :
Jarek Duda's ANS theory was first published in 2007, and the paper has received many revisions since then. Back in 2007, only Matt Mahoney had enough skill and willpower to sift through the complex theory and morph it into working code. However, Matt concentrated on the only use case of interest to him, the binary version, called ABS, limited to a 0/1 alphabet. This decision put his implementation in direct competition with the Binary Arithmetic Coder, which is very fast, efficient, and flexible. Basically, a losing ground for ANS. As a consequence, ANS theory looked uncompetitive, and slumbered during the next few years.
The FSE work re-instates ANS as a competitive algorithm for multi-symbol alphabets (>2), positioning it as a viable alternative to block-based Huffman.
Thanks to promising early results from FSE, Jarek turned his attention back to multi-symbol alphabets. As we were chatting about the perspectives and limitations of ANS, I underlined the memory cost of the decoding table, and suggested a solution in the making to limit that issue (which ultimately failed). But Jarek took the challenge, and successfully created a new variant, which would be called rANS. He then published an updated version of his paper, and later coined the terms tANS and rANS to distinguish the different methods.
rANS was later adapted by Fabian Giesen and Charles Bloom, producing some very promising implementations, notably vector-oriented code by Fabian.
But as said, this is not the direction selected for FSE, which was created before Jarek's paper revision. FSE is a finite state machine, created precisely to avoid any kind of multiplication, with an eye on low-power CPU requirements. It's interesting to note that such radically different implementations can emerge from a common starting theory.
For the time being, FSE is still considered beta stuff, so please consider this release for testing purposes or private development environments.
Explaining how and why it works is pretty complex, and will require a few more posts, but bear with me, they will come in this blog.
Hopefully, with Jarek's document and these implementations now published, it will be harder this time for big corporations to confiscate an innovation from the public domain.
-----------------------------------------------------
List of Blog posts explaining FSE algorithm :
http://fastcompression.blogspot.com/2014/01/fse-decoding-how-it-works.html
http://fastcompression.blogspot.com/2014/01/huffman-comparison-with-fse.html
http://fastcompression.blogspot.com/2014/02/a-comparison-of-arithmetic-encoding.html
http://fastcompression.blogspot.com/2014/02/fse-defining-optimal-subranges.html
http://fastcompression.blogspot.com/2014/02/fse-distributing-symbol-values.html
http://fastcompression.blogspot.com/2014/02/fse-decoding-wrap-up.html
http://fastcompression.blogspot.com/2014/02/fse-encoding-how-it-works.html
http://fastcompression.blogspot.com/2014/02/fse-encoding-part-2.html
http://fastcompression.blogspot.com/2014/02/fse-tricks-memory-efficient-subrange.html
http://fastcompression.blogspot.com/2014/04/taking-advantage-of-unequalities-to.html
[Edit] : replaced huff0 by zlib on the comparison table
[Edit] : added entry on rANS variants by Fabian & Charles
[Edit] : added list of blog entries