
In arithmetic coding a word is encoded as the binary expansion of a number in a certain interval. The interval is determined through a sequence of nested intervals according to the probability distribution on the letters of the word.
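To make the nesting procedure concrete, here is a minimal sketch in Python using exact rational arithmetic; the model format (a map from each symbol to its cumulative probability range) and the function name are my own illustrative choices, not taken from any particular text:

```python
from fractions import Fraction

def narrow_interval(message, model):
    """Return the final half-open interval [low, high) for `message`.

    `model` maps each symbol to its cumulative probability range
    (c_lo, c_hi); this format is an illustrative assumption.
    """
    low, high = Fraction(0), Fraction(1)
    for symbol in message:
        c_lo, c_hi = model[symbol]
        width = high - low
        # Rescale the current interval to the symbol's sub-range.
        low, high = low + width * c_lo, low + width * c_hi
    return low, high
```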

This encoding and decoding process is totally clear to me, but what puzzles me is how the encoder chooses the optimal number, i.e. the one with the shortest binary expansion?

For example, in the linked Wikipedia article, in the paragraph Sources of Inefficiency, the example chooses 0ドル.538$ as the message, which is not optimal: its binary expansion is quite long (way longer than 8 bits), and, as also noted there, choosing 0ドル.5390625$ would be much better.

Also, if we change the probability model to a uniform distribution, we get $(5/27, 6/27)$ as the interval. As shown in that paragraph, the binary expansions of the boundaries are quite long, much longer than $\sim 5$ bits, but if we choose 0ドル.1875$ then this can be coded as 0ドル.0011$, which requires five bits if we transmit 00110ドル$ (the final 0ドル$ seems to be necessary for the decoder, as discussed in the paragraph).
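Finding that shortest representative is in fact straightforward: try $k = 0, 1, 2, \dots$ and take the first $k$ for which some multiple of 2ドル^{-k}$ lies in the interval. A minimal sketch (exact arithmetic again; I use the half-open convention $[\text{low}, \text{high})$, which may differ from a particular coder's convention):

```python
import math
from fractions import Fraction

def shortest_dyadic(low, high):
    """Shortest binary fraction m / 2**k in the half-open interval [low, high)."""
    k = 0
    while True:
        m = math.ceil(low * 2**k)   # smallest multiple of 2**-k that is >= low
        if Fraction(m, 2**k) < high:
            return m, k             # the value m / 2**k is written with k bits
        k += 1

print(shortest_dyadic(Fraction(5, 27), Fraction(6, 27)))  # (3, 4): 3/16 = 0.1875 = 0.0011 in binary
```

Since an interval of width $w$ always contains a multiple of 2ドル^{-k}$ once 2ドル^{-k} \le w$, the loop stops after at most $\lceil \log_2(1/w) \rceil$ bits, which is also why a wider interval (higher symbol probabilities) yields a shorter code.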

So choosing the right number in the interval is, as I understand it, the key to the compression algorithm (this matches the intuition that a larger interval, which results from higher probabilities of the individual symbols, gives more candidate numbers among which to find one with a short binary expansion). Yet every textbook and lecture notes I find concentrate on describing the interval-nesting procedure, not on how to choose a shortest representative.

So, if my understanding is correct, this seems to be an essential part of the method. How is it achieved?

asked Jun 2, 2017 at 16:51

1 Answer


While you are encoding symbols, you are not choosing a particular number for each symbol; you are narrowing down the possible choices. You could pick a number arbitrarily per symbol, but then you would have to adjust the ranges to make that the actual choice. It is true that for the LAST symbol you can choose a number that fits within the rounding error, so long as it is within the range for that symbol.

The point of the example was that an incorrect distribution gives poor, or at least worse, compression. Choosing better statistics gives better results. Huffman coding is within 1 bit per symbol of the entropy, whereas arithmetic coding is within a fraction of a bit per symbol, if the correct statistics are used. You can also choose how accurately you do the encoding: increasing the precision of the RANGE allows a more accurate representation of the given statistics, and thus less error. Choosing the wrong distribution for the specific message, regardless of how accurate the encoding process is, will give very poor compression or even make the output larger than the original message. So think of it as an error rate: the smaller you can make the error rate, the closer to the real limit you can get. Choosing a "static" distribution for all text files will never give you the smallest compressed file this specific method can produce.
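To put a number on the "fraction of a bit" claim, here is a hypothetical run combining the two sketches from the question (`narrow_interval` and `shortest_dyadic`); the two-symbol model and the message are made up for illustration:

```python
import math
from fractions import Fraction

# Hypothetical model: P(A) = 2/3, P(B) = 1/3, given as cumulative ranges.
model = {"A": (Fraction(0), Fraction(2, 3)),
         "B": (Fraction(2, 3), Fraction(1))}

low, high = narrow_interval("AABA", model)  # final interval [24/81, 32/81)
m, k = shortest_dyadic(low, high)           # (3, 3): 3/8 = 0.011 in binary
info = -math.log2(float(high - low))        # -log2 P(message), about 3.34 bits
print(k, info)                              # 3 bits used vs. 3.34 bits of content
```

The width of the final interval is exactly the probability of the whole message under the model, so the bit count tracks $-\log_2 P(\text{message})$ to within a couple of bits, provided the statistics match the message.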

JUST A NOTE: "Optimum" is often misunderstood. It simply means the best a specific method can do, not that it is the highest compression available.

answered Jun 16, 2017 at 15:10
