Tuesday, January 4, 2011

Entropy evaluation: small changes in the benchmark corpus

You can't easily improve what you can't measure. That's why test tools are so important.
For compression, a benchmark corpus is an essential tool to measure performance and check consistency.
Its main drawback, however, is that it tends to steer development toward whatever makes the benchmark look better, neglecting or even worsening situations which are not represented in the corpus.

Therefore, carefully choosing elements within the corpus, and even changing them from time to time, is a sensible thing to do.

For entropy evaluation, the different categories selected seem right. However, evaluating each category with just a single example file is misleading.
I started by replacing the LZ4 output with the LZ4HC output, which produces different patterns. I then added two sources: Win98.vmdk, a virtual hard disk image containing many cab files, and an Open Office directory, a mostly binary input. Here are the results (a sketch of how the per-field streams can be extracted follows the table):

                           Huff0 v0.6                       Range0 v0.7
                    Ratio  Compress  Decoding        Ratio  Compress  Decoding

Not compressible
  enwik8.7z         1.000  810 MB/s  1.93 GB/s       1.000  885 MB/s  1.93 GB/s

Hardly compressible
  win98-lz4hc-lit   1.024  465 MB/s  600 MB/s        1.019  374 MB/s  262 MB/s
  audio1            1.070  285 MB/s  280 MB/s        1.071  174 MB/s   83 MB/s

Distributed
  enwik8-lz4hc-lit  1.290  205 MB/s  194 MB/s        1.296  150 MB/s   77 MB/s

Lightly Ordered
  enwik8-lz4hc-offh 1.131  197 MB/s  184 MB/s        1.133  145 MB/s   79 MB/s

Ordered
  enwik8-lz4hc-ml   2.309  214 MB/s  195 MB/s        2.326  160 MB/s   77 MB/s

Squeezed
  office-lz4hc-run  3.152  218 MB/s  202 MB/s        3.157  164 MB/s   98 MB/s
  enwik8-lz4hc-run  4.959  245 MB/s  224 MB/s        5.788  188 MB/s  148 MB/s

Ultra compressible
  loong               278  785 MB/s  2.93 GB/s        1430  555 MB/s  427 MB/s
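As a side note on methodology: the "-lit", "-run", "-ml" and "-offh" files above are the individual LZ4/LZ4HC output fields, each stored as a separate stream. The sketch below shows one way such streams could be extracted from a raw LZ4 block; it assumes the publicly documented LZ4 block format rather than LZ4HC's internal output, skips bounds checking, and saturates lengths to one byte per sequence, so take it as illustrative only.

#include <stddef.h>
#include <string.h>

/* One stream per LZ4 field. Buffers are assumed pre-allocated and large
 * enough; this sketch skips all bounds checking on the outputs.          */
typedef struct {
    unsigned char *lit;  size_t litSize;   /* raw literal bytes           */
    unsigned char *run;                    /* literal run length, 1/seq   */
    unsigned char *ml;                     /* match length, 1/seq         */
    unsigned char *offl, *offh;            /* offset low / high bytes     */
    size_t nbSeq;
} FieldStreams;

static unsigned char saturate255(size_t v) { return (unsigned char)(v < 255 ? v : 255); }

/* Split one raw LZ4 block into per-field streams. Illustrative sketch,
 * assuming the documented block format: token = 4-bit literal length |
 * 4-bit match length, value 15 = read extra bytes; offsets are 2-byte
 * little-endian; stored match length is actual length - 4.              */
static void splitLZ4Block(const unsigned char *src, size_t srcSize, FieldStreams *fs)
{
    const unsigned char *ip = src;
    const unsigned char *const iend = src + srcSize;

    while (ip < iend) {
        const unsigned token = *ip++;
        size_t litLen   = token >> 4;
        size_t matchLen = (token & 15) + 4;

        if (litLen == 15) {                      /* extended literal length */
            unsigned char b;
            do { b = *ip++; litLen += b; } while (b == 255 && ip < iend);
        }
        memcpy(fs->lit + fs->litSize, ip, litLen);
        fs->litSize += litLen;
        ip += litLen;

        if (ip >= iend) break;                   /* last sequence: literals only */

        fs->run[fs->nbSeq]  = saturate255(litLen);
        fs->offl[fs->nbSeq] = ip[0];             /* offset, little-endian   */
        fs->offh[fs->nbSeq] = ip[1];
        ip += 2;

        if ((token & 15) == 15) {                /* extended match length   */
            unsigned char b;
            do { b = *ip++; matchLen += b; } while (b == 255 && ip < iend);
        }
        fs->ml[fs->nbSeq] = saturate255(matchLen);
        fs->nbSeq++;
    }
}

Feeding each resulting stream to Huff0 or Range0 separately is what makes per-category measurements like the ones above possible.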


There are several interesting lessons here.
Win98-lz4hc-lit is just the literals stream extracted by LZ4HC. But wait, why is it not in the "distributed" category? Well, since this file contains many incompressible chunks, the literals sub-stream ends up being mostly incompressible itself. This is an important real-world example, showing that incompressible segment detection makes a real impact.
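To make the point concrete, incompressible segment detection can be as simple as estimating the per-byte entropy of each block from its byte histogram and storing the block raw whenever the estimate is too close to 8 bits. This is only a sketch of the idea, not Huff0's actual mechanism, and the 7.9 bits/byte threshold is an arbitrary illustration:

#include <math.h>
#include <stddef.h>

/* Shannon entropy estimate, in bits per byte, of one block. */
static double blockEntropy(const unsigned char *block, size_t size)
{
    size_t count[256] = {0};
    double H = 0.0;
    size_t i;
    if (size == 0) return 0.0;
    for (i = 0; i < size; i++) count[block[i]]++;
    for (i = 0; i < 256; i++) {
        if (count[i]) {
            double p = (double)count[i] / (double)size;
            H -= p * log2(p);
        }
    }
    return H;
}

/* Decide whether a block is worth sending to the entropy coder at all.
 * Threshold is illustrative: above ~7.9 bits/byte the best possible gain
 * is under ~1.3%, not worth the encoding/decoding cost.                  */
static int isCompressibleBlock(const unsigned char *block, size_t size)
{
    return blockEntropy(block, size) < 7.9;
}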

LZ4HC produces fewer literals and fewer matches than the fast version, but longer ones. As a consequence, run lengths are much more compressible, while match lengths are not. It shows clearly that the more squeezed the distribution, the larger Range0's compression advantage.
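A small worked example shows why a squeezed distribution favors a range coder: a prefix (Huffman) code must spend a whole number of bits per symbol, so even an extremely frequent symbol costs at least 1 bit, while a range coder can spend a fraction of a bit on it. The probabilities below are illustrative, not measured from the benchmark:

#include <math.h>
#include <stdio.h>

/* Minimum cost of the dominant symbol of a distribution, for a prefix code
 * (at least 1 bit) versus a range coder (-log2(p) bits). Probabilities
 * here are illustrative examples only.                                     */
int main(void)
{
    double p;
    for (p = 0.5; p < 0.95; p += 0.1) {
        double rangeBits  = -log2(p);   /* fractional-bit cost              */
        double prefixBits = 1.0;        /* integer-bit lower bound          */
        printf("p=%.1f : range coder %.3f bits, prefix code >= %.3f bits, "
               "overhead >= %.3f bits per occurrence\n",
               p, rangeBits, prefixBits, prefixBits - rangeBits);
    }
    return 0;
}

At p = 0.9, typical of a squeezed run-length distribution, the prefix code wastes close to 0.85 bit on every occurrence of the dominant symbol, which is exactly where Range0 pulls ahead.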

One could think that run length is a typical situation which should always benefit Range0. Alas, it is not that simple. Such a conclusion is biased, a result of focusing too much on enwik8 for tests.
Most binary files feature a very different pattern: much less frequent matches, but much longer ones. As a consequence, literals tend to be far more numerous, and their compressibility is not guaranteed either. As a side effect, run lengths also end up being larger.

This is shown by the office example: although the distribution is still "squeezed", resulting in a pretty good x3 compression ratio, it is not enough to make Range0 distinctly better. In fact, considering the very small difference with Huff0, it is not worth the speed impact. This contrasts with the enwik8 results.

This means we should not assume that run lengths are always better compressed by Range0; a more refined selection process is required.
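One possible shape for such a selection process, sketched below under assumed numbers: from each block's histogram, estimate both an ideal (range-coder) cost and a rough integer-bit (Huffman-like) cost, then pick Range0 only when its size advantage is large enough to pay for its slower decoding. The 1% and 2% thresholds and the ceil(-log2(p)) code-length estimate are placeholders for illustration, not what Huff0 or Range0 actually do:

#include <math.h>
#include <stddef.h>

typedef enum { CODER_RAW, CODER_HUFF0, CODER_RANGE0 } CoderChoice;

/* Estimate output sizes from a byte histogram:
 * - ideal cost:  sum count * -log2(p)        (range-coder proxy)
 * - prefix cost: sum count * ceil(-log2(p))  (rough integer-bit proxy;
 *                a real Huffman code can do better; illustrative only)   */
static CoderChoice selectCoder(const size_t count[256], size_t total)
{
    double idealBits = 0.0, prefixBits = 0.0;
    int s;
    if (total == 0) return CODER_RAW;
    for (s = 0; s < 256; s++) {
        if (count[s]) {
            double p = (double)count[s] / (double)total;
            idealBits  += (double)count[s] * -log2(p);
            prefixBits += (double)count[s] * ceil(-log2(p));
        }
    }
    if (idealBits / 8.0 >= (double)total * 0.99)
        return CODER_RAW;                 /* even ideal coding saves < 1% */
    /* Require a clear size advantage (here 2%, arbitrary) before paying
     * for Range0's slower decoding.                                      */
    if (idealBits < prefixBits * 0.98)
        return CODER_RANGE0;
    return CODER_HUFF0;
}

A real selector would also weigh the block size and the target decoding speed, but the principle stays the same: measure the expected gain before paying for the slower coder.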
