Tuesday, January 4, 2011
Entropy evaluation: small changes in the benchmark corpus
You can't easily improve what you can't measure. That's why test tools are so important.
For compression, a benchmark corpus is an essential tool to measure performance and check consistency.
Its main drawback, however, is that it influences development in ways that make the benchmark look better, while neglecting or even worsening situations which are not represented in the corpus.
Therefore, carefully choosing the elements of the corpus, and even changing them from time to time, is a sensible thing to do.
For entropy evaluation, the selected categories seem right. However, evaluating each of them with a single file type is misleading.
I started by replacing the LZ4 output with the LZ4HC output, which produces different patterns. I then added two files: Win98.vmdk, a virtual hard disk containing many cab files, and an Open Office directory, a mostly binary input. Here are the results:
                    |        Huff0 v0.6          |        Range0 v0.7
                    | Ratio  Compress  Decoding  | Ratio  Compress  Decoding
Not compressible    |
enwik8.7z           | 1.000  810 MB/s  1.93 GB/s | 1.000  885 MB/s  1.93 GB/s
Hardly compressible |
win98-lz4hc-lit     | 1.024  465 MB/s   600 MB/s | 1.019  374 MB/s   262 MB/s
audio1              | 1.070  285 MB/s   280 MB/s | 1.071  174 MB/s    83 MB/s
Distributed         |
enwik8-lz4hc-lit    | 1.290  205 MB/s   194 MB/s | 1.296  150 MB/s    77 MB/s
Lightly Ordered     |
enwik8-lz4hc-offh   | 1.131  197 MB/s   184 MB/s | 1.133  145 MB/s    79 MB/s
Ordered             |
enwik8-lz4hc-ml     | 2.309  214 MB/s   195 MB/s | 2.326  160 MB/s    77 MB/s
Squeezed            |
office-lz4hc-run    | 3.152  218 MB/s   202 MB/s | 3.157  164 MB/s    98 MB/s
enwik8-lz4hc-run    | 4.959  245 MB/s   224 MB/s | 5.788  188 MB/s   148 MB/s
Ultra compressible  |
loong               |   278  785 MB/s  2.93 GB/s |  1430  555 MB/s   427 MB/s
There are several interesting lessons here.
win98-lz4hc-lit is the literals-only part extracted by LZ4HC. But wait, why is it not in the "distributed" category? Well, since this file contains many incompressible chunks, the literal sub-section ends up being mostly incompressible. This is an important real-world example, showing that incompressible segment detection makes a real impact.
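The idea of incompressible segment detection can be sketched with a quick per-block entropy estimate. This is only an illustrative heuristic, not Huff0's actual detection logic, and the 7.9 bits/byte threshold is an assumption:

```python
import math
import os
from collections import Counter

def entropy_bits_per_byte(block: bytes) -> float:
    """Shannon entropy of the block's byte histogram, in bits per byte."""
    n = len(block)
    return -sum((c / n) * math.log2(c / n) for c in Counter(block).values())

def should_store_raw(block: bytes, threshold: float = 7.9) -> bool:
    """Skip entropy coding entirely when a block is close to random:
    coding it would cost time for, at best, a negligible gain."""
    return entropy_bits_per_byte(block) >= threshold

print(should_store_raw(os.urandom(65536)))   # random data: True
print(should_store_raw(b"abacabad" * 8192))  # skewed data: False
```

On a mix like win98-lz4hc-lit, a rule of this kind lets the coder pass incompressible chunks through untouched, which is where the speed difference between the two coders on that file comes from.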
LZ4HC produces fewer literals and fewer, though longer, matches than the fast version. As a consequence, run lengths become much more compressible, while match lengths do not. It shows perfectly that the more squeezed the distribution, the larger Range0's compression advantage.
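The gap can be made concrete with a small worked example (a sketch, not Huff0 or Range0 code): a Huffman code spends at least one whole bit on every symbol, while a range coder can approach the fractional Shannon entropy, so the more skewed the distribution, the wider the gap.

```python
import heapq
import itertools
import math

def shannon_entropy(probs):
    """Lower bound in bits/symbol; a range coder can approach this."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def huffman_avg_bits(probs):
    """Average code length of a Huffman code: whole bits only."""
    tie = itertools.count()  # tie-breaker so the heap never compares dicts
    heap = [(p, next(tie), {i: 0}) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, d1 = heapq.heappop(heap)
        p2, _, d2 = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tie),
                              {s: l + 1 for s, l in {**d1, **d2}.items()}))
    return sum(probs[s] * l for s, l in heap[0][2].items())

squeezed = [0.95, 0.03, 0.01, 0.01]  # run-length-like distribution
flat = [0.25, 0.25, 0.25, 0.25]

print(round(shannon_entropy(squeezed), 3), round(huffman_avg_bits(squeezed), 2))
# -> 0.355 1.07 : Huffman pays ~3x the entropy on the squeezed case
print(round(shannon_entropy(flat), 3), round(huffman_avg_bits(flat), 2))
# -> 2.0 2.0   : on a flat distribution the two are identical
```

This is exactly the pattern in the table: the more squeezed the column (run lengths), the more Range0 pulls ahead of Huff0 in ratio.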
One could think that run lengths are a typical situation which should always benefit Range0. Alas, it is not that simple. Such a conclusion is biased, the result of focusing too much on enwik8 for tests.
Most binary files feature a very different pattern: matches are much less frequent, but much longer. As a consequence, literals tend to be far more numerous, and their compressibility is not guaranteed either. As a side effect, run lengths end up being larger.
This is shown by the office example: although the distribution is still "squeezed", resulting in a pretty good x3 compression ratio, this is still not enough to make Range0 distinctly better. In fact, considering the very small difference with Huff0, it is not worth the speed impact. This stands in contrast with the enwik8 results.
This means we should not assume that run lengths are always better compressed by Range0, which calls for a more complex selection process.
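One possible shape for such a selection process is a toy rule of this kind (not the actual Huff0/Range0 logic; the 1% gain threshold is an assumption): accept the slower range coder only when its measured ratio gain justifies the speed cost.

```python
def pick_coder(huff_ratio: float, range_ratio: float,
               min_gain_pct: float = 1.0) -> str:
    """Prefer the slower range coder only when its compression ratio
    beats Huffman's by more than `min_gain_pct` percent."""
    gain_pct = (range_ratio / huff_ratio - 1.0) * 100.0
    return "range" if gain_pct > min_gain_pct else "huffman"

# With the run-length figures from the benchmark above:
print(pick_coder(3.152, 3.157))  # office run lengths -> huffman (gain ~0.16 %)
print(pick_coder(4.959, 5.788))  # enwik8 run lengths -> range   (gain ~16.7 %)
```

A real selector would also need the ratios cheaply estimated up front, for instance from the symbol histogram, rather than by running both coders.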