Tuesday, March 29, 2011

Ross William's LZRW

Once upon a time...
Well, okay, it's not that long ago, but still, the early 90's seem like a prehistoric era for personal computing by now.
Ross Williams created a fairly interesting family of LZ compressors, featuring excellent speed and memory properties while still providing a good compression ratio (by the standards of the day), which he modestly called LZRW, for Lempel-Ziv-Ross Williams.

It has since become a classic among early compression algorithms, and remains competitive even by today's standards. Indeed, it is also the basis of several modern compressors, such as QuickLZ.

The main idea behind LZRW is this : instead of providing an offset to the LZ decoder, just provide it with the position of the reference inside a table.

It looks complex or crazy at first, but it is quite logical when following Dr Williams's reasoning through his successive versions of LZRW :
Back in those days, computers featuring some 64KB of RAM were still commonplace. Better ones could reach an impressive 256KB. Since table sizes were therefore seriously limited, the capability to track all positions in the buffer to find the best match was, well, mostly out of reach.

LZRW does not target maximal compression, but a fast and practical (i.e. memory-efficient) compression level. By accepting a scarcity of references, it leaves some "holes" in the addressable space, which translates into compression inefficiency.
Hence the main idea : by using the table position instead of the offset, there is no longer any hole in the addressable space : as long as the table is completely filled, all positions have, roughly, an equivalent probability of appearing (well, roughly enough...). Furthermore, all matches, even long-range ones, use exactly the same number of bits to be described, making them more efficient.

The downside is that the decoder must index the data exactly as the compressor did, making it more complex (and slower) than a more standard LZSS decoder. Still, this trade-off is very acceptable, since the speed penalty is not too large.
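To make the idea more concrete, here is a minimal sketch in C of the "emit the table slot, not the offset" mechanism. It is not Ross Williams's actual code : the hash function, the 12-bit table size and the function name are illustrative assumptions.

```c
#include <stdint.h>
#include <string.h>

#define TABLE_BITS 12                      /* 4096 slots -> a match reference always costs 12 bits */
#define TABLE_SIZE (1 << TABLE_BITS)

static uint32_t hash4(const uint8_t* p)    /* hash the next 4 bytes into a table slot */
{
    uint32_t v;
    memcpy(&v, p, 4);
    return (v * 2654435761u) >> (32 - TABLE_BITS);
}

/* One compression step : look up the table, update it, then decide
 * between a literal and a (slot, length) match reference. */
void lzrw_like_step(const uint8_t* src, size_t pos, const uint8_t* table[TABLE_SIZE])
{
    uint32_t h = hash4(src + pos);
    const uint8_t* candidate = table[h];
    table[h] = src + pos;                  /* the decoder performs the very same update */
    if (candidate != NULL && memcmp(candidate, src + pos, 4) == 0) {
        /* emit (h, matchLength) : the reference always takes TABLE_BITS bits,
         * however far back 'candidate' actually is */
    } else {
        /* emit the literal src[pos] */
    }
}
```

Since the decoder rebuilds exactly the same table while decoding, the slot index 'h' is enough for it to recover the match position.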

The LZRW scheme features a "sweet spot" for optimal compression ratio : the larger the table, the better its match-finding capability, hence a better compression ratio. However, the larger the table, the more bits are required to describe a table slot to the decoder, hence a worse compression ratio.

The "best spot" is found where both forces collide.

Or is it ?
One problem is that the optimal number of bits is not the same for every file to be compressed.
For example, with "enwik8", this optimal number is 14 bits.
For "open office", it is 19 bits.
For very small files, this number can go down to 10 or less.
Therefore, the question is : how to find the optimal table size, in order to get the "optimal compression ratio" for any given input ?

Alas, that question has no easy answer, and I still have to find one myself...
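For reference, the only obvious answer is a brute-force one, sketched below under illustrative assumptions (compress_with_table_bits() is a hypothetical helper) : compress with every candidate table size and keep the smallest output. It works, but its cost is exactly what one would like to avoid.

```c
#include <stddef.h>

/* hypothetical helper : compresses 'src' using a table of 2^bits entries,
 * and returns the resulting compressed size */
size_t compress_with_table_bits(const void* src, size_t srcSize, int bits);

int find_best_table_bits(const void* src, size_t srcSize)
{
    int bestBits = 10;
    size_t bestSize = (size_t)-1;
    for (int bits = 10; bits <= 20; bits++) {    /* scan the plausible range */
        size_t s = compress_with_table_bits(src, srcSize, bits);
        if (s < bestSize) { bestSize = s; bestBits = bits; }
    }
    return bestBits;   /* at the price of compressing the input many times over */
}
```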

Saturday, March 26, 2011

Best of both worlds : mixed entropy

There is one thing that really amazed me when I started experimenting with entropy coders : Huffman is very good, in spite of its relatively old age and design simplicity. So good, in fact, that it can beat the much more modern, complex and slower Arithmetic coding in some circumstances.

For illustration, just have a quick look at the entropy coders benchmark. The overall tendency is relatively clear : when no probability is too "large", the precision loss of Huffman is too small to be noticeable. Results are therefore roughly equivalent to Arithmetic.

Indeed, Huffman can even beat Arithmetic, and there are two good reasons for this :

- First, Huffman headers are much more compact than arithmetic ones. Since only bit lengths are necessary, they can be efficiently represented by a tree or a map, which can themselves be compressed easily. By contrast, arithmetic needs frequency counts (or normalized frequency counts), which are typically much larger values, with little opportunity for optimization. As a consequence, Huffman headers are at least 4x smaller, and differences in the range of 6x-10x are common (a rough estimate of this follows the second point).

- Second : since headers are smaller, it is also possible to work with smaller Huffman blocks. Smaller blocks mean the statistics better describe the current block, which translates into an improved compression ratio. This is called entropy adaptation.
As a consequence, it is possible to work with blocks of 16KB with huff0 (below this threshold, header costs start to outweigh entropy adaptation benefits), while the optimal block size for range0 is larger, at 128KB, due to its bigger headers.
Entropy adaptation is the main reason why Huffman achieves a better compression ratio in some scenarios.
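To give a rough idea of where the "4x smaller" figure comes from, here is a back-of-the-envelope estimate. This is an illustration under simple assumptions (4-bit code lengths, 16-bit normalized counts), not the exact huff0 or range0 header formats.

```c
#include <stdio.h>

int main(void)
{
    const int symbols = 256;

    /* Huffman : one code length per symbol ; lengths 0..15 fit in 4 bits
     * (before any further compression of the header itself) */
    int huffmanHeaderBits    = symbols * 4;

    /* Arithmetic : one normalized count per symbol, e.g. 16 bits each */
    int arithmeticHeaderBits = symbols * 16;

    printf("Huffman header    : %d bytes\n", huffmanHeaderBits / 8);     /* 128 bytes */
    printf("Arithmetic header : %d bytes\n", arithmeticHeaderBits / 8);  /* 512 bytes */
    printf("ratio             : %dx\n", arithmeticHeaderBits / huffmanHeaderBits);
    return 0;
}
```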

However, on the other hand, Arithmetic takes the lead when one probability gets beyond 50%. And the closer it gets to 100%, the larger the difference. Huffman is stuck at its best possible representation (1 bit per element), while arithmetic keeps benefiting from higher probability levels : for a symbol with 99% probability, the optimal cost is about 0.015 bit, while Huffman still spends a full bit on it. This is very visible in the lower part of the benchmark table (results of 'enwik8-lz4hc-run' and 'loong').

Granted, these conclusions are tied to the selected "semi-static" methodology, which requires a description header per block. A fully dynamic implementation would provide different results, much to the advantage of Arithmetic, since it suffers more from header size.
However, dynamic probability estimation comes with its own price, namely speed, and the risk of wrong probability estimations.

In order to preserve high speed, I'm still working on block-based entropy coding.
However, in an attempt to improve compression ratio, I was wondering if it would be possible to benefit from the properties of both huff0 and range0.

Enter HuffX, my new entropy coder, which for the time being is mostly a fusion of huff0 and range0, with a dynamic selector.
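The post does not detail the selector itself, so here is one plausible sketch of such a dynamic per-block selector (an assumption of mine, not necessarily HuffX's actual rule) : inspect each block's statistics and route it to the range coder whenever one symbol dominates. The 50% threshold echoes the observation above ; huff_encode() and range_encode() are assumed to exist elsewhere.

```c
#include <stddef.h>

size_t huff_encode (void* dst, const unsigned char* src, size_t srcSize);
size_t range_encode(void* dst, const unsigned char* src, size_t srcSize);

size_t encode_block(void* dst, const unsigned char* src, size_t srcSize)
{
    size_t count[256] = {0};
    size_t maxCount = 0;
    for (size_t i = 0; i < srcSize; i++) {
        if (++count[src[i]] > maxCount) maxCount = count[src[i]];
    }
    if (maxCount * 2 > srcSize)            /* dominant symbol above 50% : squeezed block */
        return range_encode(dst, src, srcSize);
    return huff_encode(dst, src, srcSize); /* otherwise Huffman keeps the speed advantage */
}
```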

In its first incarnation, progress is visible, as shown in the benchmark results below :



By default, HuffX behaves like huff0, featuring identical ratio, output and speed.
Things get interesting when the input sequence is more squeezed : HuffX gradually hands more data blocks to range0.
At an interesting middle point, HuffX outperforms both huff0 and range0, while retaining most of the speed of huff0. This is the kind of result I'm aiming at.
Eventually, HuffX ends up handing all blocks to range0, featuring ratio and speed close to it... but not identical. HuffX still retains the 16KB block size of its older brother, which is a bit too small for pure range coding.

These first results show that the improved compression ratio comes with a speed cost attached, as expected. Therefore, it remains unclear whether HuffX can find its way into fast compressors such as Zhuff or Etincelle.
Still some work ahead...

Wednesday, March 16, 2011

Zhuff : new version with multi-threading (v0.7)

After all these experiments, it is only logical to bring multi-threading capabilities to Zhuff.
And improvements are quite significant : in a pure in-memory comparison, 2 threads are (almost) twice as fast, damn close to linear scaling.

As with LZ4, I've selected the block division method, due to its simplicity and parallel ramp-up capability. The blocks are a bit smaller, to limit buffer memory.

Although more complex than LZ4, Zhuff is nonetheless fast enough to exhaust any mechanical HDD bandwidth with less than a single core. You will therefore need something faster to experience Zhuff's speed on multi-core systems, something like a RAM drive.

Even then, Zhuff is likely to reach RAM drive speed limits on quad-core systems, something I can't test myself, but would be glad to get reports on.



version     threads   Ratio   Compression speed   Decoding speed
Zhuff 0.7   -t1       2.584   147 MB/s            278 MB/s
Zhuff 0.7   -t2       2.584   285 MB/s            550 MB/s


You can download Zhuff on its new homepage.

Saturday, March 12, 2011

Advanced Multithreading algorithm (Huff0 v0.9)

Not being entirely satisfied with the latest results of Huff0 v0.8, I've been investigating the threading performance issue mentioned in the last post. It seems the suspect was indeed guilty : creating threads is a really costly operation.

This was not too noticeable with LZ4, thanks to its naturally large chunk size, but large chunks are not an all-around solution : they also hurt granularity. For example, if a test file is divided into 3 approximately equal chunks, then on a dual-core system one core gets twice as much work to do, and becomes the bottleneck of the system.

To solve this issue, jobs must be divided into smaller tasks, so that the potentially less efficient "tail job" remains small in comparison with the rest of the work, during which all cores are running at full steam.

And in order to work with small tasks, the right thing to do is to keep your threads alive, and feed them regularly.

There are obviously several ways to achieve this, including "do it yourself" ones. Right now, I've been experimenting with two well-documented ones for Windows : Thread Pools, and I/O completion ports.

At first sight, Thread Pools seem the obvious way to go. This Windows API was meant for exactly this kind of task : just pile up jobs, and they get dealt with as fast as the system can complete them, dispatched over the available cores.
There is just one little problem : this API is relatively recent, and unavailable on Windows XP and earlier systems. Quite a deal breaker.
Or is it ? Its little cousin, QueueUserWorkItem(), was created back in 2000, and does essentially the same job.
And its usage is damn simple. The call structure is the same as CreateThread()'s. Just use it as a kind of "fire and forget" job queue.
It's extremely efficient. Using this API, I was able to decrease the chunk size by a factor of 100 with no performance degradation. Quite an achievement.

There are just 2 little problems :

1) There is no API to check when all jobs are completed. Quite a dumb (or faulty ?) design. This was apparently corrected later for other languages (C#, C++, even VB), but never for C. In order to know when all segments are compressed, it is necessary to build one's own mechanism, and things get quite nasty at this stage. Since this problem is pretty frequent, many methods can be found by searching the Internet, several of them being hackish and of questionable stability.
The most interesting one, in my opinion, is an interlocked counter, in conjunction with an event wait (see the sketch after the second point). Well, now the whole system is a lot more complex than it was supposed to be.

2) You get no control over system resources. There is no way to say, for example, "use just 2 cores, please". This simple API uses whatever CPU resources are available.
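For illustration, here is a minimal sketch of that workaround : jobs are queued with QueueUserWorkItem(), and the last job to finish signals an event. This is my own illustrative code, not Huff0's : CompressChunk, g_remaining and g_allDone are made-up names.

```c
#include <windows.h>

static volatile LONG g_remaining;   /* jobs still in flight */
static HANDLE g_allDone;            /* signaled when g_remaining reaches 0 */

static DWORD WINAPI CompressChunk(LPVOID chunk)
{
    /* ... compress the chunk described by 'chunk' ... */
    if (InterlockedDecrement(&g_remaining) == 0)
        SetEvent(g_allDone);        /* the last job out signals completion */
    return 0;
}

void compress_all(void* chunks[], int nbChunks)
{
    if (nbChunks <= 0) return;
    g_remaining = nbChunks;
    g_allDone = CreateEvent(NULL, FALSE, FALSE, NULL);
    for (int i = 0; i < nbChunks; i++)
        QueueUserWorkItem(CompressChunk, chunks[i], WT_EXECUTEDEFAULT);
    WaitForSingleObject(g_allDone, INFINITE);   /* no built-in "all jobs done" call in C */
    CloseHandle(g_allDone);
}
```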

OK, this is starting to add up to a lot of issues. So I investigated the second method, I/O completion ports.
It sounds scary, and... it looks scary too. Using this old API (available since Windows NT4 !), it's obvious it was never meant to become a general-purpose job scheduler. But that's one thing it can do, and do very well.
Initialization is significantly more complex than with Thread Pools, but once it is correctly understood, you get all the advantages without the problems. Speed is excellent, the number of concurrent threads can be finely tuned, and checking for job completion is much more natural and robust.
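Here is a minimal sketch of an I/O completion port used as a plain job queue. It is illustrative code of mine, not Huff0's actual implementation ; Worker, init_pool and push_job are made-up names.

```c
#include <windows.h>

static HANDLE g_port;

static DWORD WINAPI Worker(LPVOID unused)
{
    DWORD bytes; ULONG_PTR key; LPOVERLAPPED ov;
    (void)unused;
    for (;;) {
        GetQueuedCompletionStatus(g_port, &bytes, &key, &ov, INFINITE);
        if (key == 0) break;                 /* key 0 used here as a "quit" signal */
        /* ... process the job referenced by 'key' (e.g. a chunk descriptor) ... */
    }
    return 0;
}

void init_pool(int nbThreads)
{
    /* the last parameter caps the number of threads allowed to run concurrently */
    g_port = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, nbThreads);
    for (int i = 0; i < nbThreads; i++)
        CloseHandle(CreateThread(NULL, 0, Worker, NULL, 0, NULL));
}

void push_job(void* job)
{
    /* the "completion key" simply carries the job pointer */
    PostQueuedCompletionStatus(g_port, 0, (ULONG_PTR)job, NULL);
}
```

The last parameter of CreateIoCompletionPort() is what makes the number of concurrent threads so easy to control, and completion checks can be built on the same port.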

So here it is. This new version of Huff0 (v0.9) uses an I/O completion port to divide data into much smaller chunks, resulting in more stable and faster results. A small scanning improvement was also added, for the benefit of faster compression.

Detailed performance assessment :


Huff0 v0.8 -t2
Huff0 v0.9 -t2

Ratio Compress Decoding Ratio Compress Decoding
Not compressible





enwik8.7z 1.000 1.42 GB/s 1.91 GB/s 1.000 1.50 GB/s 1.94 GB/s
Hardly compressible





win98-lz4hc-lit 1.024 850 MB/s 995 MB/s 1.024 905 MB/s 995 MB/s
audio1 1.070 540 MB/s 525 MB/s 1.070 560 MB/s 530 MB/s
Distributed





enwik8-lz4hc-lit 1.290 395 MB/s 375 MB/s 1.290 400 MB/s 380 MB/s
Lightly Ordered





enwik8-lz4hc-offh 1.131 375 MB/s 350 MB/s 1.131 380 MB/s 360 MB/s
Ordered





enwik8-lz4hc-ml 2.309 375 MB/s 380 MB/s 2.309 375 MB/s 385 MB/s
Squeezed





office-lz4hc-run 3.152 330 MB/s 310 MB/s 3.152 420 MB/s 400 MB/s
enwik8-lz4hc-run 4.959 475 MB/s 430 MB/s 4.959 480 MB/s 430 MB/s
Ultra compressible





loong 278 1.49 GB/s 3.22 GB/s 278 1.98 GB/s 3.50 GB/s


The interesting bit is this : the new threading method is generic, and can be applied to any other program. So this is good news for further improvements.

Wednesday, March 9, 2011

Multi-threaded entropy experiment : Huff0 v0.8

As a logical step following LZ4, I've been porting multi-threading capabilities to Huff0. For the time being, this is only available in benchmark mode. But since Huff0 is not really a compressor, rather a component for compressors, I do not plan to bring it to the file interface for now.

However, this implementation provided some useful lessons.

First, on multi-threading cost.
LZ4 has a very simple design, so it was only natural to keep its large block structure and adapt it to parallel compression. With Huff0, doing the same directly results in a dramatic performance slowdown.
The reason : creating new threads carries a significant performance overhead.
In order to limit this cost, a simple solution is to limit the number of times a new thread is created; a very simple way to achieve this is to deal with larger blocks of data.
Huff0 therefore features a double segmentation layer, with large blocks defined at thread level, while smaller ones are used for entropy adaptation purposes (i.e. the compression ratio is better when blocks are smaller, because the statistics are better adapted to each block).
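A minimal sketch of this double segmentation, with illustrative sizes and names (the thread-level chunk size is an assumption ; the 16KB entropy block matches the value discussed in the previous post) :

```c
#include <stddef.h>

#define THREAD_CHUNK  (4 * 1024 * 1024)   /* large : amortizes the cost of dispatching to a thread */
#define ENTROPY_BLOCK (16 * 1024)         /* small : keeps the statistics well adapted             */

void thread_job(const unsigned char* chunk, size_t chunkSize)
{
    for (size_t pos = 0; pos < chunkSize; pos += ENTROPY_BLOCK) {
        size_t blockSize = (chunkSize - pos < ENTROPY_BLOCK) ? chunkSize - pos : ENTROPY_BLOCK;
        (void)blockSize;
        /* build statistics and a Huffman table for this block only,
         * then encode it : entropy adaptation at small granularity */
    }
}
```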

Thinking aloud, I'm wondering if performance would be better with a different threading strategy : instead of creating a new thread for each block, what about putting the thread to sleep, filling its tables, and resuming it ?
I'm not sure it would change the performance conclusion, but it's probably worth a try.
[Edit] and indeed, it is, see follow-up : http://fastcompression.blogspot.com/2011/03/advanced-multithreading-analysis-huff0.html

Second, on memory access behavior.
This part is a bit more tricky.
Making code "multi-thread compatible" requires little more than getting rid of all forms of global variables, keeping only local ones (i.e. allocated on the stack) or, when necessary, allocating memory in a kind of "context structure" provided as part of the function parameters. This second point makes the code very much "object-oriented".
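For illustration, a minimal example of the context structure pattern (the names are mine, not Huff0's) : all state travels through a parameter, so each thread can work on its own context without touching anyone else's.

```c
#include <stddef.h>

typedef struct {
    unsigned count[256];             /* per-call statistics : no global state */
} HistCtx;

/* every piece of state goes through 'ctx', so two threads can each
 * use their own HistCtx without stepping on each other */
void histogram(HistCtx* ctx, const unsigned char* src, size_t srcSize)
{
    for (size_t i = 0; i < srcSize; i++)
        ctx->count[src[i]]++;
}
```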
Now on to the findings : I've been witnessing a performance hit on heavily accessed memory. When, for example, a statistics or counter structure is declared and, by nature, updated very often, performance is better when the structure is declared as a global variable rather than a local one. (Note that declaring it as a "static local" variable has the same effect.)
I'm not sure I fully understand that one, but I've found a lead in this comment :

"Global variables are accessed via a single known address, whereas local variables are typically accessed by indexing off an address register."

When calculating the memory address, local variables require one more (fast) addition than global ones. It sounds like very little, but precise benchmarks do show a difference.
I still have to find a way around this performance hit though, if that's possible...
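To picture the two cases, here is the kind of code where the difference showed up (an illustrative reproduction, not Huff0's actual source) :

```c
#include <stddef.h>

static unsigned count_global[256];              /* one fixed address, known at link time */

void scan_global(const unsigned char* in, size_t len)
{
    for (size_t i = 0; i < len; i++)
        count_global[in[i]]++;                  /* direct addressing */
}

void scan_local(const unsigned char* in, size_t len)
{
    unsigned count_local[256] = {0};            /* lives on the stack */
    for (size_t i = 0; i < len; i++)
        count_local[in[i]]++;                   /* address = stack pointer + offset */
    (void)count_local;                          /* results discarded : illustration only */
}
```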

You can download the new version of Huff0 here.

Monday, March 7, 2011

Reaching system speed limits

Although results from the CompressionRatings benchmark were good, I was nonetheless surprised to notice that decoding speed would not improve when increasing the number of threads.

The only explanation I could come up with is that the LZ4 decoder decompresses data faster than the RAM drive can deliver it. Which is likely correct, considering the in-memory benchmark results (LZ4 can decompress at an average of 800 MB/s per core, while the RAM drive can only deliver between 400 and 500 MB/s).

However, it would not explain why compressing with 4 threads was faster than decoding...

Here also, there is a plausible explanation : writing to the RAM drive may be slower than reading from it. In that case, compression has an advantage, since it writes less data.

Now, let's put that hypothesis to the test. I built a quick benchmark to measure read and write speeds on a RAM drive installed in a Windows XP box. I'm using Gavotte's Ramdisk for this test. Using another ramdisk might result in different speeds, but is unlikely to dramatically change the conclusions. A different OS might also change the results, but since CompressionRatings runs on Windows XP, I'm mostly interested in mimicking the same conditions.

Running the benchmark, I witnessed a very stable 1190 MB/s for read operations.
On the other hand, write operations were limited to "only" 770 MB/s.
So that's confirmed : writing is slower than reading.

This is consistent with the CompressionRatings results. Extrapolating from these figures :
when compressing at a ratio of 2:1, the ramdrive read+write speed is limited to 525 MB/s;
when decoding the same data 1:2, the speed limit drops to 455 MB/s.
Hey, that's about the same ratio as the LZ4 compression/decompression speed difference...

Thursday, March 3, 2011

LZ4 : confirmed as world's fastest compressor

After Stephan's SqueezeChart declared LZ4 the fastest compressor seen on his public benchmark, LZ4 was once again put to the test, this time by Sami Runsas's Compression Ratings.

This benchmark is very interesting : its policy is very clear, its corpus is intentionally heterogeneous and carefully selected (so much so that I use it regularly for calibration), and its emphasis on speed and precision makes it a reliable comparison tool, especially for fast compressors.

It is also heavily multi-thread oriented. Therefore, in order to sustain the comparison, LZ4 needed to be upgraded with multi-threading capabilities, since it would stand no chance running on a single core against the best-of-breed multi-threaded compressors. That's where version 0.9 comes in...

And it fares pretty well on its first try.
LZ4 takes the top position for both compression and decompression speed, while also providing a better ratio than the previous king of the hill, the excellent 4x4 from Bulat Ziganshin.
And the difference is pretty large : +50% in compression speed, +25% in decompression speed (using less than a full thread to max out), and +9% in compression ratio.

As a complement, it's interesting to observe that the decoding speed of LZ4 does not vary when increasing the number of threads. That's because decoding is so fast that half a CPU is enough to exhaust the RAM disk's bandwidth.

With such speed available, compressing or decompressing data takes about the same time as just copying it... from a main-memory RAM drive. It opens the way to almost free compression for, well, any usage, even on-the-fly compression of temporary data for instance.

LZ4 is available on its homepage.

Tuesday, March 1, 2011

Multi-threading compression available : LZ4 v0.9

Previous versions of LZ4 were already displaying promising multi-threading results, but only in benchmark mode. Now this feature is fully accessible, with a real "file i/o" interface.

You will need a very fast disk drive to experiment with it. In effect, only a RAM drive can be expected to deliver data fast enough to feed LZ4, especially in multi-threading modes.

The block subdivision method was selected, due to its design simplicity and scalability. Compared with previous versions, even the single-thread mode benefits greatly, since multi-threading requires asynchronous data loading, which means that reading from and writing to the disk are done in parallel with compression.
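A minimal sketch of what asynchronous data loading can look like (an illustration under my own assumptions, not LZ4's actual implementation ; the chunk size and names are made up) : two buffers take turns, one being compressed while the other is filled by a reader thread.

```c
#include <stdio.h>
#include <windows.h>

#define CHUNK (8 * 1024 * 1024)   /* illustrative chunk size */

struct readJob { FILE* f; unsigned char* dst; size_t read; };

static DWORD WINAPI readNext(LPVOID p)
{
    struct readJob* job = (struct readJob*)p;
    job->read = fread(job->dst, 1, CHUNK, job->f);   /* fill the spare buffer */
    return 0;
}

void compress_file(FILE* in)
{
    static unsigned char bufA[CHUNK], bufB[CHUNK];
    unsigned char *cur = bufA, *next = bufB;
    size_t curSize = fread(cur, 1, CHUNK, in);
    while (curSize > 0) {
        struct readJob job = { in, next, 0 };
        HANDLE h = CreateThread(NULL, 0, readNext, &job, 0, NULL);
        /* ... compress and write 'cur' (curSize bytes) while 'next' is being read ... */
        WaitForSingleObject(h, INFINITE);
        CloseHandle(h);
        { unsigned char* tmp = cur; cur = next; next = tmp; }   /* swap roles */
        curSize = job.read;
    }
}
```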

I can't really test beyond 2 threads, since I only have dual-core systems within reach, but the code should prove pretty scalable on quad-core systems and beyond (up to 32 threads).

You can download and test this new version on the LZ4 homepage.