Friday, January 21, 2011

ARM Programming (part 1)

There is a sense of shifting tide in the CPU business. The nimble ARM (Advanced RISC Machine) looks in a position to threaten the almighty Intel's future.

This is a classic David versus Goliath situation, with all our sympathy usually reserved for the little outsider. But in this case, I do believe the little one indeed deserves it.

While Intel's business grew thanks to the PC, it leveraged its near-monopoly position to become a dominant player in micro-electronics at large. Its processors were fast enough to quickly displace RISC ones from anything the size of a laptop or larger. It then entered the server market, where it made short work of specialists such as Alpha or Sun. Its domination seemed irresistible, and even large strategic failures such as the Rambus episode did little to alter this situation.

During all this time, one area remained free of Intel x86 domination: the world of small hand-held terminals and equivalents, such as smartphones. In this world, power consumption is key, and the huge requirements of x86 processors (60W!) simply did not fit the bill.

ARM thrived, proposing better value for the money and, more importantly, inventing a new business model. Since ARM was (and still is) a small company, it neither manufactures nor sells processors.

Being fabless is nothing new. In the CPU business, Cyrix showed great promise during its era. Nvidia and ATI are two gigantic fabless companies. The model is so compelling that even AMD recently converted itself to it.

However, not even selling processors was a bold move. ARM only sells licenses, for others to build processors based on its technology. As an analogy, they sell the plans for a house, not the house itself. Consequently, the plans cost much less than the house; so little, in fact, that any entrepreneur can buy them and make a small fortune by selling houses. Hence an attractive model.
The decision, however, was not only about cost. It was also meant to support customizations. In the world of embedded equipment, such customizations are commonplace, and it was necessary to find a model to accommodate them.

ARM found a very important customer when Nokia decided to switch its CPU line to this architecture. With volumes reaching incredible levels, and many industrial customers of all sizes eager to produce their own ARM processors, it quickly became the de facto standard for handheld devices, overshadowing all other competitors.

Today, ARM processors are sold in much larger volumes than x86 ones, and the difference keeps growing. ARM as a company, though, is still a (relatively) small one, and the strength of this architecture comes from its ecosystem, not from the design company alone.

What changed recently is that both worlds collided, with the advent of new tablet terminals such as the Apple iPad. Now, for the first time, ARM-based terminals are invading an area traditionally reserved for their PC siblings, such as small laptops and netbooks. And this is not just a small attempt by some obscure technology company: massive involvement from Apple ensures that this new market is here to stay, with many new players entering the fray this coming year.

Now, we'll witness a fight to settle where the frontier between the PC (x86) world and the smartphone (ARM) world will lie.

Just the border? Or is it more complex? Indeed it is.
Intel has been preparing its counter-offensive, and is now ready to propose ultra-low-power x86 CPUs that can be embedded into smartphones, such as the next iteration of Atom. On the other side, ARM has announced a new processor design for the server market. Make no mistake, this is all-out war, not just a border dispute.

It is my own guess that Intel's attempt to enter the smartphone market will fail, while ARM's entry into the server market will ultimately succeed. There are several reasons for this belief:

- ARM processors consume much less power than Intel ones; for server CPUs, think of something in the range of a 10x difference. Intel will have to fight back with lower-consumption models (and it will).

- In a large server room, the main constraint is power supply. The more watts a CPU consumes, the more heat becomes a problem, with large cost and space penalties attached. A large consumption difference will have a tremendous effect on these installations.

- The server world is not really about Windows, but rather about Linux. Linux derivatives and associated open-source programs already work well on ARM (they can be recompiled for any target, since the source code is available). The transition is therefore not as dramatic as it may look.
Anyway, even Microsoft has announced that a Windows version for ARM is coming soon.

So here it is: ARM processors are spreading into more and more areas, and are likely to become a dominant platform within this decade. Therefore, being able to program for them will become an expected skill for any programmer.

In the next post, we'll see which differences have to be considered to make your code ARM-friendly.

Friday, January 14, 2011

Multi-threaded compression

For a few years now, most PC systems sold have featured multi-core processors. As a logical consequence, there is increasing pressure from users for applications capable of using all these available cores, since sleeping silicon seems a waste.

Obviously, this trend has also reached compression utilities. Now the question is, how does all this work?

There are several methods, and we'll investigate the easiest one to begin with. The simplest idea is to split the source file into smaller blocks, and process these blocks in parallel. This is easy to understand and set up, and in practice it works well. Very well indeed, since this generic method scales to any number of cores, up to as many as there are blocks.
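
To make the idea concrete, here is a minimal sketch of the block-splitting scheme in C, using pthreads and a hypothetical compress_block() back-end (the function name and the output-size bound are placeholders, not an actual API). A real implementation would use a thread pool and bounded memory rather than one thread and one output buffer per block, but the structure is the same:

    /* Sketch of block-split parallel compression.
     * compress_block() is a hypothetical back-end (LZ4-like signature);
     * error handling and output framing are omitted for clarity. */
    #include <pthread.h>
    #include <stdlib.h>

    #define BLOCK_SIZE (1 << 20)   /* 1 MB blocks (see discussion below) */

    extern int compress_block(const char* src, char* dst, int srcSize);   /* hypothetical */

    typedef struct { const char* src; char* dst; int srcSize; int dstSize; } job_t;

    static void* worker(void* arg)
    {
        job_t* job = (job_t*)arg;
        /* each block is compressed independently : no shared state */
        job->dstSize = compress_block(job->src, job->dst, job->srcSize);
        return NULL;
    }

    void compress_parallel(const char* src, size_t srcSize)
    {
        size_t nbBlocks = (srcSize + BLOCK_SIZE - 1) / BLOCK_SIZE;
        pthread_t* th   = malloc(nbBlocks * sizeof(*th));
        job_t*     jobs = malloc(nbBlocks * sizeof(*jobs));

        for (size_t i = 0; i < nbBlocks; i++) {
            size_t start = i * BLOCK_SIZE;
            size_t len   = (srcSize - start < BLOCK_SIZE) ? (srcSize - start) : BLOCK_SIZE;
            jobs[i].src     = src + start;
            jobs[i].srcSize = (int)len;
            jobs[i].dst     = malloc(BLOCK_SIZE + BLOCK_SIZE/255 + 16);   /* worst-case guess */
            pthread_create(&th[i], NULL, worker, &jobs[i]);
        }
        for (size_t i = 0; i < nbBlocks; i++) {
            pthread_join(th[i], NULL);
            /* jobs[i].dst now holds jobs[i].dstSize compressed bytes, in block order */
        }
        for (size_t i = 0; i < nbBlocks; i++) free(jobs[i].dst);
        free(jobs); free(th);
    }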

But there is a drawback: within a block, it is not possible to use information from another block to improve compression. And this is a real issue, with a direct impact on compression ratio.

Fortunately, for "low-depth" compressors such as LZ4 or Zhuff, this problem is not too serious. Since these compressors only use the last 64KB of data to predict the next bytes, the inefficiency is limited to the first 64KB of each block. This means that, for a big enough block, the inefficiency can be made small.
But how big is "big enough"?

I was asked this very question by Con Kolivas, who is working on his own lrzip compressor (only available for Linux for the time being). Since lrzip is a heavily multi-threaded implementation of 7zip on top of rzip, he wants to dispatch as many jobs as possible while keeping memory usage in check. Selecting too large a block size has its own drawbacks, such as consuming more memory and adding latency before all cores are loaded. Therefore, we want the block size to be as small as possible, but not so small that the inefficiency gets out of hand.

Here are some results from LZ4 working on enwik8 with various block sizes, expressed as multiples of the search depth:

Block Size Inefficiency
2X 2.4%
4X 1.2%
8X 0.6%
16X 0.3%
32X 0.15%


As can be seen, the inefficiency is halved at each step, which is no accident: the "inefficient" part of the block is only the first 1X, so its relative weight is divided by 2 each time the block size is multiplied by 2.
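
A rough way to model this (an approximation, not an exact law): if the block size is N times the search depth, only the first 1/N of each block lacks history, so

    \text{inefficiency}(N) \approx \frac{C}{N}

where C is the loss measured on the history-less portion. Going by the table above, C is roughly 4.8% for enwik8, which reproduces the 0.3% figure at N = 16.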

All in all, enwik8 is probably one of the worst cases (outside of specifically crafted ones), since it relies heavily on match finding, so it is a good basis for selecting a threshold.

In my opinion, 16X is already quite good, with a low enough inefficiency ratio.
16X of 64KB means a 1MB block size. In a word, manageable.

Monday, January 10, 2011

Improving I/O : Zhuff v0.5

There is a subtle difference between Windows Seven and good old Windows XP when it comes to reading data from a disk: the prefetcher has matured a lot.

To put it simply, the prefetcher is in charge of loading data into memory before the program even requests it. Sounds like magic? Well, not really: when you watch a film, for example, it's pretty obvious that the movie file will be read sequentially. So after the program has requested data from position 1M to position 2M, it's quite logical to expect the next read to be from 2M to 3M.

By prefetching HDD data into memory, the OS makes it (almost) instantly available to the program, smoothing video playback, or whatever the program wants to do with it.
Obviously, the simplest form of prefetching would be to load the entire file into memory; in this case, we call it a "cache". But this is not really practical: a film is several GB large, for example. So a more modest version is to simply load the next few KB of data, just in case the program requires them later.

In practice, this works relatively well. There are nonetheless some subtle differences between systems, one of which I just learned about this week while studying XP behavior.

I was puzzled by reports that the new v0.4 version of Zhuff was sometimes slower on systems running XP, even though the underlying compression algorithm was proven to be much faster. The only other identified change was a move from 512KB chunks to 1MB chunks.
This proved enough to overload the XP prefetcher, while the same change had no effect on my Windows Seven test system. Hence my surprise.

I worked my way through new I/O code, which splits the data into smaller chunks, down to 256KB. Below that point, there is a small compression penalty. Nothing too large though...
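
As an illustration, here is roughly what a chunked read loop looks like (a simplified sketch; compress_chunk() is a stand-in for Zhuff's actual compression call, and the real code decouples buffer size from read size):

    /* Simplified chunked-I/O loop. compress_chunk() is hypothetical;
     * the point is that each read request stays small (256 KB),
     * which the XP prefetcher handles much better than 1 MB requests. */
    #include <stdio.h>
    #include <stdlib.h>

    #define CHUNK_SIZE (256 * 1024)

    extern int compress_chunk(const char* src, char* dst, int srcSize);   /* hypothetical */

    void compress_file(FILE* fin, FILE* fout)
    {
        char* inBuff  = malloc(CHUNK_SIZE);
        char* outBuff = malloc(CHUNK_SIZE + CHUNK_SIZE/255 + 16);   /* worst-case guess */
        size_t readSize;

        while ((readSize = fread(inBuff, 1, CHUNK_SIZE, fin)) > 0) {
            int cSize = compress_chunk(inBuff, outBuff, (int)readSize);
            fwrite(outBuff, 1, (size_t)cSize, fout);
        }
        free(inBuff);
        free(outBuff);
    }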

It required a bit more work than it sounds, decoupling buffers from read/write operations. Anyway, the results were better than expected. To give an example, using a RAM disk under Windows XP, one of my test files went from 5.20s (v0.4) down to 4.50s (v0.5). Quite a large improvement for just an I/O change.
By contrast, Windows Seven showed no change: timing remained steady at 4.75s, whatever the chunk size.

Finally, you can get your hands on the result, a new version of Zhuff, which you can download here.
Improvements will be mostly visible with a very fast disk, such as an SSD or a RAM drive.

Friday, January 7, 2011

An old friend returns : new version of Zhuff (v0.4)

After working on LZ4 and then Huff0 improvements, it's quite natural to take another look at Zhuff, an old compression program that was created by simply combining LZ4 and Huff0.

There is slightly more to it. Zhuff also uses a sliding window, in place of the simpler chunking used in LZ4. This avoids wasting memory, and keeps the compression potential at full window size all the time. I also added an incompressible segment detector.
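
Zhuff's detector itself isn't detailed here, but the general principle behind such a fallback is easy to sketch: try to compress a segment, and if the output isn't smaller, store the segment raw behind a one-byte flag. A minimal illustration, with a hypothetical compress_segment() back-end (not Zhuff's actual code):

    /* Minimal incompressible-segment fallback (illustration only).
     * Worst-case expansion is bounded to 1 header byte per segment. */
    #include <string.h>

    #define SEG_RAW        0   /* segment stored as-is  */
    #define SEG_COMPRESSED 1   /* segment is compressed */

    extern int compress_segment(const char* src, char* dst, int srcSize);   /* hypothetical */

    int store_segment(const char* src, int srcSize, char* dst)
    {
        int cSize = compress_segment(src, dst + 1, srcSize);
        if ((cSize <= 0) || (cSize >= srcSize)) {
            dst[0] = SEG_RAW;                  /* no gain : copy raw */
            memcpy(dst + 1, src, srcSize);
            return srcSize + 1;
        }
        dst[0] = SEG_COMPRESSED;
        return cSize + 1;
    }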

In its time, Zhuff succeeded in being the fastest compressor at its compression ratio, winning first place in several benchmarks. Since then, the focus has shifted to multi-threading, putting Zhuff at a disadvantage since it uses just one core. Nonetheless, it still features an excellent energy-efficiency profile, just behind Etincelle (which works best on larger source files).

This could change in the future, but for the time being, let's just reuse recent learnings to improve on last year's version.

It worked relatively well. The new version is significantly faster and, on top of that, slightly improves the compression ratio.


version      Compression Ratio   Compression Speed   Decoding Speed
Zhuff 0.4    2.532               161 MB/s            279 MB/s
Zhuff 0.3    2.508               139 MB/s            276 MB/s

You may note that the compression ratio is noticeably better than LZ4HC's, while the speed is much higher. It simply shows that the compression power of an entropy stage is stronger than that of a full match search, while costing much less CPU.

On the other hand, decoding speed is sharply reduced compared to LZ4, which is also an effect of second-stage entropy compression.

You can download it on its webpage.

Tuesday, January 4, 2011

Entropy evaluation : small changes in benchmark corpus

You can't easily improve what you can't measure. That's why test tools are so important.
For compression, a benchmark corpus is an essential tool to measure performance and check consistency.
Its main drawback, however, is that it tends to steer development towards whatever makes the benchmark look better, neglecting or even worsening situations which are not in the corpus.

Therefore, carefully choosing elements within the corpus, and even changing them from time to time, is a sensible thing to do.

For entropy evaluation, the categories selected seem right. However, evaluating with just a single file type per category is misleading.
I started by replacing the LZ4 output with LZ4HC output, which produces different patterns. I then added two sources: Win98.vmdk, a virtual hard disk containing many cab files, and an OpenOffice directory, a mostly binary input. Here are the results:




                     Huff0 v0.6                           Range0 v0.7
                     Ratio   Compress   Decoding          Ratio   Compress   Decoding
Not compressible
  enwik8.7z          1.000   810 MB/s   1.93 GB/s         1.000   885 MB/s   1.93 GB/s
Hardly compressible
  win98-lz4hc-lit    1.024   465 MB/s   600 MB/s          1.019   374 MB/s   262 MB/s
  audio1             1.070   285 MB/s   280 MB/s          1.071   174 MB/s   83 MB/s
Distributed
  enwik8-lz4hc-lit   1.290   205 MB/s   194 MB/s          1.296   150 MB/s   77 MB/s
Lightly ordered
  enwik8-lz4hc-offh  1.131   197 MB/s   184 MB/s          1.133   145 MB/s   79 MB/s
Ordered
  enwik8-lz4hc-ml    2.309   214 MB/s   195 MB/s          2.326   160 MB/s   77 MB/s
Squeezed
  office-lz4hc-run   3.152   218 MB/s   202 MB/s          3.157   164 MB/s   98 MB/s
  enwik8-lz4hc-run   4.959   245 MB/s   224 MB/s          5.788   188 MB/s   148 MB/s
Ultra compressible
  loong              278     785 MB/s   2.93 GB/s         1430    555 MB/s   427 MB/s


There are several interesting learnings here.
Win98-lz4hc-lit is the literals field extracted by lz4hc. But wait, why is it not in the "distributed" category? Well, since this file contains many incompressible chunks, the literals end up being mostly incompressible too. This is an important real-world example, showing that incompressible segment detection makes a real impact.

lz4hc produces fewer literals and fewer matches, but longer ones, than the fast version. As a consequence, run lengths are much more compressible, while match lengths are not. It shows clearly that the more squeezed the distribution, the bigger Range0's compression advantage.

One could think that run lengths are a typical situation which should always benefit Range0. Alas, it is not that simple. Such a conclusion is biased, a result of focusing too much on enwik8 for tests.
Most binary files feature a very different pattern: much less frequent matches, but much longer ones. As a consequence, literals tend to be far more numerous, and their compressibility is not guaranteed either. And as a side effect, run lengths end up being larger.

This is shown by the office example: although the distribution is still "squeezed", resulting in a pretty good x3 compression ratio, this is still not enough to make Range0 distinctly better. In fact, considering the very small difference with Huff0, it's not worth the speed penalty. This contrasts with the enwik8 results.

This means we should not assume run lengths are always better compressed by Range0; a more sophisticated selection process is required.

Monday, January 3, 2011

Range0 : new version (v0.7)

Several learnings from the new Huff0 version were generic enough to be applicable to Range0 as well. More specifically, the interleaved counters and the call to memcpy() resulted in significant improvements, as displayed below:

                     Range0 v0.6                     Range0 v0.7
                     R       C (MB/s)  D (MB/s)      R       C (MB/s)  D (MB/s)
Not compressible
  enwik8.7z          1.000   870       1400          1.000   885       1930
Hardly compressible
  audio1             1.071   174       83            1.071   174       83
Distributed
  enwik8-lz4-lit     1.370   155       76            1.370   155       76
Lightly ordered
  enwik8-lz4-offh    1.191   138       76            1.191   138       76
Ordered
  enwik8-lz4-ml      2.946   155       83            2.946   160       83
Squeezed
  enwik8-lz4-run     4.577   163       116           4.577   180       116
Ultra compressible
  loong              1430    362       427           1430    555       427


The memcpy() call works wonders at improving speed on incompressible segments.
More importantly, the interleaved counters speed up compression of squeezed distributions.

The ultra-compressible corner case is not so important, in spite of the huge performance gain, but the squeezed-distribution benefit is, since this is Range0's most likely use case. At 180 MB/s, it is almost as fast as standard Huffman coders, which basically means extra compression performance for free.

The bad point is, and will remain, the decoding speed, which cannot beat Huffman due to the presence of a division in the main loop. However, the decoding speed is not too bad for the specific "squeezed" situation Range0 is meant for. Indeed, should the distribution become even more squeezed, decoding speed would become even better.
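
For reference, here is where that division comes from. This is a textbook range-decoder step, not Range0's actual code; 'total' stands for the sum of symbol frequencies in the current block:

    /* Textbook range-decoder step (illustration, not Range0's code).
     * The divisions below sit in the per-symbol decoding loop, whereas a
     * table-driven Huffman decoder only needs shifts and table lookups. */
    #include <stdint.h>

    typedef struct {
        uint32_t low;     /* bottom of the current interval   */
        uint32_t range;   /* width of the current interval    */
        uint32_t code;    /* bits read so far from the stream */
    } range_decoder_t;

    /* returns a value in [0, total), looked up in the cumulative
     * frequency table to identify the next symbol */
    static uint32_t rd_get_freq(range_decoder_t* rd, uint32_t total)
    {
        rd->range /= total;                        /* division #1 */
        return (rd->code - rd->low) / rd->range;   /* division #2 */
    }

    /* once the symbol is known, narrow the interval to it */
    static void rd_update(range_decoder_t* rd, uint32_t cumFreq, uint32_t freq)
    {
        rd->low   += cumFreq * rd->range;
        rd->range *= freq;
        /* renormalization (refilling 'code' when 'range' gets small) omitted */
    }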

Evolving the benchmark corpus in order to consider LZ4HC output, instead of Fast LZ4, is likely to make this statement more relevant.

For the time being, Range0 can be downloaded here.

Sunday, January 2, 2011

Huff0 : New version (v0.6)

While working on the stats produced by Huff0, some inefficiencies became apparent, and as a consequence got solved in a newer version, available on the new Huff0 homepage.

To properly understand what was changed, here is a performance snapshot, comparing v0.5 (old) with v0.6 (new).

                     Huff0 v0.5                      Huff0 v0.6
                     R       C (MB/s)  D (MB/s)      R       C (MB/s)  D (MB/s)
Not compressible
  enwik8.7z          1.000   740       1400          1.000   810*      1930*
Hardly compressible
  audio1             1.070   285       280           1.070   285       280
Distributed
  enwik8-lz4-lit     1.363   210       200           1.363   210       200
Lightly ordered
  enwik8-lz4-offh    1.188   188       181           1.188   188       181
Ordered
  enwik8-lz4-ml      2.915   200       197           2.915   210*      197
Squeezed
  enwik8-lz4-run     4.284   209       218           4.284   235*      218
Ultra compressible
  loong              278.0   450       2930          278.0   785*      2930

Changed figures are marked with an asterisk (*). R is the compression ratio, C the compression speed, D the decoding speed.

To begin with, only speed is affected by the changes. Compression ratio remains strictly identical.

The first change is in the way data is scanned. This is a nifty trick, related to cache behavior: when a value is updated, it should not be updated again immediately, otherwise there is a "cache" penalty, since the previous change must be fully committed before the new one gets accepted.
In order to avoid this penalty, we interleave the updates, so that the CPU gets enough time to deal with repeated changes to the same counter. It makes the code slightly more complex and the data structure a bit bigger, but it is definitely worth it: although it slightly hurts situations such as "not compressible", the more frequently a value repeats, the bigger the benefit.
We end up with a grand +70% improvement for the "ultra compressible" corner case. And not only corner cases benefit: the "ordered" distribution gets a nice +5% boost, while the squeezed one gets a bit more than +10%.
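
To illustrate the idea (a simplified sketch, not Huff0's actual code), here is a byte-frequency counter using four interleaved partial tables, so that a run of identical bytes never hammers the same memory cell on consecutive iterations:

    /* Interleaved byte counting (illustration). Four consecutive bytes
     * update four different tables, so even a long run of the same byte
     * avoids back-to-back read-modify-write on a single counter. */
    #include <stdint.h>
    #include <stddef.h>

    void count_bytes(const uint8_t* src, size_t size, uint32_t count[256])
    {
        uint32_t c[4][256] = {{0}};   /* 4 partial tables : bigger structure, fewer stalls */
        size_t i;

        for (i = 0; i + 4 <= size; i += 4) {
            c[0][src[i+0]]++;
            c[1][src[i+1]]++;
            c[2][src[i+2]]++;
            c[3][src[i+3]]++;
        }
        for (; i < size; i++) c[0][src[i]]++;   /* trailing bytes */

        for (int s = 0; s < 256; s++)
            count[s] = c[0][s] + c[1][s] + c[2][s] + c[3][s];
    }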

The second change is in the way incompressible data is detected. While the end result is still slower than Range0, there is a pretty nice performance boost from betting earlier on the incompressible nature of the data just scanned. It avoids triggering the later stages of the algorithm, thus saving time and energy. This is a nice +10% boost.
While this gain is only visible on the "not compressible" example, it still brings real-world benefits. It is not uncommon for some parts of large files to be incompressible; for example, an ISO file may contain a few pictures in compressed formats. In such cases, the better the detection, the faster the compression, since any attempt to compress such data is likely to fail or end up with miserable gains.

The third change is really straightforward: I removed my hand-made "copy" operation and swapped it for the standard memcpy() call from the C library.
Although it only matters when some segments are incompressible, in that case the gain is really noticeable, at about +30%. For just a simple function call, this is really worth it.

I still have to understand why the call to memcpy() is faster than my own simple loop. There are probably some very interesting learnings behind it.

Side comment: I tried the same trick on the "ultra compressible" file, replacing my own simple loop with a memset(). It resulted in no speed difference. I then tried to improve the simple loop with more complex parallel execution, but it resulted in a loss.
My only explanation so far is that the compiler probably translates my simple loop into a memset() transparently.

Anyway, please feel free to test it.
The binary can be downloaded from the Huff0 homepage.

Saturday, January 1, 2011

Quick look back at Huff0 : an entropy coder analysis

Since LZ4 is pretty limited for testing match searches over long distances, I'm almost pushed into writing new compression software to continue my investigations.

Therefore, since such an effort is required anyway, it is also a good time to take a fresh look at entropy coders.

Entropy coding is generally not used as a direct compression method. It works best *after* modeling.
For example, after an LZ77 compression pass, we end up with many match lengths, whose distribution is not random: short lengths are much more common than longer ones.
An entropy coder will ensure that frequent symbols get coded using fewer bits than uncommon ones. That's the basis for better compression results.

The most renowned entropy coder is probably Huffman's. It is no longer considered top of the line, but it continues to be used in many mainstream applications, due to its simplicity and speed.
Huffman coding guarantees that, given a symbol distribution, the best possible combination of whole bits per symbol will be created. That's a strong property.

Since Huffman proved that his method was the best possible, entropy coders received little attention until a few decades later, when arithmetic coders started to be implemented.
This new beast is a lot more complex to explain, so I'll invite you to read the Wikipedia entry for more details. For the purpose of this post, it is enough to accept that arithmetic coders can attribute the equivalent of a fractional number of bits to a symbol. Yes, it becomes possible to encode a symbol using, for example, 0.38 bits.
This may sound strange, but it's all mathematics.

By breaking the "integer" boundary, arithmetic coders allow for much better precision, and therefore better compression. This is especially important when a symbol's probability is higher than 50%: Huffman cannot do better than spending 1 bit per symbol, which is just right for 50%, but far from optimal for 80%. In such instances, an arithmetic coder does much better.
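
To put a number on that 80% case (a back-of-the-envelope calculation, not a measurement): for a binary source with probabilities 0.8 and 0.2, the Shannon limit is

    H = -0.8\log_2(0.8) - 0.2\log_2(0.2) \approx 0.72 \;\text{bits per symbol}

while a Huffman code has to spend a full bit on every symbol, nearly 40% above that limit. An arithmetic or range coder can get arbitrarily close to the 0.72 figure.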

A lesser version of the arithmetic coder is called a range coder. It's pretty much the same thing, except that it works with integers and, more importantly, is free from patents, unlike arithmetic ones.

In order to test both methods, I've created two programs, Huff0 and Range0.

Huff0 is a block-based Huffman entropy coder. Input data is divided into blocks of 16KB. The compressed output consists of a tree description, followed by the compressed entries.
Smaller blocks allow faster adaptation, but cost more in headers, so a trade-off had to be chosen. 16KB was selected as being mostly optimal for distributed input, such as literals, or most files.

Range0 is a block-based range entropy coder. Input data is divided into blocks of 128KB. The compressed output consists of a frequency table, followed by the compressed entries.
Why 128KB? Because the frequency table is much heavier than the Huffman tree. As a consequence, dividing data into smaller blocks results in a net loss, in spite of better adaptation.

Such a trade-off between adaptation speed and header cost deserves a full study in itself, and we'll keep that for another post.

In order to better understand the properties of each coder, I needed a benchmark corpus. I started by using plain files, but that makes little sense. As stated earlier, entropy coders are not used as direct compression tools, but rather as a "second stage" pass, applied to the statistics or probabilities output by a model.
So it seemed more reasonable to create files which mimic the statistics produced by such models. Here is what I ended up with:

                     Huff0 v0.5                      Range0 v0.6
                     %       C         D             %       C         D
Not compressible
  enwik8.7z          100.01  740       1400          100.00  870       1400
Hardly compressible
  audio1             93.43   285       280           93.38   174       83
Distributed
  enwik8-lz4-lit     73.35   210       200           73.00   155       76
Lightly ordered
  enwik8-lz4-offh    84.18   188       181           83.95   138       76
Ordered
  enwik8-lz4-ml      34.30   200       197           33.94   155       83
Squeezed
  enwik8-lz4-run     23.34   209       218           21.85   163       116
Ultra compressible
  loong              0.36    450       2930          0.07    362       427

That's a lot of results for a single post. But there is a lot to learn from it.
The enwik8-lz4-* files are created by running LZ4 on enwik8 and extracting a specific field. This is the fast LZ4 version, not the HC one, so these results could change in the future.
% is the resulting size, as % of original file.
C is the compression speed, in MB/s.
D is the decoding speed, in MB/s.

Not compressible: as stated, this 7z entry is already compressed, so there is nothing left to gain.
The basic idea here is to detect this situation fast enough, and to switch to a fallback copy mode.
The catch is that this mode should not be triggered too easily, otherwise we may lose some compression.

What immediately looks strange here is that Huff0 is slower than Range0. There is probably still some room for improvement.

Hardly compressible: this is a good torture test for the compressibility detector. This entry is a wave file, 16-bit stereo. As a consequence, when handled directly by an entropy coder, half the data is purely random (the lower bits), while the other half has a slight tendency to lean towards an average value. All in all, the compression that can be achieved is less than stellar, but still measurable.
We are starting to witness a tendency that will only grow stronger: Range0 compresses better, but more slowly. In particular, the decoding speed of Range0 is an issue when compared with Huff0.

I will skip directly to
Squeezed: in this file, one of the symbols has a probability higher than 50%. And it pays off quite a lot for Range0. This is no longer a small 0.2% benefit, but a more substantial 1.5% advantage. And this figure can only get better with the HC version of LZ4, which produces even more squeezed output.

Ultra compressible: this is a test file, produced by Eugene Shelwien, to test limit conditions when compressing long streams of identical bytes. It is not a single repeated byte though; there are some changes from time to time, to make life harder for the coder.
What should immediately be visible is that Huff0 succeeds in compressing this file far beyond 1/8 of its original size, the theoretical minimum for a pure Huffman code. This is thanks to the relatively small block size: when a block consists of only one repeated byte, it gets a specific shortcut sequence. The decoding speed shows this is highly effective.
This contrasts with Range0, whose decoding speed is *only* 427 MB/s. This is due to the larger block size, which means a full block rarely consists of only one byte; as a consequence, it must be fully decoded, which is slower than the special escape sequence.
Note, however, that as could be expected, Range0's compression is much better than Huff0's on this example.

So, where does that lead us?
Well, this is just a start. It provides some interesting ground for future comparisons, allowing targeted improvements to become measurable.

More of it in a future post...