FARI is an arithmetic compressor with extremely high compression/decompression speeds. It uses an order-17 adaptive bitwise model (rate = 1/16) and compresses at over 20 MB/s on my laptop. The current release version is 0.3, as I feel that an arithmetic coder with slower decompression than compression cannot be considered a final version. Benchmarks are as follows:
Code:
ENWIK8 : 41,027,671 bytes ( 4.3 seconds encode/ 6.0 seconds decode)
ENWIK9 : 390,794,122 bytes (43.1 seconds encode/60.0 seconds decode)
Run on a 2.20GHz mobile i7 (8 cores) with 8GB of RAM
Tested executable is FARI-x64-AVX.exe
Can you say how AVX is employed in your program? Using SIMD for compression has become interesting to me in recent weeks.
I did no work in assembler or manual optimization, so the only AVX present in the provided executable was generated by allowing MSVC2013 to use AVX instructions.
[removed...]
Quote Originally Posted by Bulat Ziganshin: ...Using SIMD for compression has become interesting to me in recent weeks
I'm going off topic slightly here: it's useful for a few cases, such as fast string searches (load 16 characters; if they don't match, use a move mask to tell you the first non-matching position) and PAQ-style counters (i.e. compares become masks for operations, and you can update 8-16 counters at once).
But in my opinion, it is not very useful around load and store operations. For example, SIMD for 8 PAQ counters was only 20% faster than the incremental version because of the cost of loading from distant memory locations.
On the other hand, you can do summing and rescaling extremely efficiently, such as cumulative counts for arithmetic coding or rescaling totals down. Just my 2 cents.
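As a concrete illustration of the move-mask trick above, here is a minimal SSE2 sketch (my own illustrative code, not taken from any codec in this thread; first_mismatch16 is a made-up helper name, and __builtin_ctz assumes a GCC/Clang-style compiler):
Code:
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>

/* Return the index of the first byte at which two 16-byte blocks differ,
   or 16 if all bytes match. */
static int first_mismatch16(const uint8_t *a, const uint8_t *b)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i eq = _mm_cmpeq_epi8(va, vb);                 /* equal bytes -> 0xFF */
    unsigned mask = (unsigned)_mm_movemask_epi8(eq) ^ 0xFFFFu; /* set bits mark mismatches */
    if (mask == 0)
        return 16;                   /* all 16 bytes match */
    return __builtin_ctz(mask);      /* lowest set bit = first mismatching position */
}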
Bulat,
Have you seen Jan Wassenberg's work from the Fraunhofer Institute on vectorized lossless image compression?
http://onlinelibrary.wiley.com/doi/1....1109/abstract
"This report introduces a new lossless asymmetric single instruction multiple data codec designed for extremely efficient decompression of large satellite images. A throughput in excess of 3GB/s allows decompression to proceed in parallel with asynchronous transfers from fast block devices such as disk arrays. This is made possible by a simple and fast single instruction multiple data entropy coder that removes leading null bits. Our main contribution is a new approach for vectorized prediction and encoding. Unlike previous approaches that treat the entropy coder as a black box, we account for its properties in the design of the predictor. The resulting compressed stream is 1.2 to 1.5 times as large as JPEG-2000, but can be decompressed 100 times as quickly – even faster than copying uncompressed data in memory. Applications include streaming decompression for out of core visualization. To the best of our knowledge, this is the first entirely vectorized algorithm for lossless compression."
It looked interesting to me, and apparently he's got something better now---with very fast compression as well as decompression---that he's not actively doing anything with right now for boring corporate reasons. I'm obviously not competent to judge this sort of thing, though... (You seem to be.)
BTW, as I understand it, Wassenberg's entropy coder is very crude compared to most of the stuff discussed in this thread, but very fast. It wouldn't be a competitor for the small-fraction-of-a-bit stuff discussed here, but I thought it was interesting under the general heading of "using SIMD for compression" as Bulat mentioned.
Quote Originally Posted by david_werecat: FARI is an arithmetic compressor with extremely high compression/decompression speeds. It uses an order-17 adaptive bitwise model (rate = 1/16) and compresses at over 20 MB/s on my laptop. The current release version is 0.3, as I feel that an arithmetic coder with slower decompression than compression cannot be considered a final version. Benchmarks are as follows:
Code:
ENWIK8 : 41,027,671 bytes ( 4.3 seconds encode/ 6.0 seconds decode)
ENWIK9 : 390,794,122 bytes (43.1 seconds encode/60.0 seconds decode)
Run on a 2.20GHz mobile i7 (8 cores) with 8GB of RAM
Tested executable is FARI-x64-AVX.exe
An encoding with faster compression than decompression could be useful in some circumstances, e.g. backups.
Is there a reason why modeling and arithmetic coding tend to be combined? It seems like, logically, the actual arithmetic coder could be separated from the model.
The model and the arithmetic coder are combined because it gives greater speed. Updating the model within the same routine also allows further optimizations, such as eliminating a branch.
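To make that concrete, here is a minimal sketch of one encoder step with the model update fused into the coding routine. This is my own illustrative code in the common lpaq/zpaq style, not FARI's actual source; the 16-bit probability and the shift by 4 correspond to the rate = 1/16 adaptation mentioned in the first post, and the final flush is omitted.
Code:
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t x1, x2;   /* coding interval [x1, x2], initially 0 and 0xFFFFFFFF */
    FILE *out;
} encoder_t;

/* Encode one bit under the 16-bit probability *p that the bit is 1,
   and adapt *p within the same routine (shift 4 => rate 1/16). */
static void encode_bit(encoder_t *e, uint16_t *p, int bit)
{
    /* Split the interval in proportion to the predicted probability. */
    uint32_t xmid = e->x1 + (uint32_t)(((uint64_t)(e->x2 - e->x1) * *p) >> 16);
    if (bit) {
        e->x2 = xmid;
        *p += (0xFFFF - *p) >> 4;   /* move prediction toward 1 */
    } else {
        e->x1 = xmid + 1;
        *p -= *p >> 4;              /* move prediction toward 0 */
    }
    /* Emit bytes while the leading bytes of x1 and x2 agree. */
    while (((e->x1 ^ e->x2) & 0xFF000000u) == 0) {
        putc(e->x2 >> 24, e->out);
        e->x1 <<= 8;
        e->x2 = (e->x2 << 8) | 0xFF;
    }
}
Since the range update and the counter update share the same bit test, they can be folded under a single branch (or a mask derived from it), which is presumably the kind of branch elimination referred to above.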
MTARI is a blockwise multithreaded version of FARI. A benchmark on enwik8 gives the following:
Code:
ENWIK8 : 41,655,528 bytes ( 1.09 seconds encode/ 1.34 seconds decode)
Run on a 2.20GHz mobile i7 (8 cores) with 8GB of RAM
Tested executable is MTARI-x64.exe
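For reference, a blockwise OpenMP wrapper of this general shape could look roughly like the sketch below. It is illustrative only: the fa_compress prototype is a guess (the real FastARI signature may differ), compress_blocks and BLOCK_SLACK are made-up names, and the 0x400000 block size is taken from the buffer size mentioned later in the thread. Without -fopenmp the pragma is ignored and the loop simply runs single-threaded.
Code:
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE  0x400000   /* 4 MiB blocks */
#define BLOCK_SLACK 1024       /* assumed margin for incompressible blocks */

/* Hypothetical single-block prototype; the real FastARI API may differ. */
size_t fa_compress(const uint8_t *in, uint8_t *out, size_t len);

/* Compress the input as independent blocks so they can run on separate threads.
   out must hold nblocks * (BLOCK_SIZE + BLOCK_SLACK) bytes;
   out_sizes[i] receives the compressed size of block i. */
void compress_blocks(const uint8_t *in, size_t in_len,
                     uint8_t *out, size_t *out_sizes)
{
    int i, nblocks = (int)((in_len + BLOCK_SIZE - 1) / BLOCK_SIZE);
    #pragma omp parallel for schedule(dynamic)
    for (i = 0; i < nblocks; i++) {
        size_t off = (size_t)i * BLOCK_SIZE;
        size_t len = in_len - off < BLOCK_SIZE ? in_len - off : BLOCK_SIZE;
        out_sizes[i] = fa_compress(in + off,
                                   out + (size_t)i * (BLOCK_SIZE + BLOCK_SLACK),
                                   len);
    }
}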
I tested mtari on LTCB and Silesia.
http://mattmahoney.net/dc/text.html#3972
http://mattmahoney.net/dc/silesia.html
The supplied executables couldn't find msvcr120.dll, so I recompiled with MinGW gcc 4.8.0 -O2 -fopenmp. Without -fopenmp it runs single threaded but otherwise the same.
Thanks for testing it! Maybe I should start making static compiles so that recompiling isn't necessary for anyone who doesn't have the latest runtime. It seems a little silly for Microsoft to keep releasing newly named runtime DLLs when there are no breaking changes between versions. Keeping the same name would be easier and would avoid the redundancy of installing multiple versions of the same runtime or bundling part of it with every released application.
I have to release zpaq with static compiles using g++. I didn't use to have to do this.
With MinGW and GCC 4.8.1 I am getting error messages when trying to compile MTARI:
Code:
C:\MinGW\bin>gcc mtari.c -O2 -fopenmp
C:\Users\squeeze\AppData\Local\Temp\ccRn1E4w.o:MTARI.c:(.text+0x75): undefined reference to `fa_compress'
C:\Users\squeeze\AppData\Local\Temp\ccRn1E4w.o:MTARI.c:(.text+0x116): undefined reference to `fa_decompress'
C:\Users\squeeze\AppData\Local\Temp\ccRn1E4w.o:MTARI.c:(.text+0x163): undefined reference to `fa_compress'
C:\Users\squeeze\AppData\Local\Temp\ccRn1E4w.o:MTARI.c:(.text+0x1a3): undefined reference to `fa_decompress'
collect2.exe: error: ld returned 1 exit status
The proper command line to compile is: gcc MTARI.c FastARI.c -O2 -fopenmp
using this command line to compile: gcc MTARI.c FastARI.c -O2 -fopenmp
and then using this command line to compress: E:\TESTSETS>timer mtari c app.tar app.mtari
leads to:
Exit code: -1073741819
Kernel Time = 0.125 = 1%
User Time = 4.656 = 73%
Process Time = 4.781 = 75%
Global Time = 6.319 = 100%
and using: E:\TESTSETS>timer mtari c app.tar app.mtari 4
keeps mtari.exe busy for hours. The archive size grows, but really slowly.
Here's a new version that will hopefully fix the issue:
On my machine, the issue still exists with this version.
I just put together a quick plugin for Squash but I had to modify fa_decompress a bit in order to make sure not to attempt to decompress memory outside of the input buffer, and allow passing a larger-than-necessary output buffer (in case the decompressed size is not known). The version I used is at http://pastebin.com/vedsvCST—it's probably possible to do it a bit cleaner, but I was going for relatively non-invasive changes. I don't suppose you're interested in carrying that? I'd love to not have to maintain the patch :)
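As a rough illustration of that kind of bounds-checking change, a minimal sketch (my own code, not the actual patch from the pastebin link; byte_reader and read_byte_checked are hypothetical names) would route every read of the compressed stream through a helper like this:
Code:
#include <stddef.h>

typedef struct {
    const unsigned char *in;
    size_t pos, len;
} byte_reader;

/* Refuse to read past the end of the input buffer; feed zeros instead
   of touching memory outside it. */
static int read_byte_checked(byte_reader *r)
{
    if (r->pos >= r->len)
        return 0;
    return r->in[r->pos++];
}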
Also, do you plan on publishing a git repository? I'm trying to use git submodules for Squash and if there is going to be an official repository I don't want to create my own.
@Stephan Busch
Would it be possible to upload the file in question for testing purposes?
@nemequ
Thanks for the patch; I've included it in the project as the default decompression method while keeping the old method as an "unsafe" version. The project is now on github at https://github.com/davidcatt/FastARI. As for decompressing without a known decompressed length, this will most likely put a small amount of garbage at the end of the decompressed data since the codec doesn't have an EOF mechanism (rather, it uses block sizes as a truncation method). I may implement an EOF mechanism in the future pending speed tests.
edit: I've implemented an EOF enabled version with functions fae_compress and fae_decompress.
It's the App corpus (http://www.compressionratings.com/fi...zechart_app.7z) put into a single .TAR file.
Thanks for the test data. I haven't encountered an issue with it on my laptop, so I'm wondering if GCC handles OpenMP differently than MSVC. Would you mind testing with this executable?
Would you please also provide a file called vcomp120.dll? I could not find it anywhere, but it seems to be needed in order to run that executable.
Sorry about that; it appears that static linking doesn't include the OpenMP library. Here's the DLL:
Thank you. First tests indicate that this version runs on my system.
Quote Originally Posted by david_werecat: The project is now on github at https://github.com/davidcatt/FastARI. As for decompressing without a known decompressed length, this will most likely put a small amount of garbage at the end of the decompressed data since the codec doesn't have an EOF mechanism (rather, it uses block sizes as a truncation method). I may implement an EOF mechanism in the future pending speed tests. edit: I've implemented an EOF enabled version with functions fae_compress and fae_decompress.
Sounds great, thanks! Unfortunately fae_decompress is failing my unit tests. I've split out a test case and added it to the issue tracker on your github repository. Once that gets resolved I'll go ahead and push the plugin and start benchmarking—I'll post again once I have the results.
It seems unlikely that someone would be interested in all three versions at the same time—maybe it would be better to just use some preprocessor defines to switch between them so you can avoid code duplication? It's not an issue from my point of view, but if I were in your shoes I know I wouldn't want to maintain three versions of basically the same thing.
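A minimal sketch of that single-source approach (macro and function names are hypothetical, not the real FastARI layout) could look like this:
Code:
#include <stddef.h>

/* Pick the entry-point names and optional checks per build variant. */
#if defined(FA_USE_EOF)
#  define FA_FN(name) fae_##name          /* EOF-aware variant */
#elif defined(FA_UNSAFE)
#  define FA_FN(name) fa_##name##_unsafe  /* original fast path */
#else
#  define FA_FN(name) fa_##name           /* default bounds-checked variant */
#endif

/* One shared body to maintain: the bounds check is compiled out
   in the unsafe build. */
static int FA_FN(read_byte)(const unsigned char *in, size_t *pos, size_t len)
{
#if !defined(FA_UNSAFE)
    if (*pos >= len)
        return -1;    /* ran out of input */
#endif
    return in[(*pos)++];
}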
I've fixed the issue; it turns out it was hitting the end of the input buffer because the encoder wasn't flushing enough bytes. I'll also most likely add preprocessor defines in a future commit after reviewing each version of the functions.
I've compiled static versions with MSYS2 (with pacman) / gcc 4.8.2 threads-posix, (64-bit: seh) / (32-bit: dwarf2).
-generic :
Code:
gcc MTARI.c FastARI.c -O3 -fopenmp -v -static
-corei7-avx :
Code:
gcc MTARI.c FastARI.c -O3 -v -march=corei7-avx -mtune=corei7-avx -fopenmp -static
Checksums :
File;crc32;md5;sha1;sha2 (256)
Code:
MTARI-0.2-x86-64.zip;7D5DE6B2;E60CCCA055B088A5A7F3AE00E5F112C5;130A9084A981ED3D4645CE6089047272B92C77F0;E3C37AC9F41363F03026641F0572B854EF9B6FFEFD8A59AD9D0B70B9F4BF3A63
MTARI-0.2-x86-64-corei7-avx.exe;38425D1D;6D248640CBACB946CF3FA9B42A0FD733;8DCCDC2F0F182F29D66DDA3764311169AE2DF21F;C9625365E959CF7AD05EEF69BAF308AFD4CEE62779005244B87E1D6CE3587309
MTARI-0.2-x86-64-generic.exe;1A44EA81;62CE5F253B6D1DDA1A0DE1970F3D796F;B06E031773B5B1390249B94CA9B18FC9B27A5886;D16F011A5DA0678EB8A5B18CCDFC5C420DD0BB479BB76D1DD59E138611A0B9DF
Code:
MTARI-0.2-i686.zip;84ABBDB6;89EE530B8014A6775362491FECBA5667;0662947124795FE9F3E50C1A88F896E422113646;7230C9819102D1F23E7C5E6A37D86D65217687792C9161B414BD2ACF9713A34E
MTARI-0.2-i686-corei7-avx.exe;CC7C8720;CBECC0FB579A2DE9BA0624E684C46A41;F387C4E6F0CC2A1072CA7E7B8E39EAAE7B160FAD;BE89083D2E2BDAB8794ECFE99E705C0CD9FC67CAC62A81C7277852207FBB448E
MTARI-0.2-i686-generic.exe;BE683070;BBA7502D5F3408FA4DA3FDAFDF9468B8;B13A0F7178965FC5B133ACCD0029D1FDC0EE61CD;02972A320DEF9064E9B7A4B927CBBDB5360C93A6583BD296791268B318D3CD96
Bit late to this party, but the GCC version is not doing well:
Starting from offset 003FFFFE (4M-2) things go wrong. The MSVC build is OK. I tried several files, all with the same result.
Comparing files dec.tar.srep and DEC.TAR.SREP2
003FFFFE: FF 5C
003FFFFF: FF 00
00400000: 5C FF
00400001: 00 FF
00400002: FF 60
00400004: 60 C4
00400005: FF FE
00400006: C4 00
00400007: FE F6
00400008: 00 40
0040000A: 40 80
0040000C: 80 9C
0040000D: F6 11
Edit: to be more specific, it's the corei7-avx build; I haven't tested the others, but I suspect all GCC builds are flawed.
There is nothing wrong with GCC. Actually, any build created under default conditions is flawed, no matter which compiler is used. The only way to create a correct build is to compile with the FA_USE_EOF definition. Otherwise you'll get either a different size of the decoded file or differences between the decoded and original file.
It's simply explained: MTari's buffer size is 0x400000, which is why things start to go wrong at offset 003FFFFE.
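For example, with the gcc command line given earlier in the thread, that would presumably be:
Code:
gcc MTARI.c FastARI.c -O2 -fopenmp -DFA_USE_EOF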
Here is a simple benchmark performed on WinXP x64 SP2, i7-2700K@4800, ramdisk, 4 threads for encoding/decoding; the test file is 4 000 000 000 bytes.
MSVC and GCC compiles are taken from this thread; the IC compiles were created by me.
Although MSVC performs really well for MTari, there is no way to create completely static OpenMP builds with MSVC.
Code:
IC11 x86           35.437s  39.593s
MSVC x86           37.140s  44.515s
GCC x86-generic    44.187s  55.703s
IC11 x64           34.375s  42.031s
MSVC x64-no-msvcr  30.453s  38.093s
MSVC x64           32.328s  36.781s
GCC x64-generic    36.500s  42.062s
Static IC binaries are below.