Hey all, I took a year-and-a-half break from doing compression stuff, but I recently started messing around with CM + LZP again. Attached is MCM 64-bit.
The current version has 2 modes, -fast and -high.
-fast is a 6-context CM + LZP
-high is an 8-context CM + LZP (the default)
The LZP is very basic; it just recycles the same bit processing as the normal CM. It uses 256 + expected char as the order-0 XOR context instead of the normal 8-bit order-0 context.
If the match model has an expected char and the bit is a non-match, it uses this info through SSE to help reduce the cost of non-matching chars. When there is no match model match, it just codes a normal byte without processing an LZP bit.
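For readers who want to picture that flow, here is a minimal sketch of how an LZP match bit can be folded into a bitwise CM coder. It is not the MCM source: Coder, MatchModel and ByteModel are hypothetical stand-ins for the real coder and models.
Code:
#include <cstdint>

// Sketch only, not the actual MCM code.
struct Coder      { void encodeBit(uint32_t ctx, int bit) { /* arithmetic-code `bit` under `ctx` */ } };
struct MatchModel { bool hasMatch() const { return false; } uint8_t expectedChar() const { return 0; } };
struct ByteModel  { void encodeLiteral(Coder&, uint8_t byte, int sseHint) { /* normal CM byte coding */ } };

void encodeByteWithLZP(uint8_t actual, MatchModel& mm, ByteModel& bm, Coder& coder) {
  if (mm.hasMatch()) {
    const uint8_t expected = mm.expectedChar();
    // Context 256 + expected char keeps the LZP bit's statistics separate
    // from the normal 8-bit order-0 contexts (0..255).
    const uint32_t lzpCtx = 256u + expected;
    const bool match = (actual == expected);
    coder.encodeBit(lzpCtx, match ? 1 : 0);
    if (!match) {
      // Mismatch: code the literal as usual, passing the expected char so
      // SSE can discount bytes that were "almost" predicted.
      bm.encodeLiteral(coder, actual, expected);
    }
  } else {
    // No match model prediction: plain literal, no LZP bit at all.
    bm.encodeLiteral(coder, actual, -1);
  }
}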
Speed should be similar to mcm03, or a bit slower (for now). I wonder how ZCM has so much speed?
Changes:
Random tweaking, added SSE for helping the LZP.
Added -10 and -11 for more memory, which use around 2.8 GB / 5.5 GB (64-bit builds only).
Huffman disabled.
E8E9 filter (always on lol, works OK for most files unless there are lots of false positives; see the sketch after this list).
64-bit support
Linux support (./make.sh on Ubuntu should compile).
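Since the E8E9 filter above is always on, here is a hedged sketch of the classic E8/E9 call transform for anyone unfamiliar with it; this is the textbook version, not necessarily MCM's exact filter.
Code:
#include <cstdint>
#include <cstring>
#include <cstddef>

// Classic E8/E9 transform (sketch): x86 CALL (0xE8) and JMP (0xE9) are
// followed by a 32-bit relative offset. Rewriting it as an absolute target
// makes repeated calls to the same function byte-identical, which the CM
// picks up easily. On non-x86 data, stray 0xE8/0xE9 bytes are "false
// positives": the transform stays reversible, just not helpful.
void e8e9Forward(uint8_t* buf, size_t n) {
  for (size_t i = 0; i + 5 <= n; ++i) {
    if (buf[i] == 0xE8 || buf[i] == 0xE9) {
      int32_t rel;
      std::memcpy(&rel, buf + i + 1, 4);                   // little-endian relative offset
      int32_t target = rel + static_cast<int32_t>(i + 5);  // relative -> absolute
      std::memcpy(buf + i + 1, &target, 4);
      i += 4;                                              // skip the operand bytes
    }
  }
}
// The decoder does the inverse: target - (i + 5) restores the relative offset.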
Sample results with -9:
sfc: 10,382,334
enwik9: 160,663,427
enwik8: 19,505,098
silesia: 41,025,259
Source code is updated on my github:
https://github.com/mathieuchartier/mcm
I think there are probably more ways to improve the LZP by having fewer models / more specialized models for the LZP predictor bit.
Updated silesia and LTCB benchmarks.
http://mattmahoney.net/dc/silesia.html
http://mattmahoney.net/dc/text.html (went from 20'th to 12'th place :) )
Great to see this and to see you back. Don't take any more breaks from compression! :)
Almost ready to release 0.81, which has a nice improvement for exe / binary and a slight improvement on text. One thing I keep wondering about is how to improve speed without hurting compression much.
So far, my ideas for increasing speed are:
Faster LZP match bit prediction (hard to do without hurting compression).
Re-enable Huffman preprocessing. However, the Huffman benefits are mitigated by LZP, and static Huffman doesn't really work well for large files.
Multi-threading: usually hurts compression and is a lot of work to implement.
Re-enable prefetching, probably only helps a little bit.
Use dictionary transform instead of LZP for text (hard to implement).
Anyone else have ideas?
Mat,
I don't have any specific ideas for you. But I would say that the speed of 0.80 is very good already; it is close in speed to CCMX, and in many cases MCM produces a better ratio at level -9 than CCMX at level 7.
Perhaps your goals for MCM are more focused on improving speed, but improving compression while keeping the current speed seems like a good idea to me.
Perhaps another option would be to incorporate other filters? CCMX excels over MCM in certain cases because of its specific data filters. You mentioned EXE and text filters. What about a BMP or WAV filter? But then again, perhaps improving the EXE and text filters beyond the 0.81 improvements is all that is necessary.
Sorry I cannot be of more help...I am not a scientist or a programmer.
Good luck and keep working on it! :)
If you implement wildcards and recursion, you could let MCM switch off models that would give no gain on certain files
while switching on those that would help: for example, an audio model for every WAV/AIFF, a PGM model for PGM, a PPM model for PPM images.
In those cases the text model can be switched off, while on texts we would need the text model but not the models for audio, PGM, or PPM.
In my test cases, Huffman did not bring improvements.
If I remember right, programs like UHARC and CCMX tried the delta filter every few KB and checked whether it gave a gain.
If yes, it was enabled; if not, it was left off (a small sketch of that idea follows below).
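As a rough illustration of that block-wise trial, here is a minimal sketch that is not the UHARC/CCMX code: it uses a plain order-0 entropy estimate as a cheap stand-in for what the real coder would spend, and a trivial byte-wise delta.
Code:
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Estimate coded size in bits with a plain order-0 model (cheap proxy).
static double order0Bits(const uint8_t* p, size_t n) {
  size_t freq[256] = {0};
  for (size_t i = 0; i < n; ++i) ++freq[p[i]];
  double bits = 0.0;
  for (size_t s = 0; s < 256; ++s)
    if (freq[s]) bits -= freq[s] * std::log2(double(freq[s]) / double(n));
  return bits;
}

// Decide per block (e.g. every few KB) whether a byte-wise delta looks cheaper.
static bool deltaHelps(const uint8_t* block, size_t n) {
  std::vector<uint8_t> d(n);
  uint8_t prev = 0;
  for (size_t i = 0; i < n; ++i) { d[i] = uint8_t(block[i] - prev); prev = block[i]; }
  return order0Bits(d.data(), n) < order0Bits(block, n);
}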
I think that dictionary transform would be much better for text than LZP. I like this idea.
I would also vote for a light Huffman recompressor.
mcm v0.81:
Significant improvements to binary models, slight improvements to text model. Added prefetch to match model and a few other things for speed tuning.
Added -turbo (3 contexts), -fast (4 contexts), and -max (10 contexts)
Reference mcm v0.8:
SFC -9:
-high 10,382,334 in 38.198s
Silesia -9:
-high 41,025,259 in 143.9s
New mcm v0.81:
SFC -9:
-turbo (3 ctx): 10,733,617 in 21.629s
-fast (4 ctx): 10,555,083 in 26.109s
-mid (6 ctx): 10,313,325 in 31.408s
-high (8 ctx): 10,251,772 in 35.866s
-max (10 ctx): 10,208,213 in 38.358s
Silesia -9:
-fast: 42,200,550 in 91.2s
-mid: 40,632,970 in 116.956s
-high: 40,330,795 in 136.879s
-max: 40,151,163 in 148.767s
Turns out testing whether a bit is set is fast enough to do a few times per byte.
I added an easy way to programmatically enable / disable models. Example of current models for binary profile:
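// inputs = number of contexts for the selected level (-turbo, -fast, ...); presumably only the first `inputs` models below get enabled.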
if (inputs > idx++) enableModel(kModelOrder1);
if (inputs > idx++) enableModel(kModelOrder2);
if (inputs > idx++) enableModel(kModelSparse34);
if (inputs > idx++) enableModel(kModelOrder4);
if (inputs > idx++) enableModel(kModelSparse23);
if (inputs > idx++) enableModel(kModelMask);
if (inputs > idx++) enableModel(kModelSparse4);
if (inputs > idx++) enableModel(kModelOrder3);
if (inputs > idx++) enableModel(kModelSparse2);
if (inputs > idx++) enableModel(kModelSparse3);
I made a branch for v0.81, so that when I do incremental development, the source is still visible.
https://github.com/mathieuchartier/mcm/tree/v0.81
Mat,
Great improvement in compression!
HFCB's vm.dll
Again, excellent improvement!
Code:
SIZE       TIME        PROGRAM
=========================================
958143613  1990.67500  mcm 0.80 -9 -high
881842673  2126.16200  mcm 0.81 -9 -max
MCM v0.82, minor update:
Updates:
Tuned the x86 filter to have fewer bad transforms on non-x86 data.
Tuned some constants through hill climbing / brute force (see the sketch after this list), improved match model handling.
More prefetching + other optimizations (~5-10% faster vs 0.81).
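For the curious, the hill-climb idea amounts to something like the sketch below (mine, not the MCM tuning harness): nudge one constant at a time and keep the change only when the measured cost shrinks; the cost function is whatever you plug in, e.g. total compressed size over a few sample files.
Code:
#include <cstddef>
#include <functional>
#include <vector>

// Coordinate hill climbing over integer tuning constants (sketch only).
std::vector<int> hillClimb(std::vector<int> params,
                           const std::function<size_t(const std::vector<int>&)>& cost,
                           int rounds) {
  size_t best = cost(params);
  for (int r = 0; r < rounds; ++r) {
    for (size_t i = 0; i < params.size(); ++i) {
      for (int step : {-1, +1}) {
        params[i] += step;
        size_t c = cost(params);
        if (c < best) best = c;           // keep the improvement
        else          params[i] -= step;  // revert the nudge
      }
    }
  }
  return params;
}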
TODO:
Rewrite text / binary detector + add more filters (lot of work).
Enwik8 -9 -high:
v0.80: 19505098 in 90.556s
v0.81: 19381820 in 85.029s
v0.82: 19318179 in 77.556s
SFC -9:
-mid: 10,307,681
-high: 10,246,239
-max: 10,207,930
Silesia -9:
-mid: 40,541,614
-high: 40,251,527
-max: 40,091,648
Source at: https://github.com/mathieuchartier/mcm/tree/v0.82
Nice improvement Mat :)
Also, I did some other quick tests and MCM 0.82 beats ZCM and CCMX every time for compression ratio. You are on your way to beating NanoZip. Once data type detection/filters are implemented, I believe you will beat NanoZip for ratio (since you have already beaten it for speed).
Code:
SIZE       TIME        PROGRAM
=========================================
881842673  2126.16200  mcm 0.81 -9 -max
879245585  1846.88800  mcm 0.82 -9 -max
Don't give up on this project. :)
I am proud to announce mcm v0.83 which has nice improvements in text.
New features:
2 pass approach: analyze then compress streams one at a time
Dynamic dictionary preprocessor using 0 - 600000+ words depending on file.
Special .wav compressor (not great ratio, but fast at >10MB/sec).
-store option for using just filters and dictionary preprocessor (undocumented)
New LZP idea for text: only use LZP if the match length is greater than ~13, otherwise use the match model (see the sketch after this list). This works better than never using LZP according to my tests.
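The gating itself is just a length threshold; a tiny sketch of the per-byte decision (the names and the exact constant are mine, not from the MCM source):
Code:
#include <cstddef>

// Length-gated LZP (sketch): short matches stay with the ordinary match
// model; the LZP bit is only coded once the current match is long enough.
constexpr size_t kLzpMinLen = 13;  // "~13" from the post; presumably a tuned constant

enum class ByteMode { Lzp, MatchModel };

ByteMode chooseMode(size_t currentMatchLen) {
  return currentMatchLen > kLzpMinLen ? ByteMode::Lzp : ByteMode::MatchModel;
}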
Sample results:
calgary.tar -x9: 667,097 in 3s
silesia -m9: 40,340,745 in 102s
silesia -x9: 39,772,294 in 135s
enwik8 -m9: 18,623,547 in 51s
enwik8 -x9: 18,376,397 in 69s
Future ideas:
Is there a way to do a per-sample CM for audio samples? Logistic mixing makes no sense there, but something like paq6-style linear mixing might work. I want a way to have negative weights though.
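One way to get signed weights is a linear mixer with an LMS-style update; the sketch below shows only that generic idea, not paq6's actual evidence-counting mixer and not anything from MCM.
Code:
#include <cstddef>
#include <vector>

// Linear (non-logistic) mixing with signed weights (sketch only).
struct LinearMixer {
  std::vector<double> w;  // signed weights, one per input model
  explicit LinearMixer(size_t n) : w(n, 1.0 / n) {}

  double mix(const std::vector<double>& p) const {
    double y = 0.0;
    for (size_t i = 0; i < p.size(); ++i) y += w[i] * p[i];
    return y < 0.001 ? 0.001 : (y > 0.999 ? 0.999 : y);  // clamp to a valid probability
  }

  // LMS update toward the observed bit: weights move by error * input, and
  // nothing stops them from going negative (handy when a predictor is
  // anti-correlated with the residual, as can happen with audio).
  void update(const std::vector<double>& p, int bit, double rate = 0.002) {
    const double err = bit - mix(p);
    for (size_t i = 0; i < p.size(); ++i) w[i] += rate * err * p[i];
  }
};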
Caution:
This version has a lot of changes and could be buggy. The dictionary-building memory usage could be HUGE if the file is very large. Also, the 32-bit version isn't well tested and might be broken on >4 GB files.
Source at:
https://github.com/mathieuchartier/mcm/tree/v0.83
Good improvement!
Keep working on it! :)
Code:
SIZE       TIME        PROGRAM
=========================================
879245585  1846.88800  mcm 0.82 -9 -max
874226934  1336.988    mcm 0.83 -x9
I am glad that you listened to the suggestions and added LZP to the program. I definitely did not expect these great strides. Only one thing I wanted you to consider: I've seen that mcm still uses floating-point numbers, and I don't think that makes it compatible across all CPUs (you may want to test the archives on various CPUs). However, the strength of the program is definitely the text processor, which is a strong point (I hope to see Matt's LTCB results soon).
Hi Nania, I think the latest versions "should" not use floats any more; where did you see float numbers in the coding process? I moved to using stems for squash / stretch a while back. One thing you may have seen is that dictionary preprocessing is non-deterministic, i.e. 32-bit mcm builds a slightly different dictionary than 64-bit. However, this is benign since I encode the dict at the start of the file.
I still feel like MCM's LZP is not as strong as ZCM's LZP; for example, FP.log is slower and compresses less with MCM (use -filters=none) than with ZCM 0.92. When you encode LZP, do you use a single bit for LZP / no LZP, or do you encode a whole number for the match length?
I'm glad you removed the float calculations (I was evidently wrong). I'm working on version 0.93 of ZCM, and it will be difficult to compete with the current MCM.
O__O
Strange that it's gone unnoticed, but MCM 0.83 -x11 enters the Top 10 of LTCB.
So, with an overall size of 144 934 149 bytes, MCM 0.83 -x11 takes 7th place, beating such monsters as NanoZip, xwrt, fp8 and WinRK.
Code:
mcm 0.83 x64 -x11  enwik8   18 233 295   44.500s   5839 MB   40.203s   5707 MB
mcm 0.83 x64 -x11  enwik9   144 854 575  394.421s  5961 MB   281.250s  5723 MB
Decompresser is 79 574 (sources from GitHub, size optimized by removing comments, KZIP /b512)
Considering the speed MCM reaches this result with, it's simply astonishing.
Decompression is correct, tested on i7-2700K, WinXP x64 SP2. Size optimized sources attached.
Thanks for measuring this. I think there is still a fair bit of room for improvement by assigning better codes for the dictionary. Keep in mind that beating those compressors while using 3-4x the RAM isn't exactly apples to apples :)
Right now I have basic multi-file support done for Windows, and I am also working on simple block-based dedupe. These will be part of 0.84, but I don't think there will be any significant changes to the dictionary or CM core.
EDIT:
Thx Matt, I'll try to make sure that is fixed in the next version.
It's about time I updated LTCB. @Skymmer, thanks for testing. http://mattmahoney.net/dc/text.html#1449
mcm32.exe -x9 failed on silesia/dickens. Decompression stalled, and when I killed it the output was the right size but all zero bytes. The rest of silesia was OK.
Hi Mat,
Thanks for mcm. Can you help with compilation on Linux and FreeBSD (and, lowest priority, OS X)? It seems that make.sh refers to a missing File.cpp. And then clang++ gave some errors on OS X. Also, if anyone else has gotten it to compile and has a static executable to share, that would be great.
I don't see File.cpp in make.sh. I just tried again in 64 bit Ubuntu/g++ 4.8.2 and make.sh works both on the github sources and Skymmer's optimized sources. make.sh looks like this:
g++ -DNDEBUG -O3 -fomit-frame-pointer -msse2 -std=c++0x -D_FILE_OFFSET_BITS=64 -o mcm CM.cpp Archive.cpp Huffman.cpp MCM.cpp Memory.cpp Util.cpp Compressor.cpp LZ.cpp -lpthread
Seems like you're using an old checkout. The current git make.sh looks like this:
Code:
g++ -DNDEBUG -O3 -fomit-frame-pointer -msse2 -std=c++0x -D_FILE_OFFSET_BITS=64 -o mcm CM.cpp Archive.cpp Huffman.cpp MCM.cpp Memory.cpp Util.cpp Compressor.cpp LZ.cpp File.cpp -lpthread
The issue has already been reported (just over two weeks ago).
Did I miss something? What is mcm32 in your .7z distribution? I assumed it is a 32-bit version. However, it doesn't work on XP: 'not a win32 program'. John
I'm adding these flags when compiling with MSVC:
-link -MACHINE:x86 -SUBSYSTEM:CONSOLE,5.01 -LARGEADDRESSAWARE
-link -MACHINE:x64 -SUBSYSTEM:CONSOLE,5.02
on the built executable you can do
editbin.exe a.exe /SUBSYSTEM:CONSOLE,5.01 /LARGEADDRESSAWARE
Thanks. So can the distribution please be updated, Matt, so that mcm32 runs on XP? TIA, John
Yes, I was using an old checkout.
When I compile under Linux Mint Cinnamon 17.1 I get this:
-------------------------------------------------------------------------------------------
blackhole@galaxy ~/mcm $ ./make.sh
g++: error: File.cpp: No such file or directory
This is the content of "make.sh":
----------------------------------------------
g++ -DNDEBUG -O3 -fomit-frame-pointer -msse2 -std=c++0x -D_FILE_OFFSET_BITS=64 -o mcm CM.cpp Archive.cpp Huffman.cpp MCM.cpp Memory.cpp Util.cpp Compressor.cpp LZ.cpp File.cpp -lpthread
When i remove "File.cpp" from "make.sh" i got this:
--------------------------------------------------------------------------------
/tmp/ccDDyso4.o: In function `Archive::writeBlocks()':
Archive.cpp:(.text+0x289c): undefined reference to `FileList::write(Stream*)'
/tmp/ccDDyso4.o: In function `Archive::readBlocks()':
Archive.cpp:(.text+0x2caa): undefined reference to `FileList::read(Stream*)'
/tmp/ccDDyso4.o: In function `Archive::list()':
Archive.cpp:(.text+0x2d6f): undefined reference to `FileInfo::attrToStr(unsigned short)'
/tmp/ccDDyso4.o: In function `Archive::compress(std::vector<FileInfo, std::allocator<FileInfo> > const&)':
Archive.cpp:(.text+0x3aa9): undefined reference to `FileList::addDirectoryRec(std::string const&, std::string const*)'
/tmp/ccE9QpCX.o: In function `Options::parse(int, char**)':
MCM.cpp:(.text._ZN7Options5parseEiPPc[_ZN7Options5parseEiPPc]+0xf): undefined reference to `FileInfo::FileInfo(std::string const&, std::string const*)'
MCM.cpp:(.text._ZN7Options5parseEiPPc[_ZN7Options5parseEiPPc]+0x3e6): undefined reference to `FileInfo::FileInfo(std::string const&, std::string const*)'
MCM.cpp:(.text._ZN7Options5parseEiPPc[_ZN7Options5parseEiPPc]+0x476): undefined reference to `FileInfo::FileInfo(std::string const&, std::string const*)'
MCM.cpp:(.text._ZN7Options5parseEiPPc[_ZN7Options5parseEiPPc]+0x600): undefined reference to `FileInfo::FileInfo(std::string const&, std::string const*)'
MCM.cpp:(.text._ZN7Options5parseEiPPc[_ZN7Options5parseEiPPc]+0x689): undefined reference to `FileInfo::FileInfo(std::string const&, std::string const*)'
/tmp/ccE9QpCX.o:MCM.cpp:(.text._ZN7Options5parseEiPPc[_ZN7Options5parseEiPPc]+0x9a9): more undefined references to `FileInfo::FileInfo(std::string const&, std::string const*)' follow
/tmp/ccE9QpCX.o: In function `main':
MCM.cpp:(.text.startup+0x9a): undefined reference to `FileInfo::FileInfo()'
collect2: error: ld returned 1 exit status
Note:
-------
Take into consideration that it's not an option to use the old "File.hpp", since that one contains a different version of the FileInfo and FileList classes.
I hope you will take this issue into consideration, and thanks anyway for the beautiful code.
MCM v0.83 skbuild1
- now you can set the memory level up to 13, allowing slightly better compression (be careful, it is very memory hungry!)
- console output has been redesigned for better viewing; it also reveals some undocumented options for more precise control over filtering and LZP
- the 32-bit compile is a static GCC build instead of MSVC, so there is no longer a requirement for the MSVC runtimes
This is remarkable. MCM took a 279 KB CreateJS.js file down to 35 KB. The best I could do with 7Z was 46 KB.
NOTE: It wouldn't compress JS files directly. I guess it's not supported. I had to convert them to .txt first.
It would help if you could post the command-line syntax and arguments (I had some trouble there), along with the supported source file formats. I understand if you don't want to bother with direct JS support, since on the web we're forced to deliver JS as gzip anyway, but it would be helpful to know all the supported file types.
I really wish we could get some movement on web compression formats beyond gzip. I'm not a gzip hater – it's unbelievably efficient and light, amazing that something written in the 80s and 90s could still be so hard to beat on CPU and memory use at a given comp ratio and decomp speed. But certainly we must be able to develop a format at this point that can match gzip CPU, memory efficiency, and decomp speed, while picking up a big win on compression ratio. MCM seems to have potential there.
Matt Mahoney, thanks for your benchmarks page. Question: Is the mem column really supposed to be in megabytes, as indicated in your legend? If so, it means lots of programs are using many GBs of RAM, like cmix v6 using 30+ GBs (first row). Most PCs don't have the amount of memory this table suggests these programs are using.
Relatedly, could you post the specs and config for the machine you're using for the benchmarks? Sorry if it's already posted and I missed it...
Yes, mem is in MB, and yes, cmix requires 30 GB of RAM. The note column shows the spec for the test machine. Most of my recent tests are on a Linux machine (note 48), but some of my older tests are on other machines. I realize that speed depends on CPU and compression ratio depends on memory. I don't want to put any limits on hardware. If your compressor is top ranked but only runs on a quantum computer :D then we learned something.