Hello everyone, here is my CSC 3.2 Alpha 2.
Here is CSC on Compression Ratings.
It's a complete rewrite of CSC3.1, aiming at a better ratio and faster speed.
The main algorithm is almost the same -- LZ77 + arithmetic coding.
But to achieve better performance, I made the following changes:
1. Abandoned the HashChain match finder; a hash table is used instead. The number of match candidates is only 1-6 (the hidden -mx parameter is 12). That means in -m1 there is only one hash-6-bytes candidate for each string match.
2. On most data, LZ is very good, but on bad data like audio it is very slow. So matches are only searched at the beginning of small blocks, and hash-6 strings are inserted every several bytes. Then it performs like a skip mode (30MB/s+) but can still find duplicate blocks (try a tar containing several identical zipped files).
3. Detection: CSC3.1 calculates each column's (1/2/3/4/8/16) entropy every 64KB to determine whether delta is needed. It's slow. Now this code is used instead:
while (progress < size - 12)
{
    sameDist[0] += abs((signed)src[progress] - (signed)src[progress+1]);
    sameDist[1] += abs((signed)src[progress] - (signed)src[progress+2]);
    sameDist[2] += abs((signed)src[progress] - (signed)src[progress+3]);
    sameDist[3] += abs((signed)src[progress] - (signed)src[progress+4]);
    sameDist[4] += abs((signed)src[progress] - (signed)src[progress+8]);
    sameDist[5] += abs((signed)src[progress] - (signed)src[progress+12]);
    progress++;
}
The MinBlockSize is 8KB. If the largest sameDist[i] and the smallest sameDist[i] differ a lot, that often indicates the block is table data, and the stride of the smallest sameDist[i] is the channel number (see the sketch after this list).
4. Improved parsing and model. There are 5 kinds of packets in the LZ output stream:
normal match, match with a repeat distance (4 of them), 1-byte match (last distance) (helps a lot on data tables), same match (last length and distance), and literal.
Literals use an order-1 model (3 MSBits as context). Match distance coding uses the match length as context.
5. The range coder was changed to the same one as in LZMA. It seems 10-20% faster in decompression compared to CSC3.1's coder.
6. The dictionary size is enlarged a lot --- 320MB (actually 500MB+ is possible but seems meaningless). Due to the changes in the match finder, memory usage is only 40MB + dictionary size in -m1 mode. Decompression needs 5MB + dictionary size, I/O buffer included.
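To make the detection decision in item 3 concrete, here is a minimal sketch of how the accumulated sameDist[] values could be turned into a delta stride. The threshold, the return convention and the function name are assumptions for illustration, not the exact logic used in CSC.
Code:
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

static const int kDist[6] = {1, 2, 3, 4, 8, 12};   // strides tested by the loop above

// Returns the detected channel count (stride) for delta filtering,
// or 0 if the block does not look like table/multimedia data.
int DetectDeltaStride(const uint8_t *src, size_t size)
{
    if (size < 13) return 0;
    uint64_t sameDist[6] = {0};
    for (size_t i = 0; i + 12 < size; i++)
        for (int k = 0; k < 6; k++)
            sameDist[k] += (uint64_t)std::abs((int)src[i] - (int)src[i + kDist[k]]);

    const uint64_t lo = *std::min_element(sameDist, sameDist + 6);
    const uint64_t hi = *std::max_element(sameDist, sameDist + 6);

    // Hypothetical threshold: only call it a table if the best stride
    // is much better than the worst one.
    if (lo * 3 >= hi) return 0;
    for (int k = 0; k < 6; k++)
        if (sameDist[k] == lo) return kDist[k];
    return 0;
}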
Still to improve:
LZ parsing still has bugs; its price evaluation doesn't work. Check e.coli -- theoretically it should have no matches and be compressed to 2 bpb.
The delta coding is far from satisfying. Thanks to Nania, but his work distinguishes data by the file header, so what should I do if I don't know the width of a bitmap? Sami advises that a linear predictor + order 0 is good enough, but I lack knowledge of linear predictors. Turning CSC into an archiver is also a kind of solution.
Technical changes in CSC3.2 Alpha 3:
1. Removed the repeat match in LZ. Very small ratio improvement on most files.
2. Changed literal coding a little, leading to a very small ratio improvement on all files.
3. Changed delta coding; improved decoding speed.
4. Cached some bits in the match finder, which improved compression speed.
Technical changes in CSC3.2 Alpha 4:
1. Improved compression speed on data that is hard to compress.
2. Fixed a bug that hurt compression.
3. Remapped modes: -m0 -> -m1, -m1 -> -m2, -m2/3 -> -m3. The default is -m2 now.
4. Dictionary size is more flexible now; it can be any number of KB from 32KB to 512MB.
Technical changes in CSC3.2 Alpha 5:
1. Improved compression speed a little.
2. Added a very weak and experimental DICT preprocessor (not Bulat's :) ). It improved the ratio by about 1% on text.
Technical changes in CSC3.2 Alpha 6:
1. Very small improvements, mostly to DICT.
2. This is the last version of CSC. The next version will be an archiver, maybe called CSArc.
Technical changes in CSC3.2 Beta 2:
The new version puts more emphasis on compression ratio than on compression efficiency.
1. Rewrote the LZ part for -m2/-m3 mode. Removed the previous -m1 mode; now -m1 is the previous -m2.
The LZ (-m2/3) now uses better parsing, but is much slower than before.
The literal coder is now complete order-1, while the previous one used only the 3 MSBs as context (a tiny sketch of the two schemes follows after this list).
2. New delta detector, but it seems no better than the previous one.
3. It's now an archiver; much of the code is from YZX (i.e. from Sami Runsas).
4. Some other small modifications.
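As a tiny illustration of the literal-context change in item 1, the sketch below contrasts the old 3-MSB context with a full order-1 context. The function names are made up, and the probability models these contexts would index are not shown.
Code:
#include <cstdint>

// old scheme: only the top 3 bits of the previous byte -> 8 contexts
static inline unsigned LiteralContext3MSB(uint8_t prevByte)   { return prevByte >> 5; }

// new scheme: the whole previous byte -> 256 contexts (complete order-1)
static inline unsigned LiteralContextOrder1(uint8_t prevByte) { return prevByte; }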
Changes in CSC3.2 Final:
Fixed a decompression error in the previous version.
http://en.wikipedia.org/wiki/Linear_prediction
Also, did you look at the bsc source?
Its compression methods may be different, but detectors and filters
can be used as a reference.
Quote Originally Posted by Fu Siyuan: "3. Detection: CSC3.1 calculates each column's (1/2/3/4/8/16) entropy every 64KB to determine whether delta is needed. It's slow. Now this code is used instead: [the sameDist loop quoted above]"
1 - This IS NOT entropy; you have used some kind of heuristic.
2 - Linear predictors are not very complex. Just think of a series of coefficients: at each step you apply a convolution with those coefficients. You can even use heuristics to update the coefficients if you want to make it adaptive, or use the least squares method to obtain the coefficients (a small sketch follows after this list).
3 - "Same match(last len and distance)" option seems unappropriate. Because, it does not occur too much IIRC. You can dump statistics of some files to see it really helps or not.
BIT Archiver homepage: www.osmanturan.com
I just downloaded the bsc source. It's a nice reference for detectors. :)
To Osman:
1. You misread it. CSC3.1 uses entropy; this one does not. However, I finally understand why you mention the word "heuristics" so much.
2. I will give it a try. I also read the RAR audio filter. "Heuristics" often work for me.
3. Yes, you are right. Same match only benefits a few files. I will fix it in the next alpha edition.
have you seen mmdet and delta in the freearc sources?
are you use "caching"? i think that saving 4 bytes of each match as check key along with pointer should reduce number of extra memory accesses to minimum. and since L2 cache line is 64 byte, you may access 8 matches without extra penalty:That means in -m1 there is only 1 hash-6-bytes candiate on each string matching.
Code:hash = hash_func(p[0]...p[5]) key = *(uint32*)p for (i=0; i<8; i++) { if (ht[hash][i].key == key) { match_ptr = ht[hash][i].ptr match_len = memcmp_bytes(p, match_ptr) .... } }
Quote Originally Posted by Bulat Ziganshin: "have you seen mmdet and delta in the freearc sources?"
Truly speaking, it's not easy for me to understand all your ideas, though I always keep trying to. At the same time I also try some ideas of my own. BTW, your dict preprocessor attracts me more and I'm considering implementing one.
Quote Originally Posted by Fu Siyuan: "Truly speaking, it's not easy for me to understand all your ideas, though I always keep trying to."
of course it's hard, since my comments are in russian rather than chinese
the dict implementation is amazingly optimized for speed, but its algorithm isn't ideal. look at the data produced by WRT: preprocessing by WRT makes further ppmd compression faster, while preprocessing by DICT makes it slower
Based on my observation, hash misses (finding the same hash but not a match) aren't frequent, so I didn't use the "if (ht[hash][i].key == key)" check.
In my program, i < 1 (-m1), 3 (-m2), 6 (-m3). However, more candidates still slow it down a lot, and not because of memory access: checking longer matches itself takes noticeably long.
Another test: an 8-byte hash helps a lot on enwik8/9, but gives a much worse ratio on binary data.
Quote Originally Posted by Fu Siyuan: "hash misses (finding the same hash but not a match) aren't frequent"
every memory access costs ~100 ticks
Quote Originally Posted by Fu Siyuan: "more candidates still slow it down a lot, and not because of memory access: checking longer matches itself takes noticeably long"
much more than 100 ticks? ;) taking into account that most of the checked matches will be less than 4 bytes
you may gather stats, for example on enwik8 and my dll100.dll, with 1..8 matches checked: % of hash misses and average match_len (in terms of my code)
Quote Originally Posted by Fu Siyuan: "Another test: an 8-byte hash helps a lot on enwik8/9, but gives a much worse ratio on binary data."
many years ago i found the same with a hash4 matchfinder instead of the hash3 widely used in those times :)
now i think that one should use a separate hash table that fits in the L2 cache. this means about 2^20 entries, so it may be used only for matches up to 5 or 6 bytes. btw, it's better organized as hc4 since it doesn't suffer from the main hc4 problem - lots of memory accesses
although if one needs maximum-maximum compression w/o bt4, she may use a separate hash with a size of up to 100 mbytes, so it may find matches up to 7-8 bytes
"linear predictor + order0" is very simple thing. you subtracts successive values (over N bytes) - this is called linear predictor, and then computes order0 compression of resulted dara. what N gives best compression and if it gives better compression than order0 of original data - it should be used. it's how mmdet works - you may run its executable ( http://freearc.org/Research.aspx ) on wav/bmp files
so, delta preprocessing by itself is called linear predictor :) simplest way to check whether it helps is to compare order0 compression before and after applying it
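A small sketch of the test described above, under the assumption that order-0 cost is estimated directly from byte frequencies: delta-filter the block with each candidate distance N, compare the order-0 code length against the unfiltered data, and keep the best N (or none). The candidate widths and function names are illustrative, not mmdet's actual code.
Code:
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// order-0 code length (in bits) of an n-byte block
static double Order0Bits(const uint8_t *p, size_t n)
{
    size_t freq[256] = {0};
    for (size_t i = 0; i < n; i++) freq[p[i]]++;
    double bits = 0.0;
    for (int s = 0; s < 256; s++)
        if (freq[s]) bits -= (double)freq[s] * std::log2((double)freq[s] / (double)n);
    return bits;
}

// Returns the best delta distance N, or 0 if the raw data already codes better.
int PickDeltaDistance(const uint8_t *src, size_t n)
{
    static const int cand[] = {1, 2, 3, 4, 8, 12};   // candidate channel widths
    double best = Order0Bits(src, n);
    int bestN = 0;
    for (int N : cand) {
        if ((size_t)N >= n) break;
        std::vector<uint8_t> d(n);
        for (size_t i = 0; i < n; i++)
            d[i] = (i < (size_t)N) ? src[i] : (uint8_t)(src[i] - src[i - (size_t)N]);
        double bits = Order0Bits(d.data(), n);
        if (bits < best) { best = bits; bestN = N; }
    }
    return bestN;
}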
Quote Originally Posted by Bulat Ziganshin: ""linear predictor + order0" is a very simple thing. you subtract successive values (over N bytes) - this is called a linear predictor - and then compute the order0 compression of the resulting data. [...]"
Yeah, I know delta is a kind of linear predictor; Matt explained this in his article. But only subtracting successive values is not enough to get good performance on wav/bmp. The current CSC3.2 uses delta and order-1; it's slow and does not perform as well as RAR, especially on bitmaps. CSC3.1 does what you said. The new detection method is an experiment; it should be simpler and faster.
i talked only about the detector, since you said that you don't understand what "linear predictor + order0" is
overall, MM compression is its own area, and lz+filter is a very poor man's approach here :)
Hi Bulat:
Thanks for your advice about caching. To avoid using too much memory, I use only 3 bits to store the key (3-bit key + 29-bit dictionary position). On many test cases, the program speeds up by nearly 10%.
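A possible layout for such a packed hash-table entry, assuming the 3-bit key is taken from the high bits of the hash (the post does not say how the key is derived): each slot is one 32-bit word holding a 3-bit check key and a 29-bit dictionary position, so a candidate can be rejected without touching the dictionary.
Code:
#include <cstdint>

static const uint32_t kPosBits = 29;
static const uint32_t kPosMask = (1u << kPosBits) - 1;   // low 29 bits: position
                                                          // high 3 bits: check key

// Build a slot from a dictionary position and the string's hash value.
static inline uint32_t MakeSlot(uint32_t pos, uint32_t hash)
{
    uint32_t key = hash >> 29;                  // reuse 3 high hash bits as key (assumption)
    return (key << kPosBits) | (pos & kPosMask);
}

// Cheap pre-check before verifying the actual bytes in the dictionary.
static inline bool KeyMatches(uint32_t slot, uint32_t hash)
{
    return (slot >> kPosBits) == (hash >> 29);
}

static inline uint32_t SlotPos(uint32_t slot)
{
    return slot & kPosMask;
}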
csc32 tests. http://mattmahoney.net/dc/text.html#2502
Very very nice results for this version!
Super and fast LZ77! Nice job!
Some tests:
TCUP: (results not preserved)
Exe: (results not preserved)
Sorry, no compression timings, but it felt fast.
New update of my program, with source code.
It uses better flexible parsing.
Now it's OK. I'm not sure whether the code builds without Visual C++ 2008.
Updated results for csc32 final. http://mattmahoney.net/dc/text.html#2299
Any progress on it? :)