TANGELO is a single-file compressor (not an archiver) derived from PAQ8/FP8, licensed under the GPL.
I removed a lot of stuff from FP8 to make it as simple as possible, so it has small source code and it is easier to understand how its core works (I think). The compression engine should still be the same as the one in FP8.
Specialized models/transformations for EXE / images / audio / JPEG / ... are all removed. You can't pack multiple files with TANGELO. You can't select how much memory it uses: about 550-600 MB (same as FP8 with option -7).
Its source is about 23 KB (compared to 149 KB for FP8). It should have similar performance to FP8 on text and unknown/default data.
Code:
Usage: TANGELO <command> <infile> <outfile>
Commands:
  c  Compress
  d  Decompress
Updated Silesia benchmark. http://mattmahoney.net/dc/silesia.html
Compared to fp8_v3 -7, compression is better on structured text (nci and webster) but worse on x86 (ooffice; I guess because there is no E8E9 filter).
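For context, a rough sketch of what an E8E9 filter does. This illustrates the general technique only; it is not fp8's actual filter code, and the function name is made up. x86 CALL (0xE8) and JMP (0xE9) opcodes are followed by a 32-bit little-endian relative offset; rewriting it as an absolute target makes repeated calls to the same address byte-identical, which the models pick up much more easily.

Code:
#include <cstddef>
#include <cstdint>

void e8e9_forward(uint8_t* buf, size_t n) {
    for (size_t i = 0; i + 5 <= n; ++i) {
        if (buf[i] == 0xE8 || buf[i] == 0xE9) {
            // read the little-endian 32-bit offset after the opcode
            uint32_t rel = (uint32_t)buf[i+1]
                         | (uint32_t)buf[i+2] << 8
                         | (uint32_t)buf[i+3] << 16
                         | (uint32_t)buf[i+4] << 24;
            uint32_t target = rel + (uint32_t)(i + 5);  // relative -> absolute
            buf[i+1] = (uint8_t)target;
            buf[i+2] = (uint8_t)(target >> 8);
            buf[i+3] = (uint8_t)(target >> 16);
            buf[i+4] = (uint8_t)(target >> 24);
            i += 4;  // skip the rewritten operand
        }
    }
}
// The inverse transform subtracts the position instead of adding it.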
LTCB will have to run overnight.
Edit: LTCB updated. http://mattmahoney.net/dc/text.html#1532
Speed is about the same as fp8_v3 -8 (5.5 hours to compress or decompress enwik9) but compression is a bit worse due to using only half as much memory.
drt|tangelo would probably compress enwik8 and/or enwik9 tighter than drt|lpaq9m, while using about a third as much memory.
Ran it with DRT on enwik8, enwik9:
enwik8: drt|tangelo 17681785 bytes in 809.17s
enwik9: drt|tangelo 148758265 bytes in 8153.09s
Decompression not verified. Computer: Core i7-2630QM, 8 GB RAM.
Very nice compression!
Some results with drt + various compressors (as of June 2010). http://mattmahoney.net/dc/text.html#1440
lpaq9m is tuned for drt output on enwik8/9.
Quote (Matt Mahoney): "lpaq9m is tuned for drt output on enwik8/9."

It doesn't look like it's heavily tuned. Slightly more than the following two are:
Compressor ... ratio (dic+drt compressed size divided by enwik8 compressed size)
paq8px_v67 ... 0.9480
paq8l ... 0.9483
...
lpaq9m ... 0.9478
(from the last table in http://mattmahoney.net/dc/text.html#1440 )
TANGELO 2.0
- removed APMs
- removed some modeling (simpler model)
- simpler StateMap and ContextMap
- removed the DMC model
- uses less memory and is faster
- weaker compression
- state table from Mat Chartier, from this thread: http://encode.su/threads/1742-Improv...state-machines
Thanks Jan for the great job you're doing, but I think you should credit yourself as the creator of the program, which, although similar, is very different from PAQ8. I think you should add an LZP stage to make it as fast as PAQ9 (you could reduce the contexts and take only the most significant ones). You should create an archiver (Sami Runsas has put a free one online), and solid mode would improve compression by 10-20%. I would like to work with you, Matt and Mat Chartier on a super archiver!
WCC2013 results are excellent!
Best Regards, Francesco!
I don't think I will have time to develop a new program. But I have one idea I want to experiment with: use static Huffman coding before the modeling and context mixing, to improve speed on redundant data (fewer bits will be modeled, mixed, and coded per byte on average). It would be somewhat similar to how Huffman-coded data are handled in the PAQ8 JPEG model. Has any of you tried something like that? What do you think?
I think you answered it yourself. It is like kung fu: if you can't beat the enemy, become his friend. Simply copy the data when the probability is middling (near 0.5), as in CSC 3.2! With Huffman you would just come up empty-handed!
A different alphabet decomposition (a mapping of symbols from an N-ary alphabet to a set of prefix codes) certainly is a good idea. I've implemented order-1 Huffman decomposition (256 Huffman trees, one per order-1 context) in my old M1 and M1x2 compressors; see my homepage and check the most recent version. There was a speedup of roughly 30-50% and compression remained almost the same. I guess in your case it'll be bigger, since I mixed at most four models. However, the way you group symbols of the same Huffman code length has some influence on compression. I use a heuristic called "Huffman-III decomposition": http://www.sps.ele.tue.nl/members/f....CTW/Ben99x.pdf.
Hope this helps.
Cheers
M1, CMM and other resources - http://sites.google.com/site/toffer86/ or toffer.tk
Updated LTCB (so far just enwik8) and Silesia.
http://mattmahoney.net/dc/text.html#1532
http://mattmahoney.net/dc/silesia.html
Quote (Jan Ondrus): "use static Huffman coding before the modeling and context mixing, to improve speed on redundant data"

Depends on what you want to achieve. If you are doing backups, then the most important speed optimizations are detecting already-compressed data (to store it as-is) and deduplication. This is because on typical disks most of the data is already compressed, and a lot of space tends to be wasted on extra copies of files. The next best tricks are the E8E9 transform (because x86 and x86-64 code is a common uncompressed type) and grouping small files with the same extension so they compress together. Most compressible data is binary rather than text, so it is useful to have sparse models and fixed-record-size models, and you don't need a lot of memory (an exception is DNA). Most benchmarks are unrealistic in this sense: they tend to have large text files, no duplication, and to exclude already-compressed files.
Edit: updated enwik9. http://mattmahoney.net/dc/text.html#1532
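On the point above about detecting already-compressed data: one cheap test is to estimate the order-0 entropy of each block and store the block uncompressed when the estimate is close to 8 bits per byte. This is a sketch of a common approach, not fp8's or tangelo's actual detector:

Code:
#include <cmath>
#include <cstddef>
#include <cstdint>

// Order-0 entropy estimate in bits per byte; values near 8.0 suggest the
// block is already compressed (or encrypted) and should be stored.
double order0_bits_per_byte(const uint8_t* buf, size_t n) {
    if (n == 0) return 0.0;
    size_t count[256] = {0};
    for (size_t i = 0; i < n; ++i) ++count[buf[i]];
    double h = 0.0;
    for (int s = 0; s < 256; ++s)
        if (count[s]) {
            double p = (double)count[s] / (double)n;
            h -= p * log2(p);
        }
    return h;
}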
Here is version TANGELO 2.1.
It is faster again, with weaker compression.
changes:
- one mixer is used per bit, selected from 256 possible mixers by the previous byte as context (see the sketch after this list)
- removed all models except match and orders 0, 1, 2, 3, 4, 6
- higher-order models (2, 3, 4, 6) should be disabled for random-looking (already compressed) data, for better speed
- probabilities for states are now fixed (the StateMap class is replaced by an array of probabilities)
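To illustrate the first change, a floating-point sketch of the selection scheme; the names, the model count, and the learning rate are assumptions, and TANGELO itself works in fixed point:

Code:
#include <cmath>
#include <cstdint>

const int NMODELS = 7;  // assumed: match + orders 0,1,2,3,4,6

// A bank of 256 logistic mixers, one chosen per bit by the previous byte.
// Each mixer combines stretched model predictions with learned weights.
struct Mixer {
    double w[NMODELS] = {0};
    double st[NMODELS];  // stretched inputs, saved for the update step
    // p[i] in (0,1): each model's probability that the next bit is 1
    double mix(const double* p) {
        double dot = 0;
        for (int i = 0; i < NMODELS; ++i) {
            st[i] = log(p[i] / (1 - p[i]));  // stretch
            dot += w[i] * st[i];
        }
        return 1 / (1 + exp(-dot));  // squash back to a probability
    }
    void update(int bit, double pr, double lr = 0.002) {
        double err = bit - pr;  // prediction error drives the weight update
        for (int i = 0; i < NMODELS; ++i) w[i] += lr * err * st[i];
    }
};

Mixer mixers[256];

// one mixer per coded bit, selected by the last whole byte seen
Mixer& select_mixer(uint8_t prev_byte) { return mixers[prev_byte]; }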
Updated LTCB and Silesia benchmark.
http://mattmahoney.net/dc/text.html#1532
http://mattmahoney.net/dc/silesia.html
Made a tiny change: now it works on x64 platforms (and is able to reserve more than 2 GB of memory) without harming the initial setup.
TANGELO 2.3
- (re)added simple APM for better compression
- some small changes for better speed
Updated LTCB and Silesia benchmarks.
http://mattmahoney.net/dc/text.html#1532
http://mattmahoney.net/dc/silesia.html
This tangelo 2.3 compile keeps crashing on my system - which version of libstdc++-6.dll is needed?
Ah, fixed: I had the version from 21.09.2011; it works with the version from 16.10.2012.
I forgot to mention that tangelo.exe did not run because it was looking for some cygwin DLL files. I recompiled it from source for the test. The problem could be fixed by compiling with -static.
@Jan Ondrus, are you sure the construction around 'bytes_read' and 'bytes_written' is working properly? With 'enwik8' variable 'rn' does not change ...
@mahessel, rn should only become one if tangelo detects that the input is compressing poorly. It is expected that rn should never change on highly compressible files such as text/xml.
Yes, exactly.
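For the curious, a hedged sketch of how such a flag could be driven. This is a guess at the general mechanism, not TANGELO's actual logic: track the average coding cost per bit, and treat the input as random-looking when it stays near 1 bit per bit (8 bits per byte):

Code:
#include <cmath>

struct RedundancyDetector {
    double avg_cost = 0.5;  // running average cost, in bits per coded bit
    // p1 = model probability that the bit is 1; bit = the actual bit
    void update(double p1, int bit) {
        double cost = -log2(bit ? p1 : 1.0 - p1);
        avg_cost += 0.001 * (cost - avg_cost);  // slow exponential average
    }
    bool looks_random() const { return avg_cost > 0.95; }  // near 8 bits/byte
};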
Okay, clear.
Changing the squash table to '0,2,6,11,20,33,52,82,126,193,290,430,626,888,1222,1616,2048,2479,2873,3207,3469,3665,3805,3902,3969,4013,4043,4062,4075,4084,4089,4093,4095' will improve the compression by about 30 KB on enwik9 :)
Do you know why?
First, the curve should start at 0 and end at 4095 (just like the limiter in squash).
Second, the shape of the curve is a choice; in this case a less steep slope performs better.
For example, you can tweak the curve by using:
Code:
#include <cmath>

const double TWEAKME = 150.0;  // slope of the curve; ~150 matches the default
int table[4096];               // the squash lookup table

void build_squash_table() {
  double t[4096];
  // logistic curve centered at 2048
  for (int n = 0; n < 4096; ++n)
    t[n] = 1.0 / (1.0 + exp((2048.0 - n) / TWEAKME));
  // rescale so the table runs exactly from 0 to 4095
  const double offset = t[0];
  const double scale = 4095.0 / (t[4095] - offset);
  for (int n = 0; n < 4096; ++n)
    table[n] = (int)round((t[n] - offset) * scale);
}
Then change TWEAKME to 300; the 'default' squash curve corresponds to a value of about 150.
TANGELO 2.4
- added a fast JPEG model based on the model from paq8fthis_fast.cpp (http://cs.fit.edu/~mmahoney/compression/paq8fthis4.zip)
- this version is without the APM
Vista SP2 32-bit:
- missing files: libstdc++-6.dll, libgcc_s_dw2-1.dll
- after downloading these 2 DLLs and copying them next to tangelo.exe, it works
First look: it seems to be a little bit slow on my Core 2 Duo, but it has good compression - better than zpaq in my test. Needs more testing...
best regards
tangelo results on Silesia
Code:
  Silesia dicke mozil   mr  nci ooff osdb reym samba  sao webst x-ray  xml  Compressor -options
--------- ----- ----- ---- ---- ---- ---- ---- ----- ---- ----- ----- ----  -------------------
 37809279  2078 11228 2133 1029 1826 2168  867  2889 4283  5376  3636  291  tangelo 1.0
 41267068  2246 12479 2229 1320 2051 2330  978  3116 4478  5999  3716  321  tangelo 2.0
 44037765  2279 13895 2227 1580 2301 2449 1038  3298 4524  6306  3778  358  tangelo 2.3
 44847833  2299 14109 2283 1635 2328 2574 1050  3337 4653  6356  3846  371  tangelo 2.1
 44862127  2299 14121 2284 1631 2328 2575 1050  3343 4654  6356  3846  371  tangelo 2.4
I did a half-baked attempt at getting TANGELO to support standard input and output. Seems to work so far. Currently this is version 2.4.
https://github.com/neheb/TANGELO
edit: I should also mention that the recommended compiler switches are probably suboptimal. -O3 -msse2 seems to work best for me.