Exactly two years ago I wrote Kwc compressor for fun as spin-off from a program I wrote to count fix length keywords in text files.
Kwc take every time 6 bytes as "keyword" string and do not store double keywords in output but the unique ID from the already found keyword.
The only novel thing I put in is what I called "gear encoding" what only give gain with unsorted but slowly incurring values ranges.
It's written in Visual Basic 2008 and need Framework 3.5 to run.
It run under Windows as GUI and under command line mode 32 and 64bit.
Command line syntax:
Compress: kwc c input output
Decompress: kwc d input output
Memory usage depends on input file size and unique keywords found I think round 0,5-1,5x times input file size for compression and half of it for decompression.
It can often run recursive 3-6 times till gain stop.
I didn't build in full error handling so when there are to much round 36mln+ unique keywords found in a input file it crash with an overflow. This often happened when input file is >1GB and/or random or already compressed.
Download:
http://www.metacompressor.com/download/kwc.zip
Maybe Matt can tested it at the LTB and GCB so I can compare it against BPE2 what's also a dictionary type compressor.
Last edited by Sportman; 21st January 2010 at 06:49.
Enwik 8: 100 000 000 -> 54 097 740 (45,481 s), 2 510 921 unique keywords
Enwik 8.kwc: 54 097 740 -> 53 960 841 (26,537 s), 8 616 190 unique keywords
Enwik 9: 1 000 000 000 -> 532 622 518 (613,085 s), 10 587 596 unique keywords
Enwik 9.kwc: crashed cca in half because of abovementioned problem (37M codewords)
Did not try decompression. C2Q Q6600 @ 3.0GHz, Win 7 Pro x64
Last edited by Black_Fox; 19th January 2010 at 15:27.
Black_fox thanks for testing.
I removed that 37mln unique keywords overflow limit and replaced old version with it. It give now memory overflow by 47.9mln unique keywords under Windows 64bit even when there is ten times more memory available, this looks like .NET related.
Decompression works. http://mattmahoney.net/dc/text.html#5326
dict: 5 seconds, 52,795,972 bytesQuote Originally Posted by Black_Fox View PostEnwik 8: 100 000 000 -> 54 097 740 (45,481 s), 2 510 921 unique keywords
Here are my results on enwik8. RAM Disk used. Tested on AMD64 4000+, XP SP3.
But I'm curious. Bulat's result is 52 795 972 bytes. I have 3 different versions of dict. One have been taken from Haskell site, another one from here and last one from freearc.org. None of them gives me 52 795 972 bytes so the question is: or we have different enwiks or there is some secret dict version?Code:tool comp. decomp. size ----- ------ ------- ---------- kwc 20.627 9.228 54 097 740 dict 7.173 1.013 52 812 125
Matt, thanks for testing!
I have also tested two versions of Dict (8/6/2007 and 3/13/2009), both give the same output size but different then the two mentioned output sizes. The 3/13/2009 Dict version with was a little slower then the older version.
dict -1 74,710,608 bytes
dict -f 54,763,991 bytes
kwc c 54,097,740 bytes
dict -p 53,897,083 bytes
See link for timings and memory use:
http://www.metacompressor.com/upload...estfile=enwik8
Both Dict versions crash while compressing enwik9 with all three settings.
Is there a compiled bpe2 version for Windows 32/64bit?
http://mattmahoney.net/dc/bpe2v2.cpp
1. i've used arc -mdict actually
2. last update to dict was Aug 2 - fixed crashing bug
3. there are internal settings that cannot be controlled in dict.exe but in -mdict:
MinWeakChars = 20;
MinLargeCnt = 2048;
MinMediumCnt = 100;
MinSmallCnt = 50;
MinRatio = 4;
case 'p': p->MinLargeCnt=8192; p->MinMediumCnt=400; p->MinSmallCnt=100; p->MinRatio=4; continue;
case 'f': p->MinLargeCnt=2048; p->MinMediumCnt=100; p->MinSmallCnt= 50; p->MinRatio=0; continue;
case 'c': p->MinWeakChars = parseInt (param+1, &error); continue;
case 'l': p->MinLargeCnt = parseInt (param+1, &error); continue;
case 'm': p->MinMediumCnt = parseInt (param+1, &error); continue;
case 's': p->MinSmallCnt = parseInt (param+1, &error); continue;
case 'r': p->MinRatio = parseInt (param+1, &error); continue;
default settings are optimized to improve dict+lzma compression. by changing these settings we can improve compression somewhat. the best i can do
D:\testing>arc create a -mdict:l1000:m39:s25:r1:33m enwik8
Compressed 1 file, 100,000,000 => 50,263,682 bytes. Ratio 50.2%
Compression time: cpu 4.13 secs, real 4.53 secs. Speed 22,080 kB/s
making :r floating point will probably allow to break 50 mb barrier :)
you forget to test it w/o any switches :)Quote Originally Posted by Sportman View Postdict -1 74,710,608 bytes
dict -f 54,763,991 bytes
dict -p 53,897,083 bytes
Last edited by Bulat Ziganshin; 20th January 2010 at 15:25.
D:\testing>arc create a -mdict enwik9
Compressed 1 file, 1,000,000,000 => 506,229,029 bytes. Ratio 50.6%
Compression time: cpu 44.57 secs, real 48.01 secs. Speed 20,829 kB/s
D:\testing>arc create a -mdict:l1000:m39:s26:r1:33m enwik9
Compressed 1 file, 1,000,000,000 => 484,991,633 bytes. Ratio 48.4%
Compression time: cpu 48.22 secs, real 53.77 secs. Speed 18,599 kB/s
i don't tried to optimize it for e9, just tried switches that was semi-optimal for e8
Oops, quickly added:Quote Originally Posted by Bulat Ziganshin View Postyou forget to test it w/o any switches :)
dict -1 74,710,608 bytes
dict -f 54,763,991 bytes
kwc c 54,097,740 bytes
dict -p 53,897,083 bytes
dict 52,812,712 bytes
arc a -mdict 52,796,199 bytes
arc a -mdict:l1000:m39:s26:r1:33m 50,292,278 bytes
See link for timings and memory use:
http://www.metacompressor.com/upload...estfile=enwik8
Also without any switch both crash at enwik9
I also added arc a -mdict and arc a -mdict:l1000:m39:s26:r1:33m at enwik8 and enwik9:
http://www.metacompressor.com/upload...testfile=file4
When I test Dict and Arc -mdict at rafale.bmp I get not so good compression results compared with Kwc:
4,066,106 bytes dict -p
3,487,657 bytes dict -1
3,468,906 bytes arc a -mdict
3,468,672 bytes dict
3,259,972 bytes arc a -mdict:l1000:m39:s26:r1:33m
1,849,159 bytes kwc c
See link for timings and memory use:
http://www.metacompressor.com/upload...estfile=rafale
What kind of dictionary does Dict build?
Last edited by Sportman; 20th January 2010 at 17:49.
Original file:
Mike (14th October 2024)