Dynamic record model
Text/utf detection
dynamic dict preprocess (modified version of XWRT)
0xX0X0X0X0... to 0xXXXX... filter for text
Code:
enwik8 test
(option -3)   compressed   time
paq8pxd       20337801     1229
paq8px_v69    20794944     1797
(option -7)   compressed   time
paq8pxd       17596170     11464
paq8px_v69    17939198     15363
Last edited by kaitz; 22nd January 2012 at 12:42. Reason: test result
KZo
For text? OK, thank you, I will give it a try!
Thanks!!
Calgary corpus results (14 files to 1 archive). Unfortunately, results are worse than paq8px_v69
Code:
D:\>paq8px -7 calgary-7 c:\res\calgary\*
Creating archive calgary-7.paq8px with 14 file(s)...
1/14 Filename: c:/res/calgary/BIB (111261 bytes)
Block segmentation:
0 | default | 111261 bytes [0 - 111260]
Compressed from 111261 to 20635 bytes.
2/14 Filename: c:/res/calgary/BOOK1 (768771 bytes)
Block segmentation:
0 | default | 768771 bytes [0 - 768770]
Compressed from 768771 to 191178 bytes.
3/14 Filename: c:/res/calgary/BOOK2 (610856 bytes)
Block segmentation:
0 | default | 610856 bytes [0 - 610855]
Compressed from 610856 to 116216 bytes.
4/14 Filename: c:/res/calgary/GEO (102400 bytes)
Block segmentation:
0 | default | 102400 bytes [0 - 102399]
Compressed from 102400 to 44094 bytes.
5/14 Filename: c:/res/calgary/NEWS (377109 bytes)
Block segmentation:
0 | default | 377109 bytes [0 - 377108]
Compressed from 377109 to 82789 bytes.
6/14 Filename: c:/res/calgary/OBJ1 (21504 bytes)
Block segmentation:
0 | default | 21504 bytes [0 - 21503]
Compressed from 21504 to 7280 bytes.
7/14 Filename: c:/res/calgary/OBJ2 (246814 bytes)
Block segmentation:
0 | default | 246814 bytes [0 - 246813]
Compressed from 246814 to 44111 bytes.
8/14 Filename: c:/res/calgary/PAPER1 (53161 bytes)
Block segmentation:
0 | default | 53161 bytes [0 - 53160]
Compressed from 53161 to 10389 bytes.
9/14 Filename: c:/res/calgary/PAPER2 (82199 bytes)
Block segmentation:
0 | default | 82199 bytes [0 - 82198]
Compressed from 82199 to 16461 bytes.
10/14 Filename: c:/res/calgary/PIC (513216 bytes)
Block segmentation:
0 | default | 513216 bytes [0 - 513215]
Compressed from 513216 to 30828 bytes.
11/14 Filename: c:/res/calgary/PROGC (39611 bytes)
Block segmentation:
0 | default | 39611 bytes [0 - 39610]
Compressed from 39611 to 8218 bytes.
12/14 Filename: c:/res/calgary/PROGL (71646 bytes)
Block segmentation:
0 | default | 71646 bytes [0 - 71645]
Compressed from 71646 to 9503 bytes.
13/14 Filename: c:/res/calgary/PROGP (49379 bytes)
Block segmentation:
0 | default | 49379 bytes [0 - 49378]
Compressed from 49379 to 6688 bytes.
14/14 Filename: c:/res/calgary/TRANS (93695 bytes)
Block segmentation:
0 | default | 93695 bytes [0 - 93694]
Compressed from 93695 to 9965 bytes.
Total 3141622 bytes compressed to 598550 bytes.
Time 458.69 sec, used 811717915 bytes of memory

D:\>paq8pxd -7 calgary-7d c:\res\calgary\*
Creating archive calgary-7d.paq8pxd with 14 file(s)...
File list (169 bytes)
Compressed from 169 to 100 bytes.
1/14 Filename: c:/res/calgary/BIB (111261 bytes)
Block segmentation:
0 | text | 111261 bytes [0 - 111260] (wrt: 85981)
Compressed from 111261 to 20801 bytes.
2/14 Filename: c:/res/calgary/BOOK1 (768771 bytes)
Block segmentation:
0 | text | 173891 bytes [0 - 173890] (wrt: 131001)
1 | default | 1 bytes [173891 - 173891]
2 | text | 249971 bytes [173892 - 423862] (wrt: 183473)
3 | default | 1 bytes [423863 - 423863]
4 | text | 344907 bytes [423864 - 768770] (wrt: 241344)
Compressed from 768771 to 197989 bytes.
3/14 Filename: c:/res/calgary/BOOK2 (610856 bytes)
Block segmentation:
0 | text | 610856 bytes [0 - 610855] (wrt: 375627)
Compressed from 610856 to 119444 bytes.
4/14 Filename: c:/res/calgary/GEO (102400 bytes)
Block segmentation:
0 | default | 102400 bytes [0 - 102399]
Compressed from 102400 to 44145 bytes.
5/14 Filename: c:/res/calgary/NEWS (377109 bytes)
Block segmentation:
0 | text | 314908 bytes [0 - 314907] (wrt: 238603)
1 | default | 1 bytes [314908 - 314908]
2 | text | 3959 bytes [314909 - 318867]
3 | default | 1 bytes [318868 - 318868]
4 | text | 58240 bytes [318869 - 377108] (wrt: 49491)
Compressed from 377109 to 88660 bytes.
6/14 Filename: c:/res/calgary/OBJ1 (21504 bytes)
Block segmentation:
0 | default | 21504 bytes [0 - 21503]
Compressed from 21504 to 7341 bytes.
7/14 Filename: c:/res/calgary/OBJ2 (246814 bytes)
Block segmentation:
0 | default | 246814 bytes [0 - 246813]
Compressed from 246814 to 44003 bytes.
8/14 Filename: c:/res/calgary/PAPER1 (53161 bytes)
Block segmentation:
0 | text | 53161 bytes [0 - 53160] (wrt: 40392)
Compressed from 53161 to 11592 bytes.
9/14 Filename: c:/res/calgary/PAPER2 (82199 bytes)
Block segmentation:
0 | text | 82199 bytes [0 - 82198] (wrt: 59795)
Compressed from 82199 to 18096 bytes.
10/14 Filename: c:/res/calgary/PIC (513216 bytes)
Block segmentation:
0 | default | 513216 bytes [0 - 513215]
Compressed from 513216 to 38731 bytes.
11/14 Filename: c:/res/calgary/PROGC (39611 bytes)
Block segmentation:
0 | text | 39611 bytes [0 - 39610] (wrt: 31184)
Compressed from 39611 to 8748 bytes.
12/14 Filename: c:/res/calgary/PROGL (71646 bytes)
Block segmentation:
0 | text | 71646 bytes [0 - 71645] (wrt: 52840)
Compressed from 71646 to 9693 bytes.
13/14 Filename: c:/res/calgary/PROGP (49379 bytes)
Block segmentation:
0 | text | 49379 bytes [0 - 49378] (wrt: 36331)
Compressed from 49379 to 6828 bytes.
14/14 Filename: c:/res/calgary/TRANS (93695 bytes)
Block segmentation:
0 | default | 93695 bytes [0 - 93694]
Compressed from 93695 to 10238 bytes.
Total 3141622 bytes compressed to 626419 bytes.
Time 442.42 sec, used 812319808 bytes of memory
Hi there,
I am currently running paq8pxd -7 on my testsets. So far, the results are not better on textual data.
Here is the output for the wikipedia testset:
1/1 Filename: wiki.tar (1000009216 bytes)
Block segmentation:
0 | default | 1024 bytes [0 - 1023]
1 | utf-8 | 100000000 bytes [1024 - 100001023] (wrt: 75673565)
2 | default | 768 bytes [100001024 - 100001791]
3 | utf-8 | 100000000 bytes [100001792 - 200001791] (wrt: 67339901)
4 | default | 768 bytes [200001792 - 200002559]
5 | utf-8 | 100000000 bytes [200002560 - 300002559] (wrt: 61950819)
6 | default | 768 bytes [300002560 - 300003327]
7 | utf-8 | 100000000 bytes [300003328 - 400003327] (wrt: 65501064)
8 | default | 768 bytes [400003328 - 400004095]
9 | utf-8 | 100000000 bytes [400004096 - 500004095] (wrt: 65785770)
10 | default | 768 bytes [500004096 - 500004863]
11 | utf-8 | 99999999 bytes [500004864 - 600004862] (wrt: 6164441
12 | default | 769 bytes [600004863 - 600005631]
13 | utf-8 | 100000000 bytes [600005632 - 700005631] (wrt: 66960493)
14 | default | 768 bytes [700005632 - 700006399]
15 | utf-8 | 100000000 bytes [700006400 - 800006399] (wrt: 85100616)
16 | default | 768 bytes [800006400 - 800007167]
17 | utf-8 | 99999998 bytes [800007168 - 900007165] (wrt: 70613645)
18 | default | 770 bytes [900007166 - 900007935]
19 | utf-8 | 100000000 bytes [900007936 - 1000007935]
20 | default | 1280 bytes [1000007936 - 1000009215]
Compressed from 1000009216 to 143474038 bytes.
Total 1000009216 bytes compressed to 143474071 bytes.
Time 99227.93 sec, used 812320083 bytes of memory
@Stephan
Each wrt block has its own dictionary. This is probably the cause.
Updated version. No progname change. Previous attempt is obsolete.
On enwik8, compression time is about the same as with paq8p3.
The attachment has some test results. And yes, drt+px_v69 has better results on enwik8 than pxd.
Code:
opt -7        Compression                     Time
              px_v69     pxd        diff      px_v69  pxd
enwik6        207610     206343     1267      155     101
world95.txt   351923     350288     1635      451     224
calgary.tar   598118     607317     -9199     457     371
enwik8        17939198   17511910   427288    15363   8238
vlcfile       1634624    1632802    1822      3004    1676
EDIT:
Tested on another PC (Core2Duo T8300 2.4GHz, 2GB RAM).
Code:
paq8pxd -7 enwik9 144773408 63302
Code:
paq8pxd -8 enwik8 17300285 8137(sec) 1626035957(mem)
Last edited by kaitz; 1st February 2012 at 16:05. Reason: more test results
KZo
Hi Kaido,
v1 compresses slightly better.
1/1 Filename: wiki.tar (1000009216 bytes)
Block segmentation:
0 | default | 1024 bytes [0 - 1023]
1 | utf-8 | 100000000 bytes [1024 - 100001023] (wrt: 77295703)
2 | default | 768 bytes [100001024 - 100001791]
3 | utf-8 | 100000000 bytes [100001792 - 200001791] (wrt: 69018975)
4 | default | 768 bytes [200001792 - 200002559]
5 | utf-8 | 100000000 bytes [200002560 - 300002559] (wrt: 64253847)
6 | default | 768 bytes [300002560 - 300003327]
7 | utf-8 | 100000000 bytes [300003328 - 400003327] (wrt: 67024667)
8 | default | 768 bytes [400003328 - 400004095]
9 | utf-8 | 100000000 bytes [400004096 - 500004095] (wrt: 67571240)
10 | default | 768 bytes [500004096 - 500004863]
11 | utf-8 | 99999999 bytes [500004864 - 600004862] (wrt: 63552163)
12 | default | 769 bytes [600004863 - 600005631]
13 | utf-8 | 100000000 bytes [600005632 - 700005631] (wrt: 6880379
14 | default | 768 bytes [700005632 - 700006399]
15 | utf-8 | 100000000 bytes [700006400 - 800006399] (wrt: 86301047)
16 | default | 768 bytes [800006400 - 800007167]
17 | utf-8 | 99999998 bytes [800007168 - 900007165] (wrt: 72774055)
18 | default | 770 bytes [900007166 - 900007935]
19 | utf-8 | 100000000 bytes [900007936 - 1000007935]
20 | default | 1280 bytes [1000007936 - 1000009215]
Compressed from 1000009216 to 143021889 bytes.
But it doesn't seem to detect 24-bit images; only 8-bit images seem to be detected:
1/1 Filename: bmp2.tar (633510400 bytes)
Block segmentation:
0 | default | 1024 bytes [0 - 1023]
1 | hdr | 17 bytes [1024 - 1040]
2 | 8b-image | 6291456 bytes [1041 - 6292496] (width: 3072)
3 | default | 18876399 bytes [6292497 - 25168895]
4 | hdr | 17 bytes [25168896 - 25168912]
5 | 8b-image | 39052992 bytes [25168913 - 64221904] (width: 7216)
6 | default | 117160751 bytes [64221905 - 181382655]
7 | hdr | 17 bytes [181382656 - 181382672]
8 | 8b-image | 27700400 bytes [181382673 - 209083072] (width: 608
9 | default | 83103039 bytes [209083073 - 292186111]
Compressing... 41.54%
The "// Detect .pbm .pgm .ppm image" code fails on enwik9 at offset 435132165 (24-bit header).
KZo
In my tests I found that PAQ compresses better when:
- Files with the same extension are compressed together, and the compressor moves on to another extension only after compressing all the files with the current one.
- Within the list of files with the same extension, the files are in size order: the largest file first and the smallest file last.
The latest versions of PAQ give no option to change the order of the input files; they just take a folder and compress its files in alphabetical order.
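The ordering described above can be prepared outside the compressor. The following is only a sketch: the function name and the sample file list are made up for illustration, and whether a given PAQ build honours the order it is given is not something this thread confirms.

```python
import os

def paq_input_order(files):
    """Order (path, size) pairs: grouped by extension, largest file first.

    `files` is a list of (path, size_in_bytes) tuples; the return value
    is the list of paths in the suggested compression order.
    """
    def key(item):
        path, size = item
        ext = os.path.splitext(path)[1].lower()
        # Sort by extension, then by descending size, then by name
        # so that ties are resolved deterministically.
        return (ext, -size, path)
    return [path for path, _ in sorted(files, key=key)]

# Hypothetical file list, for illustration only.
example = [("a.txt", 100), ("b.bmp", 5000), ("c.txt", 900), ("d.bmp", 200)]
ordered = paq_input_order(example)
print(ordered)
```

The resulting list could then be passed to the compressor one file at a time, or used to build a tar archive in that order.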
PAQ8pxd_v1 compresses the bitmap testset about 30 MB worse and does not detect all 24-bit images.
There is no default data in this test set.
1/1 Filename: bmp2.tar (633510400 bytes)
Block segmentation:
0 | default | 1024 bytes [0 - 1023]
1 | hdr | 17 bytes [1024 - 1040]
2 | 8b-image | 6291456 bytes [1041 - 6292496] (width: 3072)
3 | default | 18876399 bytes [6292497 - 25168895]
4 | hdr | 17 bytes [25168896 - 25168912]
5 | 8b-image | 39052992 bytes [25168913 - 64221904] (width: 7216)
6 | default | 117160751 bytes [64221905 - 181382655]
7 | hdr | 17 bytes [181382656 - 181382672]
8 | 8b-image | 27700400 bytes [181382673 - 209083072] (width: 608
9 | default | 83103039 bytes [209083073 - 292186111]
10 | hdr | 17 bytes [292186112 - 292186128]
11 | 8b-image | 11130701 bytes [292186129 - 303316829] (width: 2749)
12 | default | 33393314 bytes [303316830 - 336710143]
13 | hdr | 17 bytes [336710144 - 336710160]
14 | 8b-image | 6016000 bytes [336710161 - 342726160] (width: 2000)
15 | default | 15999209 bytes [342726161 - 358725369]
16 | hdr | 18 bytes [358725370 - 358725387]
17 | 8b-image | 2048 bytes [358725388 - 358727435] (width: 512)
18 | default | 498916 bytes [358727436 - 359226351]
19 | hdr | 18 bytes [359226352 - 359226369]
20 | 8b-image | 5658 bytes [359226370 - 359232027] (width: 1)
21 | default | 518977 bytes [359232028 - 359751004]
22 | hdr | 18 bytes [359751005 - 359751022]
23 | 24b-image | 57600 bytes [359751023 - 359808622] (width: 15)
24 | default | 263779 bytes [359808623 - 360072401]
25 | hdr | 18 bytes [360072402 - 360072419]
26 | 24b-image | 30957768 bytes [360072420 - 391030187] (width: 3897)
27 | default | 12457556 bytes [391030188 - 403487743]
28 | hdr | 17 bytes [403487744 - 403487760]
29 | 8b-image | 7375872 bytes [403487761 - 410863632] (width: 3136)
30 | default | 22129647 bytes [410863633 - 432993279]
31 | hdr | 17 bytes [432993280 - 432993296]
32 | 8b-image | 3429216 bytes [432993297 - 436422512] (width: 226
33 | default | 10289295 bytes [436422513 - 446711807]
34 | hdr | 17 bytes [446711808 - 446711824]
35 | 8b-image | 6291456 bytes [446711825 - 453003280] (width: 3072)
36 | default | 18876399 bytes [453003281 - 471879679]
37 | hdr | 17 bytes [471879680 - 471879696]
38 | 8b-image | 6016000 bytes [471879697 - 477895696] (width: 300
39 | default | 18050031 bytes [477895697 - 495945727]
40 | hdr | 17 bytes [495945728 - 495945744]
41 | 8b-image | 6016000 bytes [495945745 - 501961744] (width: 300
42 | default | 18050031 bytes [501961745 - 520011775]
43 | hdr | 17 bytes [520011776 - 520011792]
44 | 8b-image | 7375872 bytes [520011793 - 527387664] (width: 3136)
45 | default | 22129647 bytes [527387665 - 549517311]
46 | hdr | 17 bytes [549517312 - 549517328]
47 | 8b-image | 7375872 bytes [549517329 - 556893200] (width: 3136)
48 | default | 11772012 bytes [556893201 - 568665212]
49 | hdr | 18 bytes [568665213 - 568665230]
50 | 24b-image | 9984 bytes [568665231 - 568675214] (width: 9984)
51 | default | 10347633 bytes [568675215 - 579022847]
52 | hdr | 17 bytes [579022848 - 579022864]
53 | 8b-image | 12121088 bytes [579022865 - 591143952] (width: 4256)
54 | default | 36365295 bytes [591143953 - 627509247]
55 | hdr | 17 bytes [627509248 - 627509264]
56 | 8b-image | 6000000 bytes [627509265 - 633509264] (width: 3000)
57 | default | 1135 bytes [633509265 - 633510399]
Compressed from 633510400 to 269355793 bytes.
Total 633510400 bytes compressed to 269355825 bytes.
Time 64796.59 sec, used 881831523 bytes of memory
This brings up a question I have: if the //Detect code fails to match any known file stream (PGM, BMP, etc.), would anyone want a mode that, instead of using default mode, falls back to an uncompressed image format such as PPM or BMP? This may not be possible, since a stream with an odd length might not be a valid bitmap stream.
Quote: Originally Posted by BetaTester
In my tests I found that PAQ, compression becomes greater when:
- Files with the same extension are compressed together, and the compressor goes to another extension, only after compressing all the files in a specific extension
Image detection is back, so do not try to compress enwik9.
Code:
enwik8
-7 17045653 Time 9428.17 sec, used 853029829 bytes of memory
-8 16848214 Time 9535.25 sec, used 1658336197 bytes of memory
KZo
I posted your results. http://mattmahoney.net/dc/text.html#1448
I wonder if with -8 you might be able to move up to the #3 spot.
For detection, why not use the file extension first?
This image won't be seen as a JPEG:
Name: img.jpg, Size: 2.6 KB
But this image would be seen as a JPEG image:
Name: img (1).jpg, Size: 2.6 KB
And if you want to compress these images:
"
Files list <14 bytes>
Compressed from 14 to 17 bytes.
"
Maybe it would be possible to skip compression when the compressed result is not smaller?
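The "store it if compression does not help" idea can be sketched as follows. This is not how paq8pxd works; zlib merely stands in for the real compressor, and the one-byte method flag and the names `pack`/`unpack` are assumptions made for the example.

```python
import zlib

STORED, DEFLATED = 0, 1  # hypothetical one-byte method flags

def pack(data: bytes) -> bytes:
    """Compress `data`, but store it verbatim when compression does not help."""
    packed = zlib.compress(data, 9)
    if len(packed) < len(data):
        return bytes([DEFLATED]) + packed
    return bytes([STORED]) + data

def unpack(blob: bytes) -> bytes:
    """Invert pack() by looking at the method flag in the first byte."""
    method, body = blob[0], blob[1:]
    return zlib.decompress(body) if method == DEFLATED else body

small = b"img.jpg\nb.jpg\n"   # a 14-byte file list, like the one in the post
big = b"a" * 1000
print(len(pack(small)), len(pack(big)))
```

Note that the flag itself costs one byte, so a stored 14-byte block still takes 15 bytes; the gain is only that the expansion is bounded.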
New
- Modified im8model (faster/slightly better in my tests)
- base64 in e-mails (uses recursion; yes, the transform can fail)
- fixed the enwik9 image problem (I hope)
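One plausible reason a base64 transform "can fail" is that the encoded text is not in a form the decoder can reproduce exactly (unusual line length, different line endings, stray characters). A minimal sketch of a round-trip check, assuming 76-character lines ending in CRLF; these layout rules and the function names are assumptions for illustration, not paq8pxd's actual logic:

```python
import base64

def reencode(raw: bytes, line_len: int = 76) -> bytes:
    """Encode `raw` as base64 with fixed-length lines ending in CRLF."""
    b64 = base64.b64encode(raw)
    lines = [b64[i:i + line_len] for i in range(0, len(b64), line_len)]
    return b"\r\n".join(lines) + b"\r\n"

def try_base64_transform(encoded: bytes, line_len: int = 76):
    """Return the decoded bytes only if re-encoding reproduces the input.

    Returns None when the block cannot be restored bit-exactly, in which
    case the caller would leave the block untransformed.
    """
    try:
        raw = base64.b64decode(b"".join(encoded.split()), validate=True)
    except ValueError:  # binascii.Error is a subclass of ValueError
        return None
    return raw if reencode(raw, line_len) == encoded else None

payload = bytes(range(100))
mail_block = reencode(payload)
print(try_base64_transform(mail_block) == payload)
```

A block that uses bare LF line endings decodes fine but fails the comparison, so it would be left alone rather than transformed lossily.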
Code:
option -7
zone_plate.pgm 6000017   Compressed   Time
paq8px_v69               404964       238
paq8pxd_v3               355834       225
hdr.pgm 6291473
paq8px_v69               1556167      178
paq8pxd_v3               1553288      173
bridge.pgm was about 1700 bytes larger with paq8pxd_v3.
Code:
Thunderbird inbox 30118877 bytes
option -7
paq8pxd_v3 17364195 bytes, time 4207.49 sec, 827511670 mem
paq8px_v69 17677583 bytes, time 5001.42 sec, 811820566 mem
Code:
paq8pxd_v3 -8 enwik8 16847903 bytes Time 8300 sec, used 1658336197 mem
paq8pxd_v3 -8 enwik9 136777893 bytes Time 82822 sec, used 1658336197 mem
paq8pxd_v3 -7 enwik8 17045354 bytes Time 8023 sec, used 853029829 mem
paq8pxd_v3 -7 enwik9 140110094 bytes Time 80069 sec, used 853029829 mem
Last edited by kaitz; 26th February 2012 at 20:31. Reason: base64 test & enwik8/9 tests
KZo
Sample file has multi-base64 encoded data. From web.
- Added 4bit bmp
- base64 fixes
- other fixes
- combined wrt files to one
- etc.
KZo
Is it possible to add 32-bit image filters?
v3 results are posted to http://mattmahoney.net/dc/text.html#1368
Somehow this escaped my attention when you released it. It is now #3, beating lpaq9m.
Anyway, if you want to test v4 on enwik9 I will post it too. I am testing on silesia. So far it is 1958K on dickens, beating paq8px_v69.
Edit: paq8pxd_v4 -8 takes the top position on the silesia benchmark. http://mattmahoney.net/dc/silesia.html
Compression took 15 hours on a 2 GHz T3200. Testing decompression now.
Compression was better on most files but somewhat worse on samba. Here is the output.
Code:
D:\silesia>for %i in (*.) do paq8pxd_v4 -8 %i

D:\silesia>paq8pxd_v4 -8 dickens
Creating archive dickens.paq8pxd with 1 file(s)...
File list (18 bytes)
Compressed from 18 to 20 bytes.
1/1 Filename: dickens (10192446 bytes)
Block segmentation:
0 | text | 10192446 bytes [0 - 10192445] (wrt: 6006941)
Compressed from 10192446 to 1958629 bytes.
Total 10192446 bytes compressed to 1958659 bytes.
Time 1124.14 sec, used 1633424260 bytes of memory

D:\silesia>paq8pxd_v4 -8 mozilla
Creating archive mozilla.paq8pxd with 1 file(s)...
File list (18 bytes)
Compressed from 18 to 20 bytes.
1/1 Filename: mozilla (51220480 bytes)
Block segmentation:
0 | default | 16003634 bytes [0 - 16003633]
1 | text | 565152 bytes [16003634 - 16568785] (wrt: 392572)
2 | default | 33443184 bytes [16568786 - 50011969]
3 | utf-8 | 575462 bytes [50011970 - 50587431] (wrt: 468965)
4 | default | 51416 bytes [50587432 - 50638847]
5 | jpeg | 9407 bytes [50638848 - 50648254]
6 | default | 833 bytes [50648255 - 50649087]
7 | jpeg | 49629 bytes [50649088 - 50698716]
8 | default | 547 bytes [50698717 - 50699263]
9 | hdr | 44 bytes [50699264 - 50699307]
10 | audio | 27760 bytes [50699308 - 50727067] (8b mono)
11 | default | 493412 bytes [50727068 - 51220479]
Compressed from 51220480 to 10229462 bytes.
Total 51220480 bytes compressed to 10229492 bytes.
Time 9270.58 sec, used 1862708696 bytes of memory

D:\silesia>paq8pxd_v4 -8 mr
Creating archive mr.paq8pxd with 1 file(s)...
File list (12 bytes)
Compressed from 12 to 15 bytes.
1/1 Filename: mr (9970564 bytes)
Block segmentation:
0 | default | 9970564 bytes [0 - 9970563]
Compressed from 9970564 to 2060422 bytes.
Total 9970564 bytes compressed to 2060447 bytes.
Time 1349.45 sec, used 1565701897 bytes of memory

D:\silesia>paq8pxd_v4 -8 nci
Creating archive nci.paq8pxd with 1 file(s)...
File list (14 bytes)
Compressed from 14 to 16 bytes.
1/1 Filename: nci (33553445 bytes)
Block segmentation:
0 | text | 33553445 bytes [0 - 33553444]
Compressed from 33553445 to 923150 bytes.
Total 33553445 bytes compressed to 923176 bytes.
Time 4439.04 sec, used 1633424264 bytes of memory

D:\silesia>paq8pxd_v4 -8 ooffice
Creating archive ooffice.paq8pxd with 1 file(s)...
File list (17 bytes)
Compressed from 17 to 19 bytes.
1/1 Filename: ooffice (6152192 bytes)
Block segmentation:
0 | default | 4228 bytes [0 - 4227]
1 | exe | 5012819 bytes [4228 - 5017046]
2 | default | 26830 bytes [5017047 - 5043876]
3 | exe | 253183 bytes [5043877 - 5297059]
4 | default | 855132 bytes [5297060 - 6152191]
Compressed from 6152192 to 1418239 bytes.
Total 6152192 bytes compressed to 1418268 bytes.
Time 1118.72 sec, used 1582557220 bytes of memory

D:\silesia>paq8pxd_v4 -8 osdb
Creating archive osdb.paq8pxd with 1 file(s)...
File list (15 bytes)
Compressed from 15 to 17 bytes.
1/1 Filename: osdb (10085684 bytes)
Block segmentation:
0 | default | 10085684 bytes [0 - 10085683]
Compressed from 10085684 to 2069934 bytes.
Total 10085684 bytes compressed to 2069961 bytes.
Time 1776.86 sec, used 1565701895 bytes of memory

D:\silesia>paq8pxd_v4 -8 reymont
Creating archive reymont.paq8pxd with 1 file(s)...
File list (17 bytes)
Compressed from 17 to 19 bytes.
1/1 Filename: reymont (6627202 bytes)
Block segmentation:
0 | text | 6501239 bytes [0 - 6501238]
1 | default | 125963 bytes [6501239 - 6627201]
Compressed from 6627202 to 812189 bytes.
Total 6627202 bytes compressed to 812218 bytes.
Time 1064.92 sec, used 1633426308 bytes of memory

D:\silesia>paq8pxd_v4 -8 samba
Creating archive samba.paq8pxd with 1 file(s)...
File list (16 bytes)
Compressed from 16 to 18 bytes.
1/1 Filename: samba (21606400 bytes)
Block segmentation:
0 | default | 279004 bytes [0 - 279003]
1 | text | 1658664 bytes [279004 - 1937667] (wrt: 1040945)
2 | default | 131757 bytes [1937668 - 2069424]
3 | text | 2661772 bytes [2069425 - 4731196] (wrt: 1699529)
4 | default | 1092855 bytes [4731197 - 5824051]
5 | text | 725004 bytes [5824052 - 6549055] (wrt: 562098)
6 | default | 420432 bytes [6549056 - 6969487]
7 | jpeg | 8020 bytes [6969488 - 6977507]
8 | default | 461300 bytes [6977508 - 7438807]
9 | text | 678792 bytes [7438808 - 8117599] (wrt: 554030)
10 | default | 9673 bytes [8117600 - 8127272]
11 | text | 13132289 bytes [8127273 - 21259561] (wrt: 9307955)
12 | default | 346838 bytes [21259562 - 21606399]
Compressed from 21606400 to 2853155 bytes.
Total 21606400 bytes compressed to 2853183 bytes.
Time 2701.86 sec, used 1794596602 bytes of memory

D:\silesia>paq8pxd_v4 -8 sao
Creating archive sao.paq8pxd with 1 file(s)...
File list (13 bytes)
Compressed from 13 to 16 bytes.
1/1 Filename: sao (7251944 bytes)
Block segmentation:
0 | default | 7251944 bytes [0 - 7251943]
Compressed from 7251944 to 3776301 bytes.
Total 7251944 bytes compressed to 3776327 bytes.
Time 1378.77 sec, used 1565701896 bytes of memory

D:\silesia>paq8pxd_v4 -8 webster
Creating archive webster.paq8pxd with 1 file(s)...
File list (18 bytes)
Compressed from 18 to 21 bytes.
1/1 Filename: webster (41458703 bytes)
Block segmentation:
0 | text | 41458703 bytes [0 - 41458702] (wrt: 29889928)
Compressed from 41458703 to 4907154 bytes.
Total 41458703 bytes compressed to 4907185 bytes.
Time 5363.06 sec, used 1633424260 bytes of memory

D:\silesia>paq8pxd_v4 -8 x-ray
Creating archive x-ray.paq8pxd with 1 file(s)...
File list (15 bytes)
Compressed from 15 to 17 bytes.
1/1 Filename: x-ray (8474240 bytes)
Block segmentation:
0 | default | 8474240 bytes [0 - 8474239]
Compressed from 8474240 to 3587948 bytes.
Total 8474240 bytes compressed to 3587975 bytes.
Time 1407.96 sec, used 1565701894 bytes of memory

D:\silesia>paq8pxd_v4 -8 xml
Creating archive xml.paq8pxd with 1 file(s)...
File list (13 bytes)
Compressed from 13 to 15 bytes.
1/1 Filename: xml (5345280 bytes)
Block segmentation:
0 | text | 5345279 bytes [0 - 5345278] (wrt: 3560922)
1 | default | 1 bytes [5345279 - 5345279]
Compressed from 5345280 to 264663 bytes.
Total 5345280 bytes compressed to 264688 bytes.
Time 609.45 sec, used 1633426312 bytes of memory

D:\silesia>dir
Volume in drive D is DATA
Volume Serial Number is 5CE8-C77D
Directory of D:\silesia
04/20/2012 04:23 AM <DIR> .
04/20/2012 04:23 AM <DIR> ..
04/12/2002 01:21 PM 10,192,446 dickens
04/19/2012 08:05 PM 1,958,659 dickens.paq8pxd
05/31/2002 07:50 PM 51,220,480 mozilla
04/19/2012 10:40 PM 10,229,492 mozilla.paq8pxd
03/20/2003 11:12 AM 9,970,564 mr
04/19/2012 11:02 PM 2,060,447 mr.paq8pxd
04/02/2002 10:21 PM 33,553,445 nci
04/20/2012 12:16 AM 923,176 nci.paq8pxd
07/04/2002 05:00 AM 6,152,192 ooffice
04/20/2012 12:35 AM 1,418,268 ooffice.paq8pxd
04/11/2002 06:56 PM 10,085,684 osdb
04/20/2012 01:05 AM 2,069,961 osdb.paq8pxd
04/02/2002 11:40 PM 6,627,202 reymont
04/20/2012 01:22 AM 812,218 reymont.paq8pxd
03/25/2002 02:34 PM 21,606,400 samba
04/20/2012 02:07 AM 2,853,183 samba.paq8pxd
03/24/2002 01:38 AM 7,251,944 sao
04/20/2012 02:30 AM 3,776,327 sao.paq8pxd
03/25/2002 10:39 AM 41,458,703 webster
04/20/2012 04:00 AM 4,907,185 webster.paq8pxd
04/04/2002 02:00 PM 8,474,240 x-ray
04/20/2012 04:23 AM 3,587,975 x-ray.paq8pxd
12/01/2000 12:54 AM 5,345,280 xml
04/20/2012 04:33 AM 264,688 xml.paq8pxd
24 File(s) 246,800,159 bytes
2 Dir(s) 39,701,053,440 bytes free
Edit: decompression checks OK. Decompression took 9 hours. Looking back at compression times, it was also 9 hours, not 15. My bad.
Last edited by Matt Mahoney; 21st April 2012 at 03:46.
Code:
paq8pxd_v4 -8 enwik9
Total 1000000000 bytes compressed to 135027170 bytes.
Time 88409.59 sec, used 1633424261 bytes of memory
paq8pxd_v4 -8 enwik8
Total 100000000 bytes compressed to 16642941 bytes.
Time 8395.20 sec, used 1633424261 bytes of memory
KZo
Nice improvement on LTCB. http://mattmahoney.net/dc/text.html#1350
I also tested on the Silesia benchmark with -5 through -8. http://mattmahoney.net/dc/silesia.html
It seems that having multiple wrt blocks is messing up modeling in 'samba'. Maybe building one dictionary for the whole file will improve results,
for example by building the dictionary from all text blocks and using it for each text block.
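The idea of one dictionary shared by all text blocks can be illustrated with a toy word counter. This is only a sketch under simple assumptions (words are runs of ASCII letters, and a word needs at least two occurrences to earn a dictionary slot); it does not reproduce the XWRT-derived dictionary builder in paq8pxd, and `build_dict` is a made-up name.

```python
import re
from collections import Counter

WORD = re.compile(rb"[A-Za-z]+")

def build_dict(blocks, min_count=2, max_words=256):
    """Build one word dictionary from every block passed in.

    Words are ranked by total frequency across all blocks, so a word that
    is rare inside each block but common overall still gets an entry.
    """
    counts = Counter()
    for block in blocks:
        counts.update(WORD.findall(block))
    frequent = [(w, c) for w, c in counts.items() if c >= min_count]
    frequent.sort(key=lambda wc: (-wc[1], wc[0]))
    return [w for w, _ in frequent[:max_words]]

blocks = [b"the cat and the dog", b"the dog barks"]
shared = build_dict(blocks)                        # one dict for all blocks
per_block = [build_dict([b]) for b in blocks]      # one dict per block
print(shared, per_block)
```

In this toy input the second block gets an empty dictionary on its own, while the shared dictionary covers both "the" and "dog" for it.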
KZo
Please help me. I can compile paq8px, but I can't compile paq8pxd.
When I compile it, I get these errors:
g++ paq8pxd_v4.cpp -DUNIX -DNOASM -O3 -s -march=nocona -O2 -pipe -o paq8pxd_v4
In file included from paq8pxd_v4.cpp:4393:
wrtpre.cpp: In function 'int min(int, int)':
wrtpre.cpp:39: error: redefinition of 'int min(int, int)'
paq8pxd_v4.cpp:547: error: 'int min(int, int)' previously defined here
wrtpre.cpp: In function 'int max(int, int)':
wrtpre.cpp:40: error: redefinition of 'int max(int, int)'
paq8pxd_v4.cpp:548: error: 'int max(int, int)' previously defined here
paq8pxd_v4.cpp: In function 'void compressRecursive(FILE*, long int, Encoder&, char*, int, int, int)':
paq8pxd_v4.cpp:4645: warning: format '%d' expects type 'int', but argument 2 has type 'long int'
paq8pxd_v4.cpp:4645: warning: format '%d' expects type 'int', but argument 2 has type 'long int'
paq8pxd_v4.cpp: In function 'int main(int, char**)':
paq8pxd_v4.cpp:5066: warning: format '%ld' expects type 'long int', but argument 2 has type 'int'
paq8pxd_v4.cpp:5066: warning: format '%ld' expects type 'long int', but argument 2 has type 'int'
paq8pxd_v4.cpp:5072: warning: format '%ld' expects type 'long int', but argument 2 has type 'int'
paq8pxd_v4.cpp:5072: warning: format '%ld' expects type 'long int', but argument 2 has type 'int'
Last edited by Anitatoom; 1st November 2012 at 10:32.
Remove min()/max() from wrtpre.cpp
Thank you :D
I am working on a newer version.
Numbers in wrt preprocessing are treated as a-z.
Results for nci and xml from the Silesia corpus:
Code:
D:\test>paq8pxd_v6.exe -7 nci
Creating archive nci.paq8pxd with 1 file(s)...
File list (14 bytes)
Compressed from 14 to 16 bytes.
1/1 Filename: nci (33553445 bytes)
Block segmentation:
0 | text | 33553445 bytes [0 - 33553444]
Compressed from 33553445 to 850808 bytes.
Total 33553445 bytes compressed to 850834 bytes.
Time 4182.96 sec, used 844903304 bytes of memory

D:\test>paq8pxd_v6.exe -7 xml
Creating archive xml.paq8pxd with 1 file(s)...
File list (13 bytes)
Compressed from 13 to 15 bytes.
1/1 Filename: xml (5345280 bytes)
Block segmentation:
0 | text | 5345279 bytes [0 - 5345278] (wrt: 3546174)
1 | default | 1 bytes [5345279 - 5345279]
Compressed from 5345280 to 264647 bytes.
Total 5345280 bytes compressed to 264672 bytes.
Time 689.62 sec, used 844905352 bytes of memory
There is a mistake in the display of whether text was wrt processed.
Some text files suffer a slight loss of compression.
KZo
That's a huge improvement on nci.
Quote: Originally Posted by Matt Mahoney
That's a huge improvement on nci.
The dictionary mostly contains numbers. I limited numwords to a minimum of 3 bytes (excluding 34, 3g, etc.; allowing f4, k2, etc.).
With this, world95.txt suffers a little less and nci loses about 6 KB.
I am no expert, so I can only guess how much nci can actually be compressed.
KZo
I have so far not found any documentation on the nci file format but I did find http://cactus.nci.nih.gov/ncidb2.1/ which apparently lets you query the same database to show chemical structures by NSC number. For example, the first record in the file is:
Code:
155542
ROtclserve11150011212D 0 0.00000 0.000001049521
31 32 0 0 0 0 0 0 0 0 1 V2000
2.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.0000 0.0000 0.0000 C 0 0 1 0 0 0 0 0 0 0 0 0
3.0000 1.0000 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
3.8660 1.5000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.7321 1.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.5981 1.5000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.5981 2.5000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.7321 3.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.8660 2.5000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.0000 0.0000 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
3.0000 -1.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.1340 -1.5000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.1340 -2.5000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.0000 -3.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.8660 -2.5000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.8660 -1.5000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.0000 0.6200 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
1.3800 0.0000 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
2.0000 -0.6200 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
2.4631 1.3100 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
4.7321 0.3800 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
6.1350 1.1900 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
6.1350 2.8100 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
4.7321 3.6200 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
3.3291 2.8100 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
1.5970 -1.1900 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
1.5970 -2.8100 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
3.0000 -3.6200 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
4.4030 -2.8100 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
4.4030 -1.1900 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
2 3 1 0 0 0 0
3 4 1 0 0 0 0
4 5 2 0 0 0 0
5 6 1 0 0 0 0
6 7 2 0 0 0 0
7 8 1 0 0 0 0
8 9 2 0 0 0 0
4 9 1 0 0 0 0
2 10 1 0 0 0 0
10 11 3 0 0 0 0
2 12 1 0 0 0 0
12 13 2 0 0 0 0
13 14 1 0 0 0 0
14 15 2 0 0 0 0
15 16 1 0 0 0 0
16 17 2 0 0 0 0
12 17 1 0 0 0 0
1 18 1 0 0 0 0
1 19 1 0 0 0 0
1 20 1 0 0 0 0
3 21 1 0 0 0 0
5 22 1 0 0 0 0
6 23 1 0 0 0 0
7 24 1 0 0 0 0
8 25 1 0 0 0 0
9 26 1 0 0 0 0
13 27 1 0 0 0 0
14 28 1 0 0 0 0
15 29 1 0 0 0 0
16 30 1 0 0 0 0
17 31 1 0 0 0 0
M END
> <NSC>
155542
> <CAS_RN>
17424-68-9
> <SMILES>
C[C](NC1=CC=CC=C1)(C#N)C2=CC=CC=C2
> <HASH>
4d5775e8d9fc4fd3
$$$$
The first number (155542) is the NSC number, which you can search for to show the chemical structure.
The numbers 31 32 are the number of atoms and the number of chemical bonds. The next 2 sections have 31 and 32 lines respectively. I don't know what the rest of the line means.
Each line shows (I guess) the x,y,z coordinates of the atom for drawing a 3-D model, followed by the chemical symbol such as C for carbon. I don't know what the rest of the numbers are. I don't know why the z coordinate is always 0 because a carbon atom with 4 single bonds has a tetrahedral structure which should mean that attached atoms would have nonzero z values.
The next 32 lines show the chemical bonds. The first 2 numbers are the numbers of the atoms (from 1 to 31) and the next number is the type of bond (1 for single, 2 for double, etc). These should be predictable by the x,y,z coordinates because they should be at a characteristic distance depending on the atoms, and also by the fact that each atom has a characteristic valence or total number of bonds (4 for C, 3 for N, 2 for O and S, 1 for H, Cl, Br). I don't know what the other numbers are.
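The valence argument can be checked against the first record. The sketch below uses the atom symbols and the 32 bonds exactly as they appear in record 155542; the valence table is the one given in the post and is assumed to apply only to neutral atoms, so charged records (for example ones with an "M CHG" line) would need extra handling.

```python
# Atom symbols of record 155542 in file order (atoms 1..31).
SYMBOLS = ["C", "C", "N"] + ["C"] * 7 + ["N"] + ["C"] * 6 + ["H"] * 14

# (atom_a, atom_b, bond_order) triples copied from the bond block.
BONDS = [
    (1, 2, 1), (2, 3, 1), (3, 4, 1), (4, 5, 2), (5, 6, 1), (6, 7, 2),
    (7, 8, 1), (8, 9, 2), (4, 9, 1), (2, 10, 1), (10, 11, 3), (2, 12, 1),
    (12, 13, 2), (13, 14, 1), (14, 15, 2), (15, 16, 1), (16, 17, 2),
    (12, 17, 1), (1, 18, 1), (1, 19, 1), (1, 20, 1), (3, 21, 1),
    (5, 22, 1), (6, 23, 1), (7, 24, 1), (8, 25, 1), (9, 26, 1),
    (13, 27, 1), (14, 28, 1), (15, 29, 1), (16, 30, 1), (17, 31, 1),
]

# Characteristic valences as listed in the post (neutral atoms only).
VALENCE = {"C": 4, "N": 3, "O": 2, "S": 2, "H": 1, "Cl": 1, "Br": 1}

def bond_order_sums(n_atoms, bonds):
    """Sum the bond orders touching each atom (atoms are numbered from 1)."""
    totals = [0] * (n_atoms + 1)
    for a, b, order in bonds:
        totals[a] += order
        totals[b] += order
    return totals[1:]

sums = bond_order_sums(len(SYMBOLS), BONDS)
mismatches = [i + 1 for i, (sym, total) in enumerate(zip(SYMBOLS, sums))
              if VALENCE[sym] != total]
print(mismatches)
```

For this record every atom matches its expected valence, which supports the idea that a model aware of the running bond-order sums could predict part of the bond table.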
Note that the NSC number appears twice. It looks like they are numbered consecutively as well.
SMILES is a string that gives the chemical structure in a canonical form. Since the previous data gives the chemical structure, this string should be predictable. The string omits H and uses = and # to indicate double and triple bonds and numbers to indicate bonds between distant parts of the molecule (forming rings). See http://en.wikipedia.org/wiki/Simplif...e-entry_system
Edit: Notice that the atoms in the SMILES string appear in the same order as in the list of atoms above. SMILES omits H, so these appear last.
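The ordering observation can be checked for the first record with a very small SMILES tokenizer. This is a sketch only: the regular expression handles bracket atoms and a few unbracketed element symbols, ignores aromatic lowercase atoms, isotopes and other SMILES features, and is not a real parser.

```python
import re

# A bracket atom such as [C] or [N+], or a bare organic-subset symbol.
ATOM = re.compile(r"\[([A-Z][a-z]?)[^\]]*\]|(Cl|Br|[BCNOPSFI])")

def smiles_atoms(smiles):
    """Return the element symbols in the order they appear in the string."""
    return [m.group(1) or m.group(2) for m in ATOM.finditer(smiles)]

SMILES = "C[C](NC1=CC=CC=C1)(C#N)C2=CC=CC=C2"

# Non-hydrogen atoms of record 155542, in atom-block order (atoms 1..17).
HEAVY = ["C", "C", "N"] + ["C"] * 7 + ["N"] + ["C"] * 6

print(smiles_atoms(SMILES) == HEAVY)
```

For this record the two sequences are identical, so once the atom block is known, the element symbols in the SMILES string carry no new information.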
I don't know how the hash is computed, but obviously this information is redundant.
Last edited by Matt Mahoney; 24th December 2012 at 05:11.
Good analysis so far, though there seems to be more - there are some "surprising" records with slightly unexpected variations and SMILES strings, e.g. #156316:
Code:
156316
ROtclserve11150011212D 0 0.00000 0.00000115469 999-99-9
66100 0 0 0 0 0 0 0 0 7 V2000
(snip)table 1(snip)
(snip)table 2(snip)
M CHG 2 37 -1 40 1
M RAD 8 1 2 2 2 3 2 4 2 5 2 6 2 7 2 8 2
M RAD 8 9 2 10 2 11 2 12 2 13 2 14 2 15 2 16 2
M RAD 8 17 2 18 2 19 2 20 2 21 2 22 2 23 2 24 2
M RAD 8 25 2 26 2 27 2 28 2 29 2 30 2 31 2 32 2
M RAD 4 33 2 34 2 35 2 36 2
M END
> <NSC>
156316
> <CAS_RN>
999-99-9
> <SMILES>
[Cl-].CC[N+](CC)(CC)CC.[Cl]1[Nb]234567[Cl][Nb]289%10%11%12[Cl][Nb]138%13%14%15[Cl][Nb]4%13%16%17%18([Cl]5)[Cl][Nb]69%16%19([Cl]7)([Cl]%10)[Cl][Nb]%11%14%17%19([Cl]%15)([Cl]%12)[Cl]%18.[Cl]%20[Nb]%21%22%23%24%25%26[Cl][Nb]%21%27%28%29%30%31[Cl][Nb]%20%22%27%32%33%34[Cl][Nb]%23%32%35%36%37([Cl]%24)[Cl][Nb]%25%28%35%38([Cl]%26)([Cl]%29)[Cl][Nb]%30%33%36%38([Cl]%34)([Cl]%31)[Cl]%37
> <HASH>
dd377c9f0d5029e0
$$$$
Note the fixed 3-digit layout, so that the two "line counts" are not separated by a space ("66100" instead of "66 100"); the "M" table, which usually is "M END" only but is longer here; and the huge SMILES string containing some dots (I don't have a clue what these are for) and coded numbers ("%10"). % seems to initiate a two-digit number. EDIT: OK, '%' preceding a label above 9 is described on the Wikipedia page. There is also an example containing a dot, though I haven't found an explanation yet.
Quote: Originally Posted by Matt Mahoney
I don't know how the hash is computed, but obviously this information is redundant.
It's 64 bits long, but at least it doesn't seem to be a CRC-64 of obvious parts. I tested the first record up to the hash line, without the last linefeed, without the last two linefeeds, with the first NSC line removed, and combinations of those.
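Because the counts are fixed-width fields, they have to be sliced by column rather than split on whitespace. A minimal sketch, assuming the usual MOL V2000 layout of 3-character right-aligned fields; the leading space in the sample lines is assumed, since the forum text has its whitespace collapsed, and `parse_counts` is a made-up name.

```python
def parse_counts(line: str):
    """Read the atom and bond counts from a V2000 counts line.

    The first two fields are 3 characters wide each, so "66" and "100"
    run together as " 66100" when the bond count has three digits.
    """
    n_atoms = int(line[0:3])
    n_bonds = int(line[3:6])
    return n_atoms, n_bonds

print(parse_counts(" 31 32  0  0  0  0  0  0  0  0  1 V2000"))
print(parse_counts(" 66100  0  0  0  0  0  0  0  0  7 V2000"))
```

A whitespace split would read the second line as a single count of 66100, which is why a preprocessor that tokenizes on spaces sees such records as "surprising".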
EDIT: There is some documentation available that describes 64-bit hashcodes named HASHISY and mentions a web resolving service that seems to calculate those hashcodes based on a simplified chemical formula, e.g. "http://cactus.nci.nih.gov/chemical/s...ccccc1/hashisy" returns the hash "3DB0124A3ECF5ECE" as plain MIME text.
This works for simple SMILES strings like #155653, http://cactus.nci.nih.gov/chemical/s...CCOCCO/hashisy and there really seem to be some calculations involved (instead of a simple file/directory lookup) as e.g. the lower case version http://cactus.nci.nih.gov/chemical/s...ccocco/hashisy doesn't return the same hash - on the other hand, HTTP 404 codes are returned in invalid cases like http://cactus.nci.nih.gov/chemical/s...c1cccc/hashisy or http://cactus.nci.nih.gov/chemical/s...nvalid/hashisy
EDIT: To get less off-topic, a dictionary approach seems to help because of the little variety of numbers coming from their meaning (graphical representation of a structural formula). Fractional parts of numbers have a geometric meaning like .8660 (cosine of 30 degrees). Another interesting thing is that this knowledge could be used to improve compression some more - there are pairs of fractional parts that add to 1 (e.g. .1340 and .8660), so only one of them has to be in the dictionary and the other one could be generated automatically.
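The complement idea can be tried on the fractional parts that occur in the first record. The sketch below works in integer units of 0.0001 to avoid floating-point comparisons; the list is copied from the coordinates of record 155542, and how many dictionary entries this would save across the whole file is not something this example measures.

```python
# Distinct fractional parts (in units of 0.0001) seen in record 155542.
FRACTIONS = [0, 8660, 7321, 5981, 5000, 1340, 6200, 3800,
             4631, 3100, 1350, 1900, 8100, 3291, 5970, 4030]

def complement_pairs(fracs, scale=10000):
    """Find pairs (f, scale - f) where both values are present.

    Only the smaller member is listed first, and self-complementary
    values such as .5000 are skipped because they save nothing.
    """
    present = set(fracs)
    return sorted((f, scale - f) for f in present
                  if f < scale - f and scale - f in present)

pairs = complement_pairs(FRACTIONS)
print(pairs)
```

In this record four of the sixteen fractional parts could be derived from their partners, including the .1340/.8660 pair mentioned above.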
Last edited by schnaader; 25th December 2012 at 23:28.
http://schnaader.info
Damn kids. They're all alike.
Did you contact Sebastian Deorowicz? He's the creator of the Silesia compression corpus (SCC) and may be able to get you documentation.
A . means no chemical bond. http://www.daylight.com/dayhtml/doc/...ry.smiles.html gives more details on the SMILES format.