Thread: Paq8pxd dict

21st January 2012, 22:14 #1
kaitz

Member
Join Date: May 2008
Location: Estonia
Posts: 663
Thanks: 268; Thanked 560 Times in 287 Posts

Paq8pxd dict
Dynamic record model
Text/utf detection
dynamic dict preprocess (modified version of XWRT)
0xX0X0X0X0... to 0xXXXX... filter for text

Code:

enwik8 test
(option -3)   compressed   time
paq8pxd       20337801     1229
paq8px_v69    20794944     1797
(option -7)   compressed   time
paq8pxd       17596170     11464
paq8px_v69    17939198     15363

Attached Files

File Type: zip paq8pxd.zip (207.8 KB, 1851 views)
Last edited by kaitz; 22nd January 2012 at 12:42. Reason: test result

KZo
21st January 2012, 22:16 #2
paqfan
Member
Join Date: Jan 2012
Location: Sopianae
Posts: 32
Thanks: 9; Thanked 0 Times in 0 Posts

For texts? Ok, thank U, I will give it a try!
Thanks!!

26th January 2012, 23:31 #3

Matt Mahoney

Expert

Join Date: May 2008
Location: Melbourne, Florida, USA
Posts: 3,271
Thanks: 315; Thanked 841 Times in 506 Posts

Calgary corpus results (14 files to 1 archive). Unfortunately, the results are worse than paq8px_v69: 626419 bytes versus 598550 bytes.

Code:

D:\>paq8px -7 calgary-7 c:\res\calgary\*
Creating archive calgary-7.paq8px with 14 file(s)...
1/14 Filename: c:/res/calgary/BIB (111261 bytes)
Block segmentation:
 0 | default | 111261 bytes [0 - 111260]
Compressed from 111261 to 20635 bytes.
2/14 Filename: c:/res/calgary/BOOK1 (768771 bytes)
Block segmentation:
 0 | default | 768771 bytes [0 - 768770]
Compressed from 768771 to 191178 bytes.
3/14 Filename: c:/res/calgary/BOOK2 (610856 bytes)
Block segmentation:
 0 | default | 610856 bytes [0 - 610855]
Compressed from 610856 to 116216 bytes.
4/14 Filename: c:/res/calgary/GEO (102400 bytes)
Block segmentation:
 0 | default | 102400 bytes [0 - 102399]
Compressed from 102400 to 44094 bytes.
5/14 Filename: c:/res/calgary/NEWS (377109 bytes)
Block segmentation:
 0 | default | 377109 bytes [0 - 377108]
Compressed from 377109 to 82789 bytes.
6/14 Filename: c:/res/calgary/OBJ1 (21504 bytes)
Block segmentation:
 0 | default | 21504 bytes [0 - 21503]
Compressed from 21504 to 7280 bytes.
7/14 Filename: c:/res/calgary/OBJ2 (246814 bytes)
Block segmentation:
 0 | default | 246814 bytes [0 - 246813]
Compressed from 246814 to 44111 bytes.
8/14 Filename: c:/res/calgary/PAPER1 (53161 bytes)
Block segmentation:
 0 | default | 53161 bytes [0 - 53160]
Compressed from 53161 to 10389 bytes.
9/14 Filename: c:/res/calgary/PAPER2 (82199 bytes)
Block segmentation:
 0 | default | 82199 bytes [0 - 82198]
Compressed from 82199 to 16461 bytes.
10/14 Filename: c:/res/calgary/PIC (513216 bytes)
Block segmentation:
 0 | default | 513216 bytes [0 - 513215]
Compressed from 513216 to 30828 bytes.
11/14 Filename: c:/res/calgary/PROGC (39611 bytes)
Block segmentation:
 0 | default | 39611 bytes [0 - 39610]
Compressed from 39611 to 8218 bytes.
12/14 Filename: c:/res/calgary/PROGL (71646 bytes)
Block segmentation:
 0 | default | 71646 bytes [0 - 71645]
Compressed from 71646 to 9503 bytes.
13/14 Filename: c:/res/calgary/PROGP (49379 bytes)
Block segmentation:
 0 | default | 49379 bytes [0 - 49378]
Compressed from 49379 to 6688 bytes.
14/14 Filename: c:/res/calgary/TRANS (93695 bytes)
Block segmentation:
 0 | default | 93695 bytes [0 - 93694]
Compressed from 93695 to 9965 bytes.
Total 3141622 bytes compressed to 598550 bytes.
Time 458.69 sec, used 811717915 bytes of memory
D:\>paq8pxd -7 calgary-7d c:\res\calgary\*
Creating archive calgary-7d.paq8pxd with 14 file(s)...
File list (169 bytes)
Compressed from 169 to 100 bytes.
1/14 Filename: c:/res/calgary/BIB (111261 bytes)
Block segmentation:
 0 | text | 111261 bytes [0 - 111260] (wrt: 85981)
Compressed from 111261 to 20801 bytes.
2/14 Filename: c:/res/calgary/BOOK1 (768771 bytes)
Block segmentation:
 0 | text | 173891 bytes [0 - 173890] (wrt: 131001)
 1 | default | 1 bytes [173891 - 173891]
 2 | text | 249971 bytes [173892 - 423862] (wrt: 183473)
 3 | default | 1 bytes [423863 - 423863]
 4 | text | 344907 bytes [423864 - 768770] (wrt: 241344)
Compressed from 768771 to 197989 bytes.
3/14 Filename: c:/res/calgary/BOOK2 (610856 bytes)
Block segmentation:
 0 | text | 610856 bytes [0 - 610855] (wrt: 375627)
Compressed from 610856 to 119444 bytes.
4/14 Filename: c:/res/calgary/GEO (102400 bytes)
Block segmentation:
 0 | default | 102400 bytes [0 - 102399]
Compressed from 102400 to 44145 bytes.
5/14 Filename: c:/res/calgary/NEWS (377109 bytes)
Block segmentation:
 0 | text | 314908 bytes [0 - 314907] (wrt: 238603)
 1 | default | 1 bytes [314908 - 314908]
 2 | text | 3959 bytes [314909 - 318867]
 3 | default | 1 bytes [318868 - 318868]
 4 | text | 58240 bytes [318869 - 377108] (wrt: 49491)
Compressed from 377109 to 88660 bytes.
6/14 Filename: c:/res/calgary/OBJ1 (21504 bytes)
Block segmentation:
 0 | default | 21504 bytes [0 - 21503]
Compressed from 21504 to 7341 bytes.
7/14 Filename: c:/res/calgary/OBJ2 (246814 bytes)
Block segmentation:
 0 | default | 246814 bytes [0 - 246813]
Compressed from 246814 to 44003 bytes.
8/14 Filename: c:/res/calgary/PAPER1 (53161 bytes)
Block segmentation:
 0 | text | 53161 bytes [0 - 53160] (wrt: 40392)
Compressed from 53161 to 11592 bytes.
9/14 Filename: c:/res/calgary/PAPER2 (82199 bytes)
Block segmentation:
 0 | text | 82199 bytes [0 - 82198] (wrt: 59795)
Compressed from 82199 to 18096 bytes.
10/14 Filename: c:/res/calgary/PIC (513216 bytes)
Block segmentation:
 0 | default | 513216 bytes [0 - 513215]
Compressed from 513216 to 38731 bytes.
11/14 Filename: c:/res/calgary/PROGC (39611 bytes)
Block segmentation:
 0 | text | 39611 bytes [0 - 39610] (wrt: 31184)
Compressed from 39611 to 8748 bytes.
12/14 Filename: c:/res/calgary/PROGL (71646 bytes)
Block segmentation:
 0 | text | 71646 bytes [0 - 71645] (wrt: 52840)
Compressed from 71646 to 9693 bytes.
13/14 Filename: c:/res/calgary/PROGP (49379 bytes)
Block segmentation:
 0 | text | 49379 bytes [0 - 49378] (wrt: 36331)
Compressed from 49379 to 6828 bytes.
14/14 Filename: c:/res/calgary/TRANS (93695 bytes)
Block segmentation:
 0 | default | 93695 bytes [0 - 93694]
Compressed from 93695 to 10238 bytes.
Total 3141622 bytes compressed to 626419 bytes.
Time 442.42 sec, used 812319808 bytes of memory
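
To see where the two runs differ, the per-file sizes can be pulled out of logs like the ones above. This is only a rough sketch: the function name `parse_sizes` is made up for illustration, and the inline sample holds just three files copied from the listings, not the full logs.

```python
import re

# "N/M Filename: path (size bytes)" lines, printed once per input file.
FILE_RE = re.compile(r"^\d+/\d+ Filename: (\S+) \((\d+) bytes\)")
# Per-file result line printed after the block segmentation.
DONE_RE = re.compile(r"^Compressed from (\d+) to (\d+) bytes\.")

def parse_sizes(log_text):
    """Return {filename: compressed_size} for one compressor log."""
    sizes = {}
    current = None
    for line in log_text.splitlines():
        line = line.strip()
        m = FILE_RE.match(line)
        if m:
            # Keep only the last path component, e.g. "BIB".
            current = m.group(1).rsplit("/", 1)[-1]
            continue
        m = DONE_RE.match(line)
        # Result lines seen before any filename (the file list) are skipped.
        if m and current is not None:
            sizes[current] = int(m.group(2))
            current = None
    return sizes

# Three files copied from the two listings above (not the full logs).
PX_LOG = """\
1/14 Filename: c:/res/calgary/BIB (111261 bytes)
Compressed from 111261 to 20635 bytes.
7/14 Filename: c:/res/calgary/OBJ2 (246814 bytes)
Compressed from 246814 to 44111 bytes.
10/14 Filename: c:/res/calgary/PIC (513216 bytes)
Compressed from 513216 to 30828 bytes.
"""
PXD_LOG = """\
File list (169 bytes)
Compressed from 169 to 100 bytes.
1/14 Filename: c:/res/calgary/BIB (111261 bytes)
Compressed from 111261 to 20801 bytes.
7/14 Filename: c:/res/calgary/OBJ2 (246814 bytes)
Compressed from 246814 to 44003 bytes.
10/14 Filename: c:/res/calgary/PIC (513216 bytes)
Compressed from 513216 to 38731 bytes.
"""

px, pxd = parse_sizes(PX_LOG), parse_sizes(PXD_LOG)
for name in px:
    # Positive difference means paq8pxd produced the larger output.
    print(f"{name:6} {px[name]:8} {pxd[name]:8} {pxd[name] - px[name]:+6}")
```

In this sample PIC shows the largest regression (7903 bytes), while OBJ2 comes out slightly smaller with paq8pxd.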


28th January 2012, 17:28 #4
Stephan Busch

Tester
Join Date: May 2008
Location: Bremen, Germany
Posts: 879
Thanks: 476; Thanked 176 Times in 86 Posts

Hi there,

I am currently running paq8pxd -7 on my testsets. So far, the results are not better on textual data.
Here is the output for the wikipedia testset:

1/1 Filename: wiki.tar (1000009216 bytes)
Block segmentation:
0 | default | 1024 bytes [0 - 1023]
1 | utf-8 | 100000000 bytes [1024 - 100001023] (wrt: 75673565)
2 | default | 768 bytes [100001024 - 100001791]
3 | utf-8 | 100000000 bytes [100001792 - 200001791] (wrt: 67339901)
4 | default | 768 bytes [200001792 - 200002559]
5 | utf-8 | 100000000 bytes [200002560 - 300002559] (wrt: 61950819)
6 | default | 768 bytes [300002560 - 300003327]
7 | utf-8 | 100000000 bytes [300003328 - 400003327] (wrt: 65501064)
8 | default | 768 bytes [400003328 - 400004095]
9 | utf-8 | 100000000 bytes [400004096 - 500004095] (wrt: 65785770)
10 | default | 768 bytes [500004096 - 500004863]
11 | utf-8 | 99999999 bytes [500004864 - 600004862] (wrt: 6164441
12 | default | 769 bytes [600004863 - 600005631]
13 | utf-8 | 100000000 bytes [600005632 - 700005631] (wrt: 66960493)
14 | default | 768 bytes [700005632 - 700006399]
15 | utf-8 | 100000000 bytes [700006400 - 800006399] (wrt: 85100616)
16 | default | 768 bytes [800006400 - 800007167]
17 | utf-8 | 99999998 bytes [800007168 - 900007165] (wrt: 70613645)
18 | default | 770 bytes [900007166 - 900007935]
19 | utf-8 | 100000000 bytes [900007936 - 1000007935]
20 | default | 1280 bytes [1000007936 - 1000009215]
Compressed from 1000009216 to 143474038 bytes.

Total 1000009216 bytes compressed to 143474071 bytes.
Time 99227.93 sec, used 812320083 bytes of memory
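
A listing like this can be sanity-checked mechanically: each block should start where the previous one ended, and its size should match its byte range. The sketch below assumes the line layout shown above; `check_segments` is a hypothetical helper, not part of paq8pxd, and the sample is the first four blocks of the listing.

```python
import re

# "index | type | size bytes [start - end]" with an optional trailing note.
SEG_RE = re.compile(r"^\s*(\d+) \| (\S+) \| (\d+) bytes \[(\d+) - (\d+)\]")

def check_segments(listing):
    """Return {block_type: total_bytes}; raise ValueError on a gap,
    an overlap, or a size that does not match its byte range."""
    totals = {}
    expected_start = 0
    for line in listing.splitlines():
        m = SEG_RE.match(line)
        if not m:
            continue  # not a segmentation line
        kind = m.group(2)
        size, start, end = int(m.group(3)), int(m.group(4)), int(m.group(5))
        if start != expected_start or end - start + 1 != size:
            raise ValueError(f"inconsistent block: {line.strip()}")
        totals[kind] = totals.get(kind, 0) + size
        expected_start = end + 1
    return totals

SAMPLE = """\
0 | default | 1024 bytes [0 - 1023]
1 | utf-8 | 100000000 bytes [1024 - 100001023] (wrt: 75673565)
2 | default | 768 bytes [100001024 - 100001791]
3 | utf-8 | 100000000 bytes [100001792 - 200001791] (wrt: 67339901)
"""
totals = check_segments(SAMPLE)
print(totals)
```

Summing by block type also shows how little of the archive falls into the small default blocks between the 100 MB utf-8 blocks.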
31st January 2012, 17:01 #5
kaitz

@Stephan
Each wrt block has its own dictionary. That is probably the cause.

Updated version. The program name has not changed. The previous attempt is obsolete.
On enwik8, compression time is about the same as in paq8p3.

Code:

opt -7        Compression                      Time
              px_v69     pxd        diff       px_v69   pxd
enwik6        207610     206343     1267       155      101
world95.txt   351923     350288     1635       451      224
calgary.tar   598118     607317     -9199      457      371
enwik8        17939198   17511910   427288     15363    8238
vlcfile       1634624    1632802    1822       3004     1676

The attachment has some test results. And yes, drt+px_v69 gets better results on enwik8 than pxd.

EDIT:
Tested on another PC (Core2Duo T8300 2.4GHz, 2GB RAM).

Code:

paq8pxd -7 enwik9 144773408 63302

Code:

paq8pxd -8 enwik8 17300285 8137(sec) 1626035957(mem)

Attached Files

File Type: zip paq8pxd_v1.zip (314.7 KB, 1181 views)
Last edited by kaitz; 1st February 2012 at 16:05. Reason: more test results

1st February 2012, 21:49 #6
Stephan Busch

Hi Kaido,

v1 compresses slightly better.

1/1 Filename: wiki.tar (1000009216 bytes)
Block segmentation:
0 | default | 1024 bytes [0 - 1023]
1 | utf-8 | 100000000 bytes [1024 - 100001023] (wrt: 77295703)
2 | default | 768 bytes [100001024 - 100001791]
3 | utf-8 | 100000000 bytes [100001792 - 200001791] (wrt: 69018975)
4 | default | 768 bytes [200001792 - 200002559]
5 | utf-8 | 100000000 bytes [200002560 - 300002559] (wrt: 64253847)
6 | default | 768 bytes [300002560 - 300003327]
7 | utf-8 | 100000000 bytes [300003328 - 400003327] (wrt: 67024667)
8 | default | 768 bytes [400003328 - 400004095]
9 | utf-8 | 100000000 bytes [400004096 - 500004095] (wrt: 67571240)
10 | default | 768 bytes [500004096 - 500004863]
11 | utf-8 | 99999999 bytes [500004864 - 600004862] (wrt: 63552163)
12 | default | 769 bytes [600004863 - 600005631]
13 | utf-8 | 100000000 bytes [600005632 - 700005631] (wrt: 6880379
14 | default | 768 bytes [700005632 - 700006399]
15 | utf-8 | 100000000 bytes [700006400 - 800006399] (wrt: 86301047)
16 | default | 768 bytes [800006400 - 800007167]
17 | utf-8 | 99999998 bytes [800007168 - 900007165] (wrt: 72774055)
18 | default | 770 bytes [900007166 - 900007935]
19 | utf-8 | 100000000 bytes [900007936 - 1000007935]
20 | default | 1280 bytes [1000007936 - 1000009215]
Compressed from 1000009216 to 143021889 bytes.

But it doesn't seem to detect 24-bit images; only 8-bit images seem to be detected:

1/1 Filename: bmp2.tar (633510400 bytes)
Block segmentation:
0 | default | 1024 bytes [0 - 1023]
1 | hdr | 17 bytes [1024 - 1040]
2 | 8b-image | 6291456 bytes [1041 - 6292496] (width: 3072)
3 | default | 18876399 bytes [6292497 - 25168895]
4 | hdr | 17 bytes [25168896 - 25168912]
5 | 8b-image | 39052992 bytes [25168913 - 64221904] (width: 7216)
6 | default | 117160751 bytes [64221905 - 181382655]
7 | hdr | 17 bytes [181382656 - 181382672]
8 | 8b-image | 27700400 bytes [181382673 - 209083072] (width: 608
9 | default | 83103039 bytes [209083073 - 292186111]
Compressing... 41.54%
1st February 2012, 22:06 #7
kaitz

// Detect .pbm .pgm .ppm image
// fails on enwik9 at offset 435132165 (24-bit header)
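
For context, a binary PGM/PPM header is plain ASCII (magic `P5` or `P6`, then width, height and maxval separated by whitespace), so a detector scanning a large text file can run into lookalike byte sequences. The sketch below is not paq8pxd's detection code; `find_pnm_headers` is a hypothetical helper, comment lines in headers are ignored, and the 2048 height in the example is an assumption chosen to match the 6291456-byte, width-3072 image and 17-byte header seen in the bmp2.tar listing.

```python
import re

# Binary PGM (P5) / PPM (P6) header: magic, width, height, maxval, then a
# single whitespace byte before the pixel data. "# comment" lines are not handled.
PNM_RE = re.compile(rb"P([56])\s+(\d+)\s+(\d+)\s+(\d+)\s")

def find_pnm_headers(data):
    """Yield (offset, bits_per_pixel, width, height, data_offset) candidates."""
    for m in PNM_RE.finditer(data):
        width, height, maxval = int(m.group(2)), int(m.group(3)), int(m.group(4))
        if width == 0 or height == 0 or not 0 < maxval < 256:
            continue  # only 8 bits per sample are considered here
        bpp = 8 if m.group(1) == b"5" else 24
        yield m.start(), bpp, width, height, m.end()

# A genuine 8-bit header: 17 bytes, followed by (here truncated) pixel data.
real = b"P5\n3072 2048\n255\n" + bytes(16)
print(list(find_pnm_headers(real)))

# Ordinary text that happens to look like a 24-bit header.
text = b"see page P6 12 400 37 for details"
print(list(find_pnm_headers(text)))
```

The second call reports a 12x400 24-bit "image" in the middle of a sentence, which is one plausible way a detector of this kind can misfire on text such as enwik9. A real detector would need further plausibility checks, for example on the remaining stream length.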

2nd February 2012, 19:40 #8
BetaTester
Member
Join Date: Dec 2010
Location: Brazil
Posts: 43
Thanks: 0; Thanked 3 Times in 3 Posts

In my tests I found that PAQ compresses better when:

- Files with the same extension are compressed together, and the compressor moves on to another extension only after compressing all the files with a given extension

- Within the list of files with the same extension, the files are in size order: the largest file first and the smallest file last.

These latest versions of PAQ give no option to change the order of the input files; they just go through a folder, compressing the files in alphabetical order.
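
The ordering described above is easy to produce outside the compressor. A minimal sketch, assuming the file list is available as (name, size) pairs; `paq_order` is a made-up name and the sample files are invented for illustration.

```python
import os

def paq_order(files):
    """Sort (name, size) pairs: group by extension, and within each
    extension put the largest file first."""
    def key(item):
        name, size = item
        ext = os.path.splitext(name)[1].lower()
        # Negative size gives descending order; name breaks ties.
        return (ext, -size, name)
    return sorted(files, key=key)

files = [("a.txt", 100), ("b.bmp", 5000), ("c.txt", 900), ("d.bmp", 70)]
ordered = paq_order(files)
print([name for name, _ in ordered])
```

Whether a given PAQ build then keeps the files in the order it is handed is a separate question that this sketch does not address.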
2nd February 2012, 23:40 #9
Stephan Busch

PAQ8pxd_v1 compresses the bitmap testset about 30 MB worse and does not detect all 24-bit images.
Note that this testset contains no default data, so the large default blocks below are presumably image data that went undetected.

1/1 Filename: bmp2.tar (633510400 bytes)
Block segmentation:
0 | default | 1024 bytes [0 - 1023]
1 | hdr | 17 bytes [1024 - 1040]
2 | 8b-image | 6291456 bytes [1041 - 6292496] (width: 3072)
3 | default | 18876399 bytes [6292497 - 25168895]
4 | hdr | 17 bytes [25168896 - 25168912]
5 | 8b-image | 39052992 bytes [25168913 - 64221904] (width: 7216)
6 | default | 117160751 bytes [64221905 - 181382655]
7 | hdr | 17 bytes [181382656 - 181382672]
8 | 8b-image | 27700400 bytes [181382673 - 209083072] (width: 608
9 | default | 83103039 bytes [209083073 - 292186111]
10 | hdr | 17 bytes [292186112 - 292186128]
11 | 8b-image | 11130701 bytes [292186129 - 303316829] (width: 2749)
12 | default | 33393314 bytes [303316830 - 336710143]
13 | hdr | 17 bytes [336710144 - 336710160]
14 | 8b-image | 6016000 bytes [336710161 - 342726160] (width: 2000)
15 | default | 15999209 bytes [342726161 - 358725369]
16 | hdr | 18 bytes [358725370 - 358725387]
17 | 8b-image | 2048 bytes [358725388 - 358727435] (width: 512)
18 | default | 498916 bytes [358727436 - 359226351]
19 | hdr | 18 bytes [359226352 - 359226369]
20 | 8b-image | 5658 bytes [359226370 - 359232027] (width: 1)
21 | default | 518977 bytes [359232028 - 359751004]
22 | hdr | 18 bytes [359751005 - 359751022]
23 | 24b-image | 57600 bytes [359751023 - 359808622] (width: 15)
24 | default | 263779 bytes [359808623 - 360072401]
25 | hdr | 18 bytes [360072402 - 360072419]
26 | 24b-image | 30957768 bytes [360072420 - 391030187] (width: 3897)
27 | default | 12457556 bytes [391030188 - 403487743]
28 | hdr | 17 bytes [403487744 - 403487760]
29 | 8b-image | 7375872 bytes [403487761 - 410863632] (width: 3136)
30 | default | 22129647 bytes [410863633 - 432993279]
31 | hdr | 17 bytes [432993280 - 432993296]
32 | 8b-image | 3429216 bytes [432993297 - 436422512] (width: 226
33 | default | 10289295 bytes [436422513 - 446711807]
34 | hdr | 17 bytes [446711808 - 446711824]
35 | 8b-image | 6291456 bytes [446711825 - 453003280] (width: 3072)
36 | default | 18876399 bytes [453003281 - 471879679]
37 | hdr | 17 bytes [471879680 - 471879696]
38 | 8b-image | 6016000 bytes [471879697 - 477895696] (width: 300
39 | default | 18050031 bytes [477895697 - 495945727]
40 | hdr | 17 bytes [495945728 - 495945744]
41 | 8b-image | 6016000 bytes [495945745 - 501961744] (width: 300
42 | default | 18050031 bytes [501961745 - 520011775]
43 | hdr | 17 bytes [520011776 - 520011792]
44 | 8b-image | 7375872 bytes [520011793 - 527387664] (width: 3136)
45 | default | 22129647 bytes [527387665 - 549517311]
46 | hdr | 17 bytes [549517312 - 549517328]
47 | 8b-image | 7375872 bytes [549517329 - 556893200] (width: 3136)
48 | default | 11772012 bytes [556893201 - 568665212]
49 | hdr | 18 bytes [568665213 - 568665230]
50 | 24b-image | 9984 bytes [568665231 - 568675214] (width: 9984)
51 | default | 10347633 bytes [568675215 - 579022847]
52 | hdr | 17 bytes [579022848 - 579022864]
53 | 8b-image | 12121088 bytes [579022865 - 591143952] (width: 4256)
54 | default | 36365295 bytes [591143953 - 627509247]
55 | hdr | 17 bytes [627509248 - 627509264]
56 | 8b-image | 6000000 bytes [627509265 - 633509264] (width: 3000)
57 | default | 1135 bytes [633509265 - 633510399]
Compressed from 633510400 to 269355793 bytes.

Total 633510400 bytes compressed to 269355825 bytes.
Time 64796.59 sec, used 881831523 bytes of memory
5th February 2012, 19:36 #10
Karhunen
Member
Join Date: Dec 2011
Location: USA
Posts: 91
Thanks: 2; Thanked 1 Time in 1 Post

Originally posted by BetaTester:

In my tests I found that PAQ compresses better when:

- Files with the same extension are compressed together, and the compressor moves on to another extension only after compressing all the files with a given extension

This brings up a question I have: if the //Detect code fails to match any known file stream (PGM, BMP, etc.), would anyone want a mode that, instead of using the default mode, falls back to an uncompressed image format mode like PPM or BMP? This may not be possible, since a stream of odd length could not be a valid bitmap stream.
11th February 2012, 15:04 #11
kaitz

Code:

enwik8
-7 17045653 Time 9428.17 sec, used 853029829 bytes of memory
-8 16848214 Time 9535.25 sec, used 1658336197 bytes of memory

Image detection is back, so do not try to compress enwik9.

Attached Files

File Type: zip paq8px_v2.zip (170.8 KB, 1222 views)
12th February 2012, 03:54 #12
Matt Mahoney

I posted your results. http://mattmahoney.net/dc/text.html#1448
I wonder if with -8 you might be able to move up to the #3 spot.
12th February 2012, 20:15 #13
toino2000
Member
Join Date: May 2010
Location: France
Posts: 4
Thanks: 0; Thanked 0 Times in 0 Posts

For detection, why not use the file extension first?

This image won't be seen as a JPEG:

Name: img.jpg Views: 16841 Size: 2.6 KB

But this image would be seen as a JPEG image:

Name: img (1).jpg Views: 17303 Size: 2.6 KB

And if you want to compress these images:

"
Files list <14 bytes>
Compressed from 14 to 17 bytes.
"

Maybe it would be possible to skip compression when the compressed result is not smaller?
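
The "store it if compression does not help" idea can be sketched with a one-byte mode flag. This is an illustration only: zlib stands in for the PAQ coder, and the flag values and the names `pack`/`unpack` are assumptions, not anything from paq8pxd's archive format.

```python
import zlib

STORED, DEFLATED = 0, 1  # hypothetical one-byte mode flags

def pack(data):
    """Compress `data`, but fall back to storing it verbatim when the
    compressed form is not smaller."""
    packed = zlib.compress(data, 9)
    if len(packed) < len(data):
        return bytes([DEFLATED]) + packed
    return bytes([STORED]) + data

def unpack(blob):
    """Inverse of pack(): the first byte says how the rest was encoded."""
    mode, body = blob[0], blob[1:]
    return zlib.decompress(body) if mode == DEFLATED else body

tiny = b"img.jpg\nimg(1)"   # 14 bytes, like the file list in the post
big = b"abc" * 1000

for sample in (tiny, big):
    blob = pack(sample)
    print(len(sample), "->", len(blob), "stored" if blob[0] == STORED else "deflated")
```

With this scheme the worst case costs exactly one extra byte, at the price of spending one byte on every block, including the ones that do compress.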

24th February 2012, 02:22 #14

kaitz


New:

Modified im8model (faster and slightly better in my tests)
base64 in e-mails (recursive; yes, the transform can fail)
fixed the enwik9 image problem (I hope)
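
A base64 transform is only safe when re-encoding the decoded bytes reproduces the original text exactly, which is presumably the sense in which it "can fail". The sketch below shows that round-trip check under stated assumptions: a fixed 76-character line length and CRLF line endings. `try_base64_transform` is a hypothetical name and this is not the code used in paq8pxd.

```python
import base64

def try_base64_transform(block, line_len=76):
    """Return the decoded bytes if re-encoding them reproduces `block`
    byte for byte; return None so the caller can keep the block as is."""
    try:
        raw = base64.b64decode(b"".join(block.split()), validate=True)
    except ValueError:  # binascii.Error is a subclass of ValueError
        return None
    # Re-encode with the assumed line length and CRLF line endings.
    encoded = base64.b64encode(raw)
    lines = [encoded[i:i + line_len] for i in range(0, len(encoded), line_len)]
    rebuilt = b"\r\n".join(lines) + b"\r\n"
    return raw if rebuilt == block else None

payload = bytes(range(200))
b64 = base64.b64encode(payload)
mail_block = b"\r\n".join(b64[i:i + 76] for i in range(0, len(b64), 76)) + b"\r\n"

print(try_base64_transform(mail_block) == payload)
print(try_base64_transform(mail_block.replace(b"\r\n", b"\n")))
```

The second call returns None because the LF-only block cannot be rebuilt under the assumed CRLF layout, so a compressor would have to leave that block untransformed or record the layout as side information.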

Code:

option -7
zone_plate.pgm (6000017 bytes)
             Compressed  Time
paq8px_v69   404964      238
paq8pxd_v3   355834      225
hdr.pgm (6291473 bytes)
             Compressed  Time
paq8px_v69   1556167     178
paq8pxd_v3   1553288     173
bridge.pgm was about 1700 bytes larger with paq8pxd_v3

Code:

Thunderbird inbox 30118877 bytes
option -7
paq8pxd_v3 17364195 bytes, time 4207.49 sec, 827511670 mem
paq8px_v69 17677583 bytes, time 5001.42 sec, 811820566 mem

Code:

paq8pxd_v3 -8 enwik8 16847903 bytes Time 8300 sec, used 1658336197 mem
paq8pxd_v3 -8 enwik9 136777893 bytes Time 82822 sec, used 1658336197 mem
paq8pxd_v3 -7 enwik8 17045354 bytes Time 8023 sec, used 853029829 mem
paq8pxd_v3 -7 enwik9 140110094 bytes Time 80069 sec, used 853029829 mem

Attached Files

File Type: zip paq8pxd_v3.zip (173.7 KB, 2403 views)

Last edited by kaitz; 26th February 2012 at 20:31. Reason: base64 test & enwik8/9 tests


19th April 2012, 21:10 #15
kaitz

Added 4bit bmp
base64 fixes
other fixes
combined wrt files to one
etc.
The sample file contains multi-base64 encoded data, taken from the web.

Attached Files

File Type: 7z paq8pxd_v4.7z (145.4 KB, 2565 views)
File Type: 7z b64sample.7z (612.5 KB, 1093 views)
19th April 2012, 21:22 #16
nimdamsk
Member
Join Date: Jan 2007
Location: Moscow
Posts: 241
Thanks: 0; Thanked 3 Times in 1 Post

Is it possible to add 32-bit image filters?

20th April 2012, 04:24 #17

Matt Mahoney


v3 results are posted to http://mattmahoney.net/dc/text.html#1368
Somehow this escaped my attention when you released it. It is now #3, beating lpaq9m.
Anyway, if you want to test v4 on enwik9 I will post it too. I am testing on silesia. So far it is 1958K on dickens, beating paq8px_v69.

Edit: paq8pxd_v4 -8 takes the top position on the silesia benchmark. http://mattmahoney.net/dc/silesia.html
Compression took 15 hours on a 2 GHz T3200. Testing decompression now.

Compression was better on most files but somewhat worse on samba. Here is the output.

Code:

D:\silesia>for %i in (*.) do paq8pxd_v4 -8 %i
D:\silesia>paq8pxd_v4 -8 dickens
Creating archive dickens.paq8pxd with 1 file(s)...
File list (18 bytes)
Compressed from 18 to 20 bytes.
1/1 Filename: dickens (10192446 bytes)
Block segmentation:
 0 | text | 10192446 bytes [0 - 10192445] (wrt: 6006941)
Compressed from 10192446 to 1958629 bytes.
Total 10192446 bytes compressed to 1958659 bytes.
Time 1124.14 sec, used 1633424260 bytes of memory
D:\silesia>paq8pxd_v4 -8 mozilla
Creating archive mozilla.paq8pxd with 1 file(s)...
File list (18 bytes)
Compressed from 18 to 20 bytes.
1/1 Filename: mozilla (51220480 bytes)
Block segmentation:
 0 | default | 16003634 bytes [0 - 16003633]
 1 | text | 565152 bytes [16003634 - 16568785] (wrt: 392572)
 2 | default | 33443184 bytes [16568786 - 50011969]
 3 | utf-8 | 575462 bytes [50011970 - 50587431] (wrt: 468965)
 4 | default | 51416 bytes [50587432 - 50638847]
 5 | jpeg | 9407 bytes [50638848 - 50648254]
 6 | default | 833 bytes [50648255 - 50649087]
 7 | jpeg | 49629 bytes [50649088 - 50698716]
 8 | default | 547 bytes [50698717 - 50699263]
 9 | hdr | 44 bytes [50699264 - 50699307]
 10 | audio | 27760 bytes [50699308 - 50727067] (8b mono)
 11 | default | 493412 bytes [50727068 - 51220479]
Compressed from 51220480 to 10229462 bytes.
Total 51220480 bytes compressed to 10229492 bytes.
Time 9270.58 sec, used 1862708696 bytes of memory
D:\silesia>paq8pxd_v4 -8 mr
Creating archive mr.paq8pxd with 1 file(s)...
File list (12 bytes)
Compressed from 12 to 15 bytes.
1/1 Filename: mr (9970564 bytes)
Block segmentation:
 0 | default | 9970564 bytes [0 - 9970563]
Compressed from 9970564 to 2060422 bytes.
Total 9970564 bytes compressed to 2060447 bytes.
Time 1349.45 sec, used 1565701897 bytes of memory
D:\silesia>paq8pxd_v4 -8 nci
Creating archive nci.paq8pxd with 1 file(s)...
File list (14 bytes)
Compressed from 14 to 16 bytes.
1/1 Filename: nci (33553445 bytes)
Block segmentation:
 0 | text | 33553445 bytes [0 - 33553444]
Compressed from 33553445 to 923150 bytes.
Total 33553445 bytes compressed to 923176 bytes.
Time 4439.04 sec, used 1633424264 bytes of memory
D:\silesia>paq8pxd_v4 -8 ooffice
Creating archive ooffice.paq8pxd with 1 file(s)...
File list (17 bytes)
Compressed from 17 to 19 bytes.
1/1 Filename: ooffice (6152192 bytes)
Block segmentation:
 0 | default | 4228 bytes [0 - 4227]
 1 | exe | 5012819 bytes [4228 - 5017046]
 2 | default | 26830 bytes [5017047 - 5043876]
 3 | exe | 253183 bytes [5043877 - 5297059]
 4 | default | 855132 bytes [5297060 - 6152191]
Compressed from 6152192 to 1418239 bytes.
Total 6152192 bytes compressed to 1418268 bytes.
Time 1118.72 sec, used 1582557220 bytes of memory
D:\silesia>paq8pxd_v4 -8 osdb
Creating archive osdb.paq8pxd with 1 file(s)...
File list (15 bytes)
Compressed from 15 to 17 bytes.
1/1 Filename: osdb (10085684 bytes)
Block segmentation:
 0 | default | 10085684 bytes [0 - 10085683]
Compressed from 10085684 to 2069934 bytes.
Total 10085684 bytes compressed to 2069961 bytes.
Time 1776.86 sec, used 1565701895 bytes of memory
D:\silesia>paq8pxd_v4 -8 reymont
Creating archive reymont.paq8pxd with 1 file(s)...
File list (17 bytes)
Compressed from 17 to 19 bytes.
1/1 Filename: reymont (6627202 bytes)
Block segmentation:
 0 | text | 6501239 bytes [0 - 6501238]
 1 | default | 125963 bytes [6501239 - 6627201]
Compressed from 6627202 to 812189 bytes.
Total 6627202 bytes compressed to 812218 bytes.
Time 1064.92 sec, used 1633426308 bytes of memory
D:\silesia>paq8pxd_v4 -8 samba
Creating archive samba.paq8pxd with 1 file(s)...
File list (16 bytes)
Compressed from 16 to 18 bytes.
1/1 Filename: samba (21606400 bytes)
Block segmentation:
 0 | default | 279004 bytes [0 - 279003]
 1 | text | 1658664 bytes [279004 - 1937667] (wrt: 1040945)
 2 | default | 131757 bytes [1937668 - 2069424]
 3 | text | 2661772 bytes [2069425 - 4731196] (wrt: 1699529)
 4 | default | 1092855 bytes [4731197 - 5824051]
 5 | text | 725004 bytes [5824052 - 6549055] (wrt: 562098)
 6 | default | 420432 bytes [6549056 - 6969487]
 7 | jpeg | 8020 bytes [6969488 - 6977507]
 8 | default | 461300 bytes [6977508 - 7438807]
 9 | text | 678792 bytes [7438808 - 8117599] (wrt: 554030)
 10 | default | 9673 bytes [8117600 - 8127272]
 11 | text | 13132289 bytes [8127273 - 21259561] (wrt: 9307955)
 12 | default | 346838 bytes [21259562 - 21606399]
Compressed from 21606400 to 2853155 bytes.
Total 21606400 bytes compressed to 2853183 bytes.
Time 2701.86 sec, used 1794596602 bytes of memory
D:\silesia>paq8pxd_v4 -8 sao
Creating archive sao.paq8pxd with 1 file(s)...
File list (13 bytes)
Compressed from 13 to 16 bytes.
1/1 Filename: sao (7251944 bytes)
Block segmentation:
 0 | default | 7251944 bytes [0 - 7251943]
Compressed from 7251944 to 3776301 bytes.
Total 7251944 bytes compressed to 3776327 bytes.
Time 1378.77 sec, used 1565701896 bytes of memory
D:\silesia>paq8pxd_v4 -8 webster
Creating archive webster.paq8pxd with 1 file(s)...
File list (18 bytes)
Compressed from 18 to 21 bytes.
1/1 Filename: webster (41458703 bytes)
Block segmentation:
 0 | text | 41458703 bytes [0 - 41458702] (wrt: 29889928)
Compressed from 41458703 to 4907154 bytes.
Total 41458703 bytes compressed to 4907185 bytes.
Time 5363.06 sec, used 1633424260 bytes of memory
D:\silesia>paq8pxd_v4 -8 x-ray
Creating archive x-ray.paq8pxd with 1 file(s)...
File list (15 bytes)
Compressed from 15 to 17 bytes.
1/1 Filename: x-ray (8474240 bytes)
Block segmentation:
 0 | default | 8474240 bytes [0 - 8474239]
Compressed from 8474240 to 3587948 bytes.
Total 8474240 bytes compressed to 3587975 bytes.
Time 1407.96 sec, used 1565701894 bytes of memory
D:\silesia>paq8pxd_v4 -8 xml
Creating archive xml.paq8pxd with 1 file(s)...
File list (13 bytes)
Compressed from 13 to 15 bytes.
1/1 Filename: xml (5345280 bytes)
Block segmentation:
 0 | text | 5345279 bytes [0 - 5345278] (wrt: 3560922)
 1 | default | 1 bytes [5345279 - 5345279]
Compressed from 5345280 to 264663 bytes.
Total 5345280 bytes compressed to 264688 bytes.
Time 609.45 sec, used 1633426312 bytes of memory
D:\silesia>dir
 Volume in drive D is DATA
 Volume Serial Number is 5CE8-C77D
 Directory of D:\silesia
04/20/2012 04:23 AM <DIR> .
04/20/2012 04:23 AM <DIR> ..
04/12/2002 01:21 PM 10,192,446 dickens
04/19/2012 08:05 PM 1,958,659 dickens.paq8pxd
05/31/2002 07:50 PM 51,220,480 mozilla
04/19/2012 10:40 PM 10,229,492 mozilla.paq8pxd
03/20/2003 11:12 AM 9,970,564 mr
04/19/2012 11:02 PM 2,060,447 mr.paq8pxd
04/02/2002 10:21 PM 33,553,445 nci
04/20/2012 12:16 AM 923,176 nci.paq8pxd
07/04/2002 05:00 AM 6,152,192 ooffice
04/20/2012 12:35 AM 1,418,268 ooffice.paq8pxd
04/11/2002 06:56 PM 10,085,684 osdb
04/20/2012 01:05 AM 2,069,961 osdb.paq8pxd
04/02/2002 11:40 PM 6,627,202 reymont
04/20/2012 01:22 AM 812,218 reymont.paq8pxd
03/25/2002 02:34 PM 21,606,400 samba
04/20/2012 02:07 AM 2,853,183 samba.paq8pxd
03/24/2002 01:38 AM 7,251,944 sao
04/20/2012 02:30 AM 3,776,327 sao.paq8pxd
03/25/2002 10:39 AM 41,458,703 webster
04/20/2012 04:00 AM 4,907,185 webster.paq8pxd
04/04/2002 02:00 PM 8,474,240 x-ray
04/20/2012 04:23 AM 3,587,975 x-ray.paq8pxd
12/01/2000 12:54 AM 5,345,280 xml
04/20/2012 04:33 AM 264,688 xml.paq8pxd
 24 File(s) 246,800,159 bytes
 2 Dir(s) 39,701,053,440 bytes free

Edit: decompression checks OK. Decompression took 9 hours. Looking back at compression times, it was also 9 hours, not 15. My bad.
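
The corrected figure can be checked against the log: adding up the twelve "Time ... sec" values printed above comes to roughly 8.8 hours. The dictionary below is just those values copied from the listing.

```python
# "Time ... sec" values copied from the twelve paq8pxd_v4 -8 runs in the log.
times_sec = {
    "dickens": 1124.14, "mozilla": 9270.58, "mr": 1349.45, "nci": 4439.04,
    "ooffice": 1118.72, "osdb": 1776.86, "reymont": 1064.92, "samba": 2701.86,
    "sao": 1378.77, "webster": 5363.06, "x-ray": 1407.96, "xml": 609.45,
}

total_sec = sum(times_sec.values())
total_hours = total_sec / 3600
print(f"{total_sec:.2f} sec = {total_hours:.2f} hours")
```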

Last edited by Matt Mahoney; 21st April 2012 at 03:46.


23rd April 2012, 08:23 #18

kaitz


Code:

paq8pxd_v4 -8 enwik9
 
Total 1000000000 bytes compressed to 135027170 bytes.
Time 88409.59 sec, used 1633424261 bytes of memory
paq8pxd_v4 -8 enwik8
Total 100000000 bytes compressed to 16642941 bytes.
Time 8395.20 sec, used 1633424261 bytes of memory


23rd April 2012, 22:51 #19
Matt Mahoney

Nice improvement on LTCB. http://mattmahoney.net/dc/text.html#1350

I also tested on the Silesia benchmark with -5 through -8. http://mattmahoney.net/dc/silesia.html
23rd April 2012, 23:15 #20
kaitz

It seems that having multiple wrt blocks is messing up the modeling in 'samba'. Maybe building one dictionary for the whole file will improve the results,
that is, building the dictionary from all text blocks and using it for each text block.
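
The difference between per-block and shared dictionaries can be illustrated with a toy word-frequency dictionary. This is only a sketch of the idea, not the XWRT-derived code in paq8pxd: `build_dict`, the three-letter minimum and the tiny `max_words` limit are all assumptions made for the example.

```python
import re
from collections import Counter

WORD_RE = re.compile(rb"[A-Za-z]{3,}")  # toy rule: words of 3+ letters

def build_dict(blocks, max_words=4):
    """Most frequent words across the given blocks, most common first."""
    counts = Counter()
    for block in blocks:
        counts.update(WORD_RE.findall(block))
    # Sort by count (descending), then alphabetically so ties are stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:max_words]]

blocks = [
    b"static int smb_open(int fd) return int",
    b"the server will open the share for the client",
]

per_block = [build_dict([b]) for b in blocks]  # one dictionary per block
shared = build_dict(blocks)                    # one dictionary for all blocks

print(per_block)
print(shared)
```

With per-block dictionaries the word "open" lands at a different index in each block, so the same word is replaced by different codes and the model sees less consistent context across blocks. A shared dictionary gives it a single code everywhere and is stored only once.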

1st November 2012, 10:12 #21
Anitatoom
Member
Join Date: Nov 2012
Location: Thailand
Posts: 4
Thanks: 0; Thanked 0 Times in 0 Posts

Please help me. I can compile paq8px, but I can't compile paq8pxd.

When I compile it, I get these errors:

g++ paq8pxd_v4.cpp -DUNIX -DNOASM -O3 -s -march=nocona -O2 -pipe -o paq8pxd_v4
In file included from paq8pxd_v4.cpp:4393:
wrtpre.cpp: In function 'int min(int, int)':
wrtpre.cpp:39: error: redefinition of 'int min(int, int)'
paq8pxd_v4.cpp:547: error: 'int min(int, int)' previously defined here
wrtpre.cpp: In function 'int max(int, int)':
wrtpre.cpp:40: error: redefinition of 'int max(int, int)'
paq8pxd_v4.cpp:548: error: 'int max(int, int)' previously defined here
paq8pxd_v4.cpp: In function 'void compressRecursive(FILE*, long int, Encoder&, char*, int, int, int)':
paq8pxd_v4.cpp:4645: warning: format '%d' expects type 'int', but argument 2 has type 'long int'
paq8pxd_v4.cpp:4645: warning: format '%d' expects type 'int', but argument 2 has type 'long int'
paq8pxd_v4.cpp: In function 'int main(int, char**)':
paq8pxd_v4.cpp:5066: warning: format '%ld' expects type 'long int', but argument 2 has type 'int'
paq8pxd_v4.cpp:5066: warning: format '%ld' expects type 'long int', but argument 2 has type 'int'
paq8pxd_v4.cpp:5072: warning: format '%ld' expects type 'long int', but argument 2 has type 'int'
paq8pxd_v4.cpp:5072: warning: format '%ld' expects type 'long int', but argument 2 has type 'int'

Last edited by Anitatoom; 1st November 2012 at 10:32.

1st November 2012, 12:55 #22

encode

The Founder

Join Date: May 2006
Location: Moscow, Russia
Posts: 4,147
Thanks: 617; Thanked 558 Times in 213 Posts

Remove min()/max() from wrtpre.cpp (lines 39-40); they are already defined in paq8pxd_v4.cpp.
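
The errors arise because wrtpre.cpp is #included into paq8pxd_v4.cpp, so both pairs of definitions end up in the same translation unit. Deleting the second pair is the simplest fix. As an alternative, here is a minimal sketch that guards the definitions instead; the function bodies are stand-ins and the macro name `PAQ_MINMAX_DEFINED` is made up for illustration, neither is taken from the real sources.

```cpp
#include <cassert>

// Stand-in for paq8pxd_v4.cpp lines 547-548: the first definitions.
#ifndef PAQ_MINMAX_DEFINED   // hypothetical guard macro, not in the real code
#define PAQ_MINMAX_DEFINED
inline int min(int a, int b) { return a < b ? a : b; }
inline int max(int a, int b) { return a < b ? b : a; }
#endif

// Stand-in for the #included wrtpre.cpp lines 39-40. Without the guard this
// second pair would trigger "redefinition of 'int min(int, int)'".
#ifndef PAQ_MINMAX_DEFINED
#define PAQ_MINMAX_DEFINED
inline int min(int a, int b) { return a < b ? a : b; }
inline int max(int a, int b) { return a < b ? b : a; }
#endif

// Small user of the helpers: clamp v into the range [lo, hi].
inline int clampInt(int v, int lo, int hi) {
    assert(lo <= hi);
    return max(lo, min(v, hi));
}
```

The remaining messages in the log are only warnings: a long int is printed with %d, and an int with %ld.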

1st November 2012, 13:42 #23

Anitatoom

Member

Join Date: Nov 2012
Location: Thailand
Posts: 4
Thanks: 0; Thanked 0 Times in 0 Posts

Thank you :D

23rd December 2012, 17:05 #24

kaitz

Member

Join Date: May 2008
Location: Estonia
Posts: 663
Thanks: 268; Thanked 560 Times in 287 Posts

I am working on a newer version.
In WRT preprocessing, digits are now treated like the letters a-z.

Results for nci and xml from the Silesia corpus:

Code:

D:\test>paq8pxd_v6.exe -7 nci
Creating archive nci.paq8pxd with 1 file(s)...
File list (14 bytes)
Compressed from 14 to 16 bytes.
1/1 Filename: nci (33553445 bytes)
Block segmentation:
 0 | text | 33553445 bytes [0 - 33553444]
Compressed from 33553445 to 850808 bytes.
Total 33553445 bytes compressed to 850834 bytes.
Time 4182.96 sec, used 844903304 bytes of memory
D:\test>paq8pxd_v6.exe -7 xml
Creating archive xml.paq8pxd with 1 file(s)...
File list (13 bytes)
Compressed from 13 to 15 bytes.
1/1 Filename: xml (5345280 bytes)
Block segmentation:
 0 | text | 5345279 bytes [0 - 5345278] (wrt: 3546174)
 1 | default | 1 bytes [5345279 - 5345279]
Compressed from 5345280 to 264647 bytes.
Total 5345280 bytes compressed to 264672 bytes.
Time 689.62 sec, used 844905352 bytes of memory

There is a mistake in how the output shows whether a text block was WRT-processed.
Some text files suffer a slight loss of compression.

KZo


23rd December 2012, 18:41 #25

Matt Mahoney

Expert

Join Date: May 2008
Location: Melbourne, Florida, USA
Posts: 3,271
Thanks: 315; Thanked 841 Times in 506 Posts

That's a huge improvement on nci.

24th December 2012, 00:33 #26

kaitz

Member

Join Date: May 2008
Location: Estonia
Posts: 663
Thanks: 268; Thanked 560 Times in 287 Posts

Quote (originally posted by Matt Mahoney):

That's a huge improvement on nci.

The dict mostly contains numbers. I limited numwords to a minimum of 3 bytes (excluding 34, 3g etc., allowing f4, k2 etc.).
With that limit, world95.txt suffers a little less and nci loses about 6 KB.
I am no expert, so I can only guess how much nci can actually be compressed.
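
A small sketch of one possible reading of that rule: digits count as word characters, and a token that starts with a digit must be at least 3 bytes long. This is an assumption drawn from the four examples in the post, not the actual paq8pxd code; `isDictCandidate` and the 2-byte lower bound for other words are hypothetical.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// One possible reading of the rule (an assumption, not the paq8pxd source):
// letters and digits both count as word characters, and a token starting
// with a digit needs at least 3 bytes to enter the dictionary.
bool isDictCandidate(const std::string& token) {
    if (token.size() < 2) return false;            // assumed lower bound for any word
    for (unsigned char c : token)
        if (!std::isalnum(c)) return false;        // letters and digits only
    const bool startsWithDigit =
        std::isdigit(static_cast<unsigned char>(token[0])) != 0;
    return !startsWithDigit || token.size() >= 3;  // "34", "3g" out; "f4", "k2" in
}

// Replays the four examples given in the post.
inline void checkPostExamples() {
    assert(!isDictCandidate("34"));
    assert(!isDictCandidate("3g"));
    assert(isDictCandidate("f4"));
    assert(isDictCandidate("k2"));
}
```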

KZo

24th December 2012, 04:31 #27

Matt Mahoney

Expert

Join Date: May 2008
Location: Melbourne, Florida, USA
Posts: 3,271
Thanks: 315; Thanked 841 Times in 506 Posts

I have so far not found any documentation on the nci file format but I did find http://cactus.nci.nih.gov/ncidb2.1/ which apparently lets you query the same database to show chemical structures by NSC number. For example, the first record in the file is:

Code:

155542
ROtclserve11150011212D 0 0.00000 0.000001049521

31 32 0 0 0 0 0 0 0 0 1 V2000
2.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.0000 0.0000 0.0000 C 0 0 1 0 0 0 0 0 0 0 0 0
3.0000 1.0000 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
3.8660 1.5000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.7321 1.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.5981 1.5000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.5981 2.5000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.7321 3.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.8660 2.5000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.0000 0.0000 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
3.0000 -1.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.1340 -1.5000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.1340 -2.5000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.0000 -3.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.8660 -2.5000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.8660 -1.5000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.0000 0.6200 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
1.3800 0.0000 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
2.0000 -0.6200 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
2.4631 1.3100 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
4.7321 0.3800 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
6.1350 1.1900 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
6.1350 2.8100 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
4.7321 3.6200 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
3.3291 2.8100 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
1.5970 -1.1900 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
1.5970 -2.8100 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
3.0000 -3.6200 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
4.4030 -2.8100 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
4.4030 -1.1900 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
2 3 1 0 0 0 0
3 4 1 0 0 0 0
4 5 2 0 0 0 0
5 6 1 0 0 0 0
6 7 2 0 0 0 0
7 8 1 0 0 0 0
8 9 2 0 0 0 0
4 9 1 0 0 0 0
2 10 1 0 0 0 0
10 11 3 0 0 0 0
2 12 1 0 0 0 0
12 13 2 0 0 0 0
13 14 1 0 0 0 0
14 15 2 0 0 0 0
15 16 1 0 0 0 0
16 17 2 0 0 0 0
12 17 1 0 0 0 0
1 18 1 0 0 0 0
1 19 1 0 0 0 0
1 20 1 0 0 0 0
3 21 1 0 0 0 0
5 22 1 0 0 0 0
6 23 1 0 0 0 0
7 24 1 0 0 0 0
8 25 1 0 0 0 0
9 26 1 0 0 0 0
13 27 1 0 0 0 0
14 28 1 0 0 0 0
15 29 1 0 0 0 0
16 30 1 0 0 0 0
17 31 1 0 0 0 0
M END
> <NSC>
155542

> <CAS_RN>
17424-68-9

> <SMILES>
C[C](NC1=CC=CC=C1)(C#N)C2=CC=CC=C2

> <HASH>
4d5775e8d9fc4fd3

$$$$

The first number (155542) is the NSC number, which you can search for to display the chemical structure.

The numbers 31 32 are the number of atoms and the number of chemical bonds. The next 2 sections have 31 and 32 lines respectively. I don't know what the rest of the line means.

Each line shows (I guess) the x,y,z coordinates of the atom for drawing a 3-D model, followed by the chemical symbol such as C for carbon. I don't know what the rest of the numbers are. I don't know why the z coordinate is always 0 because a carbon atom with 4 single bonds has a tetrahedral structure which should mean that attached atoms would have nonzero z values.

The next 32 lines show the chemical bonds. The first 2 numbers are the numbers of the atoms (from 1 to 31) and the next number is the type of bond (1 for single, 2 for double, etc). These should be predictable by the x,y,z coordinates because they should be at a characteristic distance depending on the atoms, and also by the fact that each atom has a characteristic valence or total number of bonds (4 for C, 3 for N, 2 for O and S, 1 for H, Cl, Br). I don't know what the other numbers are.
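
To illustrate the valence argument, here is a minimal sketch that sums the bond orders touching each atom and compares the sum with the characteristic valences listed above. The `Bond` struct and function names are hypothetical, and the check deliberately ignores charges, radicals and metals.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct Bond { int a; int b; int order; };  // 1-based atom numbers, as in the file

// Sum the bond orders touching each atom; result[i] belongs to atom i+1.
std::vector<int> bondOrderSums(std::size_t atomCount, const std::vector<Bond>& bonds) {
    std::vector<int> sums(atomCount, 0);
    for (const Bond& bond : bonds) {
        assert(bond.a >= 1 && static_cast<std::size_t>(bond.a) <= atomCount);
        assert(bond.b >= 1 && static_cast<std::size_t>(bond.b) <= atomCount);
        sums[bond.a - 1] += bond.order;
        sums[bond.b - 1] += bond.order;
    }
    return sums;
}

// True if every atom's bond-order sum equals the usual valence quoted above.
// Charged atoms or unusual valences (e.g. metals) are outside this sketch.
bool matchesUsualValence(const std::vector<std::string>& symbols,
                         const std::vector<Bond>& bonds) {
    static const std::map<std::string, int> valence = {
        {"C", 4}, {"N", 3}, {"O", 2}, {"S", 2}, {"H", 1}, {"Cl", 1}, {"Br", 1}};
    const std::vector<int> sums = bondOrderSums(symbols.size(), bonds);
    for (std::size_t i = 0; i < symbols.size(); ++i) {
        const auto it = valence.find(symbols[i]);
        if (it == valence.end() || it->second != sums[i]) return false;
    }
    return true;
}
```

In the record above, atom 10 (C) has a single bond to atom 2 and a triple bond to atom 11 (N), giving 1 + 3 = 4 for the carbon and 3 for the nitrogen, so a model that tracks the running sums could narrow down the bond type before reading it.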

Note that the NSC number appears twice. It looks like they are numbered consecutively as well.

SMILES is a string that gives the chemical structure in a canonical form. Since the previous data gives the chemical structure, this string should be predictable. The string omits H and uses = and # to indicate double and triple bonds and numbers to indicate bonds between distant parts of the molecule (forming rings). See http://en.wikipedia.org/wiki/Simplif...e-entry_system

Edit: Notice that the atoms in the SMILES string appear in the same order as in the list of atoms above. SMILES omits H, so these appear last.

I don't know how the hash is computed, but obviously this information is redundant.
Last edited by Matt Mahoney; 24th December 2012 at 05:11.

25th December 2012, 22:01 #28

schnaader

Programmer

Join Date: May 2008
Location: Hessen, Germany
Posts: 666
Thanks: 366; Thanked 287 Times in 149 Posts

Good analysis so far, though there seems to be more - there are some "surprising" records with slightly unexpected variations and SMILES strings, e.g. #156316:

Code:

156316
ROtclserve11150011212D 0 0.00000 0.00000115469
999-99-9
66100 0 0 0 0 0 0 0 0 7 V2000
(snip)table 1(snip)
(snip)table 2(snip)
M CHG 2 37 -1 40 1
M RAD 8 1 2 2 2 3 2 4 2 5 2 6 2 7 2 8 2
M RAD 8 9 2 10 2 11 2 12 2 13 2 14 2 15 2 16 2
M RAD 8 17 2 18 2 19 2 20 2 21 2 22 2 23 2 24 2
M RAD 8 25 2 26 2 27 2 28 2 29 2 30 2 31 2 32 2
M RAD 4 33 2 34 2 35 2 36 2
M END
> <NSC>
156316

> <CAS_RN>
999-99-9

> <SMILES>
[Cl-].CC[N+](CC)(CC)CC.[Cl]1[Nb]234567[Cl][Nb]289%10%11%12[Cl][Nb]138%13%14%15[Cl][Nb]4%13%16%17%18([Cl]5)[Cl][Nb]69%16%19([Cl]7)([Cl]%10)[Cl][Nb]%11%14%17%19([Cl]%15)([Cl]%12)[Cl]%18.[Cl]%20[Nb]%21%22%23%24%25%26[Cl][Nb]%21%27%28%29%30%31[Cl][Nb]%20%22%27%32%33%34[Cl][Nb]%23%32%35%36%37([Cl]%24)[Cl][Nb]%25%28%35%38([Cl]%26)([Cl]%29)[Cl][Nb]%30%33%36%38([Cl]%34)([Cl]%31)[Cl]%37

> <HASH>
dd377c9f0d5029e0

$$$$

Note the fixed 3-digit layout, so the two "line counts" are not separated by a space ("66100" instead of "66 100"); the "M" table, which usually is "M END" only but is longer here; and the huge SMILES string containing some dots (I don't have a clue what these are for) and coded numbers ("%10"). % seems to initiate a two-digit number. EDIT: OK, '%' preceding a label above 9 is described on the Wikipedia page. There is also an example containing a dot, though I haven't found an explanation yet.
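
Because of the fixed 3-digit layout, the counts have to be read by column position rather than by splitting on spaces. A minimal sketch, assuming each count sits in a 3-character column padded on the left with spaces (the padding is an assumption, since whitespace is collapsed in the listings here); `parseCounts` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <utility>

// Read the atom and bond counts from a counts line, assuming each count
// occupies a fixed 3-character column (columns 0-2 and 3-5).
std::pair<int, int> parseCounts(const std::string& countsLine) {
    assert(countsLine.size() >= 6);
    const int atoms = std::atoi(countsLine.substr(0, 3).c_str());
    const int bonds = std::atoi(countsLine.substr(3, 3).c_str());
    return {atoms, bonds};
}
```

With that assumption, `parseCounts(" 66100  0  0")` yields 66 atoms and 100 bonds, while a whitespace split would read the single number 66100.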

Quote (originally posted by Matt Mahoney):

I don't know how the hash is computed, but obviously this information is redundant.

It's 64 bits long, but at least it doesn't seem to be a CRC-64 of the obvious parts - I tested the first record up to the hash line, without the last linefeed, without the last two linefeeds, with the first NSC line removed, and combinations of those.

EDIT: There is some documentation available that describes 64-bit hashcodes named HASHISY and mentions a web resolving service that seems to calculate those hashcodes based on a simplified chemical formula, e.g. "http://cactus.nci.nih.gov/chemical/s...ccccc1/hashisy" returns the hash "3DB0124A3ECF5ECE" as plain MIME text.

This works for simple SMILES strings like #155653, http://cactus.nci.nih.gov/chemical/s...CCOCCO/hashisy and there really seem to be some calculations involved (instead of a simple file/directory lookup) as e.g. the lower case version http://cactus.nci.nih.gov/chemical/s...ccocco/hashisy doesn't return the same hash - on the other hand, HTTP 404 codes are returned in invalid cases like http://cactus.nci.nih.gov/chemical/s...c1cccc/hashisy or http://cactus.nci.nih.gov/chemical/s...nvalid/hashisy

EDIT: To get less off-topic, a dictionary approach seems to help because of the little variety of numbers coming from their meaning (graphical representation of a structural formula). Fractional parts of numbers have a geometric meaning like .8660 (cosine of 30 degrees). Another interesting thing is that this knowledge could be used to improve compression some more - there are pairs of fractional parts that add to 1 (e.g. .1340 and .8660), so only one of them has to be in the dictionary and the other one could be generated automatically.
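
A tiny sketch of the complement idea, assuming fractional parts are stored as 4-digit integers (.8660 becomes 8660). The function names are made up for illustration.

```cpp
#include <cassert>
#include <cmath>

// Fractional parts stored as 4-digit integers: .8660 -> 8660.
// Given one member of a pair that adds up to 1, derive the other.
int complementFraction(int frac4) {
    assert(frac4 > 0 && frac4 < 10000);
    return 10000 - frac4;
}

// Cosine of an angle given in degrees, scaled by 10000 and rounded,
// i.e. the value as it would appear with 4 decimals.
int cosFraction4(double degrees) {
    const double pi = std::acos(-1.0);
    return static_cast<int>(std::lround(std::cos(degrees * pi / 180.0) * 10000.0));
}
```

Here `cosFraction4(30.0)` gives 8660 and `complementFraction(8660)` gives 1340, so only one of the two would need a dictionary entry. The same holds for .4030 and .5970 in the first record.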
Last edited by schnaader; 25th December 2012 at 23:28.

http://schnaader.info
Damn kids. They're all alike.

25th December 2012, 23:12 #29

m^2

Member

Join Date: Sep 2008
Location: Ślůnsk, PL
Posts: 1,610
Thanks: 30; Thanked 65 Times in 47 Posts

Did you contact Sebastian Deorowicz? He's the creator of the Silesia compression corpus (SCC) and may be able to get you documentation.

https://extrememoderate.wordpress.com

26th December 2012, 20:58 #30

Matt Mahoney

Expert

Join Date: May 2008
Location: Melbourne, Florida, USA
Posts: 3,271
Thanks: 315; Thanked 841 Times in 506 Posts

A . means no chemical bond. http://www.daylight.com/dayhtml/doc/...ry.smiles.html gives more details on the SMILES format.

All times are GMT +3.