Quote Originally Posted by Kennon Conrad View Post:
Memory mapping seems like a good idea, but from what I am seeing it is not available with standard MinGW 32 bit gcc. But then again, the time to move up from 32 bit gcc under Windows may be approaching.

I'm not very familiar with MinGW, but memory-mapping should be part of the Windows API, so if you can call Windows API functions, it should just work. The compiler doesn't need to support it. I've built and run code that uses mmap in Cygwin and it worked fine.
I don't know a whole lot about how the Windows OS handles I/O. I do know that writing one giant buffer doesn't work so well. Tree did that originally and had a pretty long stall at the end to flush the buffer, much like I seem to be seeing with LzTurbo.
I/O isn't supposed to be difficult. All that's really important is that buffering exists at various levels and reads and writes go to and from the device in blocks. If you call system calls directly, you should read and write in chunks (ideally multiples of the system's block size, which is always a power of 2); the reason is that system calls are expensive and too many will cause a performance hit. If you use stdio, you shouldn't have to think about this. Buffering a huge amount and writing it all at once uses lots of memory and it cuts off the opportunity to pipeline (it takes time to physically write data to the device and this can happen in parallel with computation if you start writing while the computation is still underway).
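A minimal sketch of the chunked pattern just described, assuming a 64 KB chunk size (any generous power-of-two multiple of the device block size works):
Code:
#include <stdio.h>
#include <stdlib.h>

#define CHUNK (1 << 16) /* 64 KB: a power-of-two multiple of typical block sizes */

/* Copy a stream in fixed-size chunks: few system calls, bounded memory,
 and the OS can overlap physical writes with the next read. */
int copy_chunked(FILE *in, FILE *out)
{
 char *buf = malloc(CHUNK);
 size_t n;
 if (buf == NULL)
 return -1;
 while ((n = fread(buf, 1, CHUNK, in)) > 0) {
 if (fwrite(buf, 1, n, out) != n) {
 free(buf);
 return -1;
 }
 }
 free(buf);
 return ferror(in) ? -1 : 0;
}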
Quote Originally Posted by nburns View Post:
I'm not very familiar with MinGW, but memory-mapping should be part of the Windows API, so if you can call Windows API functions, it should just work. The compiler doesn't need to support it. I've built and run code that uses mmap in Cygwin and it worked fine.

I saw this on MinGW's site: http://www.mingw.org/node/21.
Quote Originally Posted by nburns View Post:
I/O isn't supposed to be difficult. All that's really important is that buffering exists at various levels and reads and writes go to and from the device in blocks. ...

Yes, I/O shouldn't be difficult. That's why the 7 second stall when decompressing with LzTurbo surprises me.
I was testing tree v0.12 in 64 bit Linux. On a small text file (enwik6, 1 MB), TreeCompress.exe under Wine 1.6 gives an out of memory error. Next I compiled with gcc 4.8.2 -O3 and I get a segmentation fault. Then I tried options -O3 -m32 and finally it runs, but enwik8 gives fast but poor compression and enwik9 gives an error (symbol limit exceeded). Here is the output:
Code:
matt@matt-Latitude-E6510:~$ testtree enwik8
TreePreEncode enwik8 enwik8.tpre
Pre and capital encoding 100000000 bytes
Wrote 1 byte header and 103809101 encoded bytes in 450.853 seconds.
real 0m0.454s
user 0m0.333s
sys 0m0.120s
TreeCompress enwik8.tpre enwik8.tcom
Allocated 1778384896 bytes for data processing
Read 103809102 byte input file
cap encoded: 1, UTF8 compliant 1
Found 103430933 symbols, 0 defines, maximum unicode value 0x28b46
1: 103430933 syms, dict. size 0, 4.9009 bits/sym, o0e 63362903 bytes
Common prefix scan 34/34 79.0 - 2ffff.2ffff
Run time 1034 seconds.
real 0m1.038s
user 0m0.888s
sys 0m0.148s
TreeBitEncode enwik8.tcom enwik8.tree
cap encoded: 1, UTF8_compliant 1
Read 103430933 symbols including 0 definition symbols
Encoded 103430933 level 1 symbols
Compressed file size: 78293312 bytes, dictionary size: 6039 symbols
elapsed time = 2181.430000 seconds.
real 0m2.376s
user 0m1.989s
sys 0m0.196s
TreeDecode enwik8.tree enwik8.out
Decompressed 100000000 bytes in 11178482 msec
real 0m5.594s
user 0m11.080s
sys 0m0.100s
 97660 -rw-rw-r-- 1 matt matt 100000000 Sep 30 15:58 enwik8.out
101380 -rw-rw-r-- 1 matt matt 103809102 Sep 30 15:57 enwik8.tcom
101384 -rw-rw-r-- 1 matt matt 103809102 Sep 30 15:57 enwik8.tpre
 76460 -rw-rw-r-- 1 matt matt 78293312 Sep 30 15:58 enwik8.tree
a1fa5ffddb56f4953e226637dabbb36a enwik8
a1fa5ffddb56f4953e226637dabbb36a enwik8.out
matt@matt-Latitude-E6510:~$ testtree enwik9
TreePreEncode enwik9 enwik9.tpre
Pre and capital encoding 1000000000 bytes
Wrote 1 byte header and 1041507612 encoded bytes in 6221.491 seconds.
real 0m24.686s
user 0m4.428s
sys 0m1.795s
TreeCompress enwik9.tpre enwik9.tcom
Allocated 1778384896 bytes for data processing
Read 1041507613 byte input file
cap encoded: 1, UTF8 compliant 1
Found 1039028503 symbols, 0 defines, maximum unicode value 0x28b4e
1: 1039028503 syms, dict. size 0, 4.9718 bits/sym, o0e 645730267 bytes
Common prefix scan 634/634 30fb.0 - 2ffff.2ffff
Run time 6834 seconds.
real 0m20.720s
user 0m5.751s
sys 0m1.085s
TreeBitEncode enwik9.tcom enwik9.tree
cap encoded: 1, UTF8_compliant 1
ERROR - symbol limit of 250000000 symbols exceeded
real 0m4.531s
user 0m2.404s
sys 0m0.605s
TreeDecode enwik9.tree enwik9.out
fopen error file 'enwik9.tree' not found
real 0m1.422s
user 0m0.006s
sys 0m0.000s
 0 -rw-rw-r-- 1 matt matt 0 Sep 30 15:59 enwik9.out
1017104 -rw-rw-r-- 1 matt matt 1041507613 Sep 30 15:59 enwik9.tcom
1017104 -rw-rw-r-- 1 matt matt 1041507613 Sep 30 15:59 enwik9.tpre
e206c3450ac99950df65bf70ef61a12d enwik9
d41d8cd98f00b204e9800998ecf8427e enwik9.out
rm: cannot remove ‘enwik9.tree’: No such file or directory
Here is the test script.
Code:
echo TreePreEncode 1ドル 1ドル.tpre
time TreePreEncode 1ドル 1ドル.tpre
echo TreeCompress 1ドル.tpre 1ドル.tcom
time TreeCompress 1ドル.tpre 1ドル.tcom
echo TreeBitEncode 1ドル.tcom 1ドル.tree
time TreeBitEncode 1ドル.tcom 1ドル.tree
echo TreeDecode 1ドル.tree 1ドル.out
time TreeDecode 1ドル.tree 1ドル.out
ls -las 1ドル.*
md5sum 1ドル 1ドル.out
rm 1ドル.tpre 1ドル.tcom 1ドル.tree 1ドル.out
Also, times are not calculated correctly by the programs. When I compile I get warnings about printf format errors (long passed as "%u"), which matters because in 64 bit gcc, long is 64 bits.
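For reference, a minimal example of the fix (the variable names are illustrative, not Tree's actual ones): with 64 bit gcc on Linux, long is 64 bits, so printing it with %u is undefined behavior; the matching length modifier or the <inttypes.h> macros avoid it.
Code:
#include <stdio.h>
#include <inttypes.h>

int main(void)
{
 long run_time = 1034; /* 64 bits under the LP64 model used by 64-bit gcc */
 uint64_t encoded_bytes = 103809101;
 printf("Run time %ld seconds.\n", run_time); /* %ld, not %u, for long */
 printf("Wrote %" PRIu64 " encoded bytes.\n", encoded_bytes); /* portable 64-bit */
 return 0;
}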
Let me get back to you. I just installed minGW64 last night and had to change some variables to 64 bits to get the code running.
Quote Originally Posted by Kennon Conrad View Post:
I saw this on MinGW's site: http://www.mingw.org/node/21.

I've seen that, too. With Cygwin you can use the POSIX APIs, so you can call mmap like in Linux. With MinGW, you'll have to find the native Win32 version and call that instead.
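For reference, the native Win32 counterpart to mmap is the CreateFileMapping/MapViewOfFile pair. A minimal read-only sketch (error handling mostly trimmed; the file name is just an example):
Code:
#include <windows.h>
#include <stdio.h>

int main(void)
{
 /* Open the file, create a read-only mapping object, map the whole file. */
 HANDLE file = CreateFileA("enwik8", GENERIC_READ, FILE_SHARE_READ, NULL,
 OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
 if (file == INVALID_HANDLE_VALUE)
 return 1;
 HANDLE mapping = CreateFileMappingA(file, NULL, PAGE_READONLY, 0, 0, NULL);
 const unsigned char *data =
 (const unsigned char *)MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0);
 if (data != NULL) {
 printf("first byte: %02x\n", data[0]); /* read the file through the mapping */
 UnmapViewOfFile((void *)data);
 }
 CloseHandle(mapping);
 CloseHandle(file);
 return 0;
}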
Quote Originally Posted by nburns View Post:
I've seen that, too. With Cygwin you can use the POSIX APIs, so you can call mmap like in Linux. With MinGW, you'll have to find the native Win32 version and call that instead.

Now that I have the 64-bit compiler, I think I may have more options. But I'm not sure it would have much impact on TreeDecode since it already has a separate thread that reads the decoded symbols from a circular buffer, looks up the dictionary string and sends the data to output via ~1 MB ping-pong buffers.
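For readers unfamiliar with the pattern: a ping-pong scheme alternates between two buffers so one can be flushed to disk while the other is being filled. A minimal pthread sketch of the idea (this is not Tree's actual code; the names and the 1 MB size are illustrative):
Code:
#include <pthread.h>
#include <stdio.h>

#define BUF_SIZE (1 << 20) /* ~1 MB per buffer */

static unsigned char bufs[2][BUF_SIZE];
static size_t lens[2];
static int ready[2]; /* 1 = filled and waiting to be written */
static int done;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

/* Writer thread: flushes whichever buffer is marked ready, alternating. */
static void *writer_thread(void *arg)
{
 FILE *out = (FILE *)arg;
 int i = 0;
 for (;;) {
 pthread_mutex_lock(&lock);
 while (!ready[i] && !done)
 pthread_cond_wait(&cond, &lock);
 if (!ready[i]) { /* done and nothing left to flush */
 pthread_mutex_unlock(&lock);
 return NULL;
 }
 pthread_mutex_unlock(&lock);
 fwrite(bufs[i], 1, lens[i], out); /* overlaps with filling the other buffer */
 pthread_mutex_lock(&lock);
 ready[i] = 0;
 pthread_cond_broadcast(&cond);
 pthread_mutex_unlock(&lock);
 i ^= 1;
 }
}

/* Producer side: mark buffer i full and move on to the other one. */
static void submit(int i, size_t len)
{
 pthread_mutex_lock(&lock);
 lens[i] = len;
 ready[i] = 1;
 pthread_cond_broadcast(&cond);
 pthread_mutex_unlock(&lock);
}

/* Producer must wait until the target buffer has been drained before refilling it. */
static void wait_free(int i)
{
 pthread_mutex_lock(&lock);
 while (ready[i])
 pthread_cond_wait(&cond, &lock);
 pthread_mutex_unlock(&lock);
}

static void finish(void)
{
 pthread_mutex_lock(&lock);
 done = 1;
 pthread_cond_broadcast(&cond);
 pthread_mutex_unlock(&lock);
}
The producer fills bufs[0] and calls submit(0, n), then fills bufs[1] while the writer flushes bufs[0], calling wait_free(i) before each refill and finish() at the end.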
Quote Originally Posted by Matt Mahoney View Post:
I was testing tree v0.12 in 64 bit Linux. On a small text file (enwik6, 1 MB), TreeCompress.exe under Wine 1.6 gives an out of memory error. Next I compiled with gcc 4.8.2 -O3 and I get a segmentation fault. Then I tried options -O3 -m32 and finally it runs, but enwik8 gives fast but poor compression and enwik9 gives an error (symbol limit exceeded).
Also, times are not calculated correctly by the programs. When I compile I get warnings about printf format errors (long passed as "%u"), which matters because in 64 bit gcc, long is 64 bits.

Thanks for the detailed problem report. I am sorry trying to build a 64-bit executable didn't go very well. The attached version of TreeCompress has updated code; it works when a 64-bit executable is built with MinGW64, and it still works with MinGW. I can't build 32-bit executables with MinGW64, but I think that is a setup problem. I do not know if it will work under Wine - I used %Iu in the printf's. The pointers could use some clean-up work, and this version uses the same memory size for 32 and 64 bit code. The MAX_MEMORY_USAGE constant can be increased before making a 64-bit build if you want to try more memory, which does seem to work. The results are not exactly the same, probably because memory structure sizes have changed even though overall memory allocation is the same for the 32 and 64 bit executables.
I compiled the new TreeCompress64 in Linux and it works.
enwik8 -> 21,974,704, 0.4+7299+3.4 sec, 1.8 sec. (2.67 GHz M620).
Input:
1,000,000,000 bytes - enwik9
Output:
Harddisk:
269,503,577 bytes, 20.647 sec. - 9.669 sec., Lzturbo -32 -p1
250,730,516 bytes, 1083.884 sec. - 11.688 sec., Lzturbo -29 -p1
202,241,389 bytes, 1124.972 sec. - 12.045 sec., Lzturbo -39 -p1
193,605,881 bytes, 1317.063 sec. - 19.420 sec., Lzturbo -49 -p1
177,321,380 bytes, xxxx.xxx sec. - 13.581 sec., TreeComp v0.12 -t1
SSD:
269,503,577 bytes, 19.497 sec. - 3.622 sec., Lzturbo -32 -p1
250,730,516 bytes, 1073.668 sec. - 3.672 sec., Lzturbo -29 -p1
202,241,389 bytes, 1114.174 sec. - 4.169 sec., Lzturbo -39 -p1
193,605,881 bytes, 1305.626 sec. - 11.493 sec., Lzturbo -49 -p1
177,321,380 bytes, xxxx.xxx sec. - 11.215 sec., TreeComp v0.12 -t1
RAM DISK:
269,503,577 bytes, 19.235 sec. - 1.981 sec., Lzturbo -32 -p1
250,730,516 bytes, 1069.845 sec. - 1.694 sec., Lzturbo -29 -p1
202,241,389 bytes, 1111.677 sec. - 2.477 sec., Lzturbo -39 -p1
193,605,881 bytes, 1302.485 sec. - 9.624 sec., Lzturbo -49 -p1
177,321,380 bytes, xxxx.xxx sec. - 11.240 sec., TreeComp v0.12 -t1
cleanmem before every run.
Input:
1,000,000,000 bytes - enwik9
Output:
Harddisk:
269,503,577 bytes, 13.549 sec. - 9.404 sec., Lzturbo -32 -p2
250,730,516 bytes, 1083.920 sec. - 11.536 sec., Lzturbo -29 -p2
202,241,389 bytes, 1126.518 sec. - 12.110 sec., Lzturbo -39 -p2
193,605,881 bytes, 1316.625 sec. - 19.524 sec., Lzturbo -49 -p2
177,321,380 bytes, xxxx.xxx sec. - 10.681 sec., TreeComp v0.12
SSD:
269,503,577 bytes, 11.220 sec. - 2.377 sec., Lzturbo -32 -p2
250,730,516 bytes, 1074.534 sec. - 3.560 sec., Lzturbo -29 -p2
202,241,389 bytes, 1114.699 sec. - 4.213 sec., Lzturbo -39 -p2
193,605,881 bytes, 1305.269 sec. - 11.452 sec., Lzturbo -49 -p2
177,321,380 bytes, xxxx.xxx sec. - 7.465 sec., TreeComp v0.12
RAM DISK:
269,503,577 bytes, 11.060 sec. - 1.269 sec., Lzturbo -32 -p2
250,730,516 bytes, 1071.082 sec. - 1.705 sec., Lzturbo -29 -p2
202,241,389 bytes, 1114.523 sec. - 2.398 sec., Lzturbo -39 -p2
193,605,881 bytes, 1303.752 sec. - 9.625 sec., Lzturbo -49 -p2
177,321,380 bytes, xxxx.xxx sec. - 7.436 sec., TreeComp v0.12
Tested tree 0.12 compiled with gcc 4.8.2 -O3 on enwik9 in 64 bit Ubuntu. TreeCompress gave a segmentation fault after 2.6 days but the output was still OK.
enwik9.tpre -> 1,041,507,613, 4.4 sec
enwik9.tcom -> 371,204,373, 226246 sec, gives segmentation fault
enwik9.tree -> 177,542,372, 29 sec (note different size)
enwik9.out -> 16.8 sec (verifies OK)
http://mattmahoney.net/dc/text.html#1775
Quote Originally Posted by Matt Mahoney View Post:
Tested tree 0.12 compiled with gcc 4.8.2 -O3 on enwik9 in 64 bit Ubuntu. TreeCompress gave a segmentation fault after 2.6 days but the output was still OK.
enwik9.tpre -> 1,041,507,613, 4.4 sec
enwik9.tcom -> 371,204,373, 226246 sec, gives segmentation fault
enwik9.tree -> 177,542,372, 29 sec (note different size)
enwik9.out -> 16.8 sec (verifies OK)
http://mattmahoney.net/dc/text.html#1775

Is this the same build that ran okay for enwik8? If so, that is odd.
I have not seen this problem on Windows. Since the output gets saved periodically during compression, there are two lines of code at the end of TreeCompress.c that get executed for the first time when the program finishes. The first is the free(char_buffer) call. The second is the fprintf for the execution time. I suspect the free call is causing the segmentation fault, but I cannot be sure without more information or installing Ubuntu. If you still have the enwik9.tcom file, it would be interesting to run it through TreeCompress again with and without the free call commented out.
The different/larger size was caused by the program having fewer tree nodes available, since pointers are now 8 bytes instead of 4. It tries to compensate by building the common prefix tree in smaller pieces, but in this case there probably weren't enough nodes available to fit some of the individual pieces (for instance, it may not have been able to fit all the common prefixes starting with two spaces).
Memory allocation problems like this are often easy to spot with Valgrind.
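For example, a typical Memcheck invocation looks like this (the flags shown are standard Valgrind options; the file names are from the run above):
Code:
valgrind --leak-check=full --track-origins=yes ./TreeCompress enwik9.tpre enwik9.tcom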
I didn't look at the source code but the end of output for TreeCompress looked like this:
Code:
Common prefix scan 16/19 4d6238.0 - 57d925.78b78b
Created 4791050 of 21753068 string nodes
Common prefix scan 17/19 57d926.0 - 641ed3.78b78b
Created 4773095 of 21753068 string nodes
Common prefix scan 18/19 641ed4.0 - 731f64.78b78b
Created 4756924 of 21753068 string nodes
Common prefix scan 19/19 731f65.0 - 78b78b.78b78b
Created 1638985 of 21753068 string nodes
Common prefix scan 19/19 731f65.0 - 78b78b.78b78b, score[30000] = 22.65
/home/matt/bin/testtree: line 4: 19521 Segmentation fault (core dumped) TreeCompress 1ドル.tpre 1ドル.tcom
real 3770m46.907s
user 3757m50.924s
sys 2m20.234s
I compiled the new version of TreeCompress for enwik9. I didn't test enwik8 again with it.
It seems like since your program is designed for 32 bits you could use ints instead of pointers so the 64 bit version doesn't take more space.
Quote Originally Posted by Matt Mahoney View Post:
I didn't look at the source code but the end of output for TreeCompress looked like this: ...
It seems like since your program is designed for 32 bits you could use ints instead of pointers so the 64 bit version doesn't take more space.

Thanks, that's all I needed to understand what happened. The program crashed when building the data structure for one-pass substitution of new dictionary entries because the pointers in the structure are twice as big as before.
I'm not sure that 64 bit programs are guaranteed to allocate memory with addresses <4G. I have changed most of the structure pointers to structure indexes, which gets the memory usage back down but requires an address calculation when the index is used.
I will try to post an updated 64-bit version, but it won't be right away. I need to make a few more changes to make the process identical.
Quote Originally Posted by Kennon Conrad View Post:
Thanks, that's all I needed to understand what happened. The program crashed when building the data structure for one-pass substitution of new dictionary entries because the pointers in the structure are twice as big as before.
I'm not sure that 64 bit programs are guaranteed to allocate memory with addresses <4G. I have changed most of the structure pointers to structure indexes, which gets the memory usage back down but requires an address calculation when the index is used.
I will try to post an updated 64-bit version, but it won't be right away. I need to make a few more changes to make the process identical.

I don't see any way around that when building for 64-bit. Pointers will be 8 bytes.
Quote Originally Posted by nburns View Post:
I don't see any way around that when building for 64-bit. Pointers will be 8 bytes.

The problem with creating a 64-bit version of TreeCompress v0.12 is that it won't produce identical results without increasing RAM usage and/or run time. The main reason for going to 64-bit code was to provide more RAM for compression. One of these days, though, I need to try making a 64-bit version of TreeDecode. Using 64 bit values would make the bit stream handling much easier.
Quote Originally Posted by Kennon Conrad View Post:
The problem with creating a 64-bit version of TreeCompress v0.12 is that it won't produce identical results without increasing RAM usage and/or run time. ...

What's wrong with replacing pointers with 32-bit array indexes? That won't work if the array has more than 4G elements, but how often does that happen, anyway?
Array indexing might be slower than using pointers directly, but I'd try it before assuming it will be too slow. The machine code honestly might change less than you think and come out almost equivalent. Even if your assumptions are on target, it might cost less than you expect.
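To make the suggestion concrete, here is a generic sketch (not Tree's actual node layout) of links stored as 32-bit indexes into one pre-allocated array instead of pointers:
Code:
#include <stdint.h>

#define NIL 0xFFFFFFFFu /* sentinel index in place of a NULL pointer */

/* 8 bytes of links on both 32- and 64-bit builds; pointers would be 16 on 64-bit. */
struct node {
 uint32_t child; /* index of first child, or NIL */
 uint32_t sibling; /* index of next sibling, or NIL */
};

static struct node *nodes; /* one big array allocated up front */
static uint32_t num_nodes;

static uint32_t new_node(void)
{
 nodes[num_nodes].child = NIL;
 nodes[num_nodes].sibling = NIL;
 return num_nodes++; /* the "address" is just the index */
}

/* Dereferencing becomes base + index: one extra address calculation per access. */
static struct node *deref(uint32_t i)
{
 return &nodes[i];
}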
Quote Originally Posted by nburns View Post:
What's wrong with replacing pointers with 32-bit array indexes? That won't work if the array has more than 4G elements, but how often does that happen, anyway?
Array indexing might be slower than using pointers directly, but I'd try it before assuming it will be too slow. ...

My only issue is that I think slowing down 32 bit code so that 64 bit code (that won't run faster) can produce the same results is silly. The time penalty is not that much.
Quote Originally Posted by Kennon Conrad View Post:
My only issue is that I think slowing down 32 bit code so that 64 bit code (that won't run faster) can produce the same results is silly. The time penalty is not that much.

When AMD came out with the 64-bit extensions to x86, they were supposed to have other overdue improvements, like access to additional registers. So there may be other benefits to building for 64 bits.
Quote Originally Posted by nburns View Post:
When AMD came out with the 64-bit extensions to x86, they were supposed to have other overdue improvements, like access to additional registers. So there may be other benefits to building for 64 bits.

For either my A8-5500 or i7-4790K, with TreeCompress and TreeDecode built with MinGW, execution time has been longer with 64 bit executables. There are extra calculations for addresses that I haven't been able to get around yet. For TreeDecode, all I really want is to be able to shift 64 bit registers, but I can't seem to get there without ending up with a lot more overhead from what appears to be relative addressing.
Quote Originally Posted by Kennon Conrad View Post:
For either my A8-5500 or i7-4790K, with TreeCompress and TreeDecode built with MinGW, execution time has been longer with 64 bit executables. ...

That's possible. But it can be hard to tell where the problem really is without lots of digging and experimentation. The calculations happen on the CPU. Going out to memory is dramatically slower, so usually optimizing memory accesses is the biggest concern.
Hi Kennon,
I have mostly been a reader of this forum for many years.
You have written one of the most exciting algorithms/programs of recent years, in my opinion.
It's something original and in its own league performance-wise at what it does.
Because of this, please focus on optimizing its strengths instead of turning it into something we already have a dozen of.
If you were able to cut compression time in half (which probably isn't realistic) but lost up to 20% compression, this algorithm wouldn't be interesting any more.
A fork of this with new ideas would surely be cool, but please stick with the strengths it has. More compression, faster decompression, and faster compression without losing in either of the first two categories would be welcome. ;)
EDIT:
By the way, why hasn't anyone moved this thread to General yet?
Quote Originally Posted by Simon Berger View Post:
Hi Kennon,
I have mostly been a reader of this forum for many years. You have written one of the most exciting algorithms/programs of recent years, in my opinion. It's something original and in its own league performance-wise at what it does. ...

Thanks for the feedback. I do intend to continue to optimize Tree's strengths, but also to investigate improving compression speed. I am convinced significantly faster algorithms that can produce files with nearly as much compression are possible. You are right, a 20% loss in compression would cause Tree to lose its best/most unique quality, so that is not interesting to me.
I found a bug in TreeBitEncode that caused decoding of small files to crash; it has been fixed. TreeDecode has been modified to use smaller and more appropriately sized variables, decreasing RAM usage and improving decoding speed by 3-4%.
Instead of making TreeCompress64 run like TreeCompress, just slower, I went in a completely different direction. TreeCompress64 is a very experimental version that represents a side branch on the path toward linear decision making. The intent is to provide an indication of how much the compression ratio will be impacted by making decisions using only partial data and by using a new scoring formula more tailored to linear compression.
Main functional changes:
1. All input symbols are converted to Tree's internal 32-bit token codes when the input is read and converted back when the output is written, rather than converting symbols on the fly when the common prefix trees are built. This makes the program use significantly more RAM to store the data but allows the code to be simpler and improves execution speed (a minimal sketch of the idea follows this list).
2. Maximum RAM usage is approximately 6x the input file size rather than about 1.7 GB maximum.
3. Common prefix trees are not built in sections. Instead, the program builds as much of a common prefix tree as will fit in memory and makes tokenization decisions based on only the window of data that was used to build the tree.
4. The scoring formula has been adjusted to promote high frequency strings more/long strings less than before.
5. Token creation management only allows an average of about 3000 unusable symbols per cycle instead of 7500.
6. Adjustments for two instance matches that are near each other have been eliminated.
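A minimal sketch of the widening described in change #1, assuming plain byte input; the function names are illustrative, and the real code also handles UTF-8 sequences and the capital-encode symbol:
Code:
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Widen every input symbol to a fixed 32-bit token code up front, so later
 passes index and compare fixed-width tokens instead of decoding on the fly.
 This alone accounts for ~4 bytes of RAM per input byte. */
uint32_t *tokenize(const unsigned char *data, size_t len)
{
 uint32_t *tokens = malloc(len * sizeof(uint32_t));
 size_t i;
 if (tokens == NULL)
 return NULL;
 for (i = 0; i < len; i++)
 tokens[i] = data[i]; /* real code maps multi-byte symbols to one token */
 return tokens;
}

/* Inverse mapping when the output is written. */
void detokenize(const uint32_t *tokens, size_t len, FILE *out)
{
 size_t i;
 for (i = 0; i < len; i++)
 fputc((int)(tokens[i] & 0xffu), out);
}
Storing 4 bytes per input symbol is consistent with the roughly 6x-input RAM figure in change #2.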
When compressing enwik9, the first 5 symbols created represent the following strings. The window size for the first decision cycle is about 19 MB. "C" is the capital encode symbol.
1:
</text>
</revision>
</page>
<page>
<title>C
2:
"e;
3:
</comment>
<text xml:space="preserve">
4:
Cz</timestamp>
<contributor>
<
5:
[http://www.
One other note is that my results have not been 100% consistent. I think they may depend on the runtime libraries.
Results for enwik8:
TreeCompress => 21,976,316 bytes, encode 3916 sec., decode 0.78 sec.
TreeParagraphs + TreeWords + TreeCompress => 22,075,700 bytes, encode 1906 sec., decode 0.76 sec.
TreeCompress64 => 22,196,288 bytes, encode 931 sec., decode 0.76 sec.
TreeParagraphs + TreeWords + TreeCompress64 => 22,304,524 bytes, encode 730 sec., decode 0.74 sec.
Results for enwik9:
TreeCompress => 177,340,072 bytes, encode 116,473 sec., decode 7.5 sec.
TreeParagraphs + TreeWords + TreeCompress => 178,774,864 bytes, encode 36,652 sec., decode 7.3 sec.
TreeCompress64 => 180,516,660 bytes, encode 30,147 sec., decode 7.6 sec.
TreeParagraphs + TreeWords + TreeCompress64 => 180,941,504 bytes, encode 20,834 sec., decode 7.4 sec.
TreeDecode.zip is 10,525 bytes.
Overall results are 45% - 75% faster compression at the cost of 1% - 2% of compressed file size and more RAM for enwik9, but less RAM for enwik8. This is encouraging for single pass compression because it appears that making tokenization decisions based on a small initial window does not hurt compression too much. This is significant for the algorithm I have in mind because I don't have nearly enough RAM to fit all of enwik9's common prefix tree in memory until a lot of token substitutions have been made.
Do not try TreeCompress64 if your computer doesn't easily have 6x the file size available. It will pretty much freeze everything up.
A 64 bit shift should be faster with a 64 bit compile. The function shift() is 10 instructions in 32 bits and 3 instructions in 64 bit. I cheated a bit by making the shift amount the first parameter so it is already in RCX (at least in Windows). Otherwise there is an extra instruction to move it there.
Also, I posted your results to LTCB. http://mattmahoney.net/dc/text.html#1773
Code:
C:\tmp>type a.c
unsigned long long shift(int x, unsigned long long a) {
 return a << x;
}

C:\tmp>gcc -O3 -S -m32 a.c

C:\tmp>type a.s
 .file "a.c"
 .text
 .p2align 4,,15
 .globl _shift
 .def _shift; .scl 2; .type 32; .endef
_shift:
 movl 4(%esp), %ecx
 movl 8(%esp), %eax
 movl 12(%esp), %edx
 shldl %cl, %eax, %edx
 sall %cl, %eax
 testb 32,ドル %cl
 je L2
 movl %eax, %edx
 xorl %eax, %eax
L2:
 rep ret
 .ident "GCC: (rev5, Built by MinGW-W64 project) 4.8.1"

C:\tmp>gcc -O3 -S -m64 a.c

C:\tmp>type a.s
 .file "a.c"
 .text
 .p2align 4,,15
 .globl shift
 .def shift; .scl 2; .type 32; .endef
 .seh_proc shift
shift:
 .seh_endprologue
 movq %rdx, %rax
 salq %cl, %rax
 ret
 .seh_endproc
 .ident "GCC: (rev5, Built by MinGW-W64 project) 4.8.1"
Quote Originally Posted by Matt Mahoney View Post:
A 64 bit shift should be faster with a 64 bit compile. The function shift() is 10 instructions in 32 bits and 3 instructions in 64 bit. I cheated a bit by making the shift amount the first parameter so it is already in RCX (at least in Windows). Otherwise there is an extra instruction to move it there.
Also, I posted your results to LTCB. http://mattmahoney.net/dc/text.html#1773 ...

Thanks. I would like to use 64 bit code for shifts, but the problem is that 64-bit code is expanding the code in lots of other places. For instance, the following code
Code:
temp_input_word = temp_input_word >> 21;
symbol_bits += symbol_bits_table[temp_input_word];
produces this when compiled for 32 bits:
Code:
shrl 21,ドル %esi
movzbl _symbol_bits_table(%esi), %eax
addb %al, _symbol_bits
and this when compiled for 64 bits:
Code:
leaq symbol_bits_table(%rip), %rax
shrl 21,ドル %ebx
movzbl (%rax,%rbx), %eax
addb %al, symbol_bits(%rip)
Quote Originally Posted by Kennon Conrad View Post:
Thanks. I would like to use 64 bit code for shifts, but the problem is that 64-bit code is expanding the code in lots of other places. ...

What's wrong with the 64-bit version? Does that code run slower? The difference is that the 64-bit version moves the table address into a register, and the 32-bit version uses it directly from memory. I'm not sure why, but it might reflect a deliberate change in the instruction set. The new version is more RISC-y.
Quote Originally Posted by nburns View Post:
What's wrong with the 64-bit version? Does that code run slower? ...

The 64-bit version of TreeDecode takes 7% longer to run. I suspect it is mostly from the extra instructions like the one above. There are a lot of them. I think it would more than wipe out the savings from being able to efficiently perform shifts on 64 bit values. I wonder if I need a different compiler.
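One thing that might be worth trying (an assumption on my part, not something Tree is known to do): keep the table base in a local pointer so the RIP-relative leaq is emitted once outside the hot loop rather than at every access. Whether this helps depends on what the optimizer already does with the surrounding code:
Code:
#include <stdint.h>

extern uint8_t symbol_bits_table[]; /* names taken from the snippet above */
extern uint8_t symbol_bits;

void count_bits(const uint32_t *words, int n)
{
 const uint8_t *table = symbol_bits_table; /* one leaq, hoisted out of the loop */
 uint8_t bits = symbol_bits; /* keep the accumulator in a register */
 int i;
 for (i = 0; i < n; i++)
 bits += table[words[i] >> 21];
 symbol_bits = bits; /* single RIP-relative store at the end */
}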