Please enlighten me, oh wise friend of mine...
All I did was download the .zip file as provided by GitHub at https://github.com/gameclosure/LZHAM, marked by the programmer as "Candidate alpha8 - still undergoing exhaustive testing".
I can't see any "debug version".
It compresses as expected and throws no debug info at all. But if you read carefully you'll see that the project itself is a compression library, and the .exe files are examples of the algorithm's capabilities, including proper compression like any other CLI packer.
In any case, it is a mistake made by someone trying to help. And being treated as stupid is not helpful at all.
Edit: Indeed it is a mistake. I am still a newbie. Next time correct me and I will thank you.
Here is the right one.
I'm not sure if it's any newer, but that looks to be a fork of the original LZHAM which can be found at http://code.google.com/p/lzham/
Either way, alpha8 is still the latest, so they're likely the same (albeit confusing to have two copies): "alpha8 - Feb. 2, 2014 - On SVN only: Project now has proper Linux cmake files. Tested and fixed misc. compiler warnings with clang v3.4 and gcc v4.8, x86 and amd64, under Ubuntu 13.10. Added code to detect the # of processor cores on Linux, fixed a crash bug in lzhamtest when the source file was unreadable, lzhamtest now defaults to detecting the # of max helper threads to use."
Quote Originally Posted by Gonzalo: "In any case, it is a mistake made by someone trying to help. And being treated as stupid is not helpful at all. Edit: Indeed it is a mistake. I am still a newbie. Next time correct me and I will thank you."
Sorry for my harshness. Sometimes I'm too impulsive and my typing hands are faster than my brain :)
Quote: "Sorry for my harshness. Sometimes I'm too impulsive and my typing hands are faster than my brain :)"
That's OK. Don't worry.
Quote: "I'm not sure if it's any newer, but that looks to be a fork of the original LZHAM which can be found at http://code.google.com/p/lzham/"
You are right, it is newer: there are a few new commits from February 2014. But they are not related to the compression engine, so I think we can still use the provided binaries.
I tested the x64 version of lzham alpha7_r1 with enwik8 and enwik9 on my 4 GHz i7-4790K processor. Results are as follows:
enwik8:
Compression:
lzhamtest_x64 c: 24,794,784 bytes in 22.9 seconds (135 seconds process time); 903 MB memory
Decompression:
lzhamtest_x64 d: 0.55 seconds - process time (0.60 seconds global time)
enwik9:
Compression:
lzhamtest_x64 c: 205,091,362 bytes in 274 seconds (1536 seconds process time); 2392 MB memory
Decompression:
lzhamtest_x64 d: 4.9 seconds - process time (5.0 seconds global time)
That's a better compression ratio than the versions shown on LTCB, with compression in less than half the time and decompression in just over half the time. Very impressive!
I've finally released v1.0 on github:
https://github.com/richgel999/lzham_codec
I know it took me ~3 years to get a real release up. But I had a lot of things going on, like working on Portal 2, DoTA 2, then shipping all the Source engine games on Linux, so I had my hands really full.
This version is not compatible with bitstreams generated with the alphas. I'm promising not to change the bitstream format for v1.x releases, except for critical bug fixes.
I would like to thank everyone here: I read these forums as a lurker before working on LZHAM, and I studied every LZ related post I could get my hands on. Especially anything related to LZ optimal parsing, which still seems like a black art. LZHAM was my way of learning how to implement optimal parsing (and you can see this if you study the progress I made in the early alphas on Google Code).
Notable changes from the prev. alphas:
- Added full OSX and iOS support (tested on various iPhone 4, 5, and 6+ models). Working on Android support next, which I need for our products.
I still need to merge over the XCode project, and enhance the cmake files to support platforms other than Linux.
Now that I am using this on real products at work I'll be able to dedicate more time to the codec.
- Reduced decompressor's memory consumption and init times by greatly slashing the total # of Huffman and arithmetic tables (from hundreds down to <10), which also increased the decompressor's throughput and speed stability on mostly uncompressible files.
This allowed for a slight reduction in the decompressor's complexity, because it now doesn't need to track the previous 2 output bytes.
- Further reduced the decompressor's up front initialization time cost, by precalculating several large encoding tables. It's still more expensive than I would like, I think due to the init memory allocs and Huffman table initializations.
- Added tuning options to allow the user to control the Huffman table update frequency. The default update interval is much less frequent than in the alphas.
- Ratio seems very slightly improved from the prev. alpha on the test files I've looked at (but this was not my primary intention). I've focused on decompression performance, lowering init times, lowering the decompressor's memory footprint, and iOS/OSX support, not ratio. Ratio may be slightly lower on some files due to the v1.0 Huffman and modeling changes.
- I've extensively profiled and documented up front when it's not worth using LZHAM vs. LZMA. On a Core i7 Windows x64, if the # of compressed bytes is < ~13,000, LZMA is typically faster to decode. On high end iOS devices, the compressed size threshold is around 1KB (and I'm not sure why there's such a large difference yet). I believe this has to do with LZHAM's more expensive init cost vs. LZMA, and maybe due to LZHAM's frequent Huffman table updating at the beginning of streams.
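For illustration only (not part of LZHAM's API; the enum and function below are invented), a minimal C++ sketch of how a caller might act on those measured break-even points:
#include <cstddef>
enum class Codec { LZMA, LZHAM };
// Pick the decoder expected to be faster for a small blob, using the rough
// thresholds measured above: ~13,000 compressed bytes on a Core i7 Windows x64
// build, roughly 1 KB on high end iOS devices.
inline Codec pick_decoder(std::size_t compressed_size, bool high_end_ios)
{
    const std::size_t threshold = high_end_ios ? 1024 : 13000;
    return (compressed_size < threshold) ? Codec::LZMA : Codec::LZHAM;
}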
Some enwik8/9 statistics with v1.0 (Windows, x64 executable, Core i7 Gulftown 3.3 GHz).
-- enwik9 (512MB dictionary):
Normal parsing: 204,325,043
Compression Time: 339.7 secs, Decompression Time: 6.62 secs (151,045,672 bytes/sec)
"Extreme" parsing (up to 4 LZ decisions per graph node, lzhamtest -x option): 202,237,199
Compression Time: 1096.62 secs, Decompression Time: 6.59 secs (151,744,359 bytes/sec)
-- enwik8 (128MB dictionary):
Normal parsing: 25,091,033
Compression Time: 27.95 secs, Decompression Time: .73 secs (136,920,702 bytes/sec)
Extreme parsing: 24,990,739
Compression Time: 72.21 secs, Decompression Time: .72 secs (137,963,425 bytes/sec)
I updated LTCB but I had to guess at memory usage. A Gulftown is 6 cores, so I guessed 1.5x memory. http://mattmahoney.net/dc/text.html#2024
How did you compile that using MinGW?
In the past I've compiled LZHAM with TDM-GCC x64 (using Codeblocks as an IDE), and it worked well. For the v1.0 release I've tested it with VS 2010 and 2013 so far.
FYI: news article in Japanese found: http://news.mynavi.jp/news/2015/01/27/076/
Is there a downloadable binary of this version 1.0 for Windows 32-bit or 64-bit?
Does someone have a working link?
I can't find the x64 version. Could someone put it online for download?
Hey Rich, very impressive!
I'm curious what your simplified method for sending literals/delta-literals is?
Are you using context bits for literals at all? Are you doing the funny LZMA rep-lit exclusion thing?
Also, if you have the permission to release any of your game test files publicly, I think that would help the community a lot.
As you correctly noted, people spend too much time on text and not enough on generic binary data. Part of the reason is there aren't good test sets for the type of binary data that we see.
On this type of binary data, LZMA usually beats PAQ (and NanoZip beats LZMA)
I've got some private collections of test data but haven't got the permission from clients to release it publicly.
I posted one here : (lzt24)
https://drive.google.com/file/d/0B-y...lhS0hrdVE/edit
but we need a lot more, and bigger.
Hello,
lzhamtest_win32.7z
lzhamtest_win64.7z
Should work...
Edit: win64 corrected and win32 added.
AiZ
Dear AiZ, the lzhamtest_win64.7z contains the 32-bit version. We cannot choose a dictionary larger than -d26.
Hi Stephan,
I've downloaded lzhamtest_win64.7z from here and it's Ok, please check your downloads.
AiZ
Quote Originally Posted by cbloom: "Hey Rich, very impressive! I'm curious what your simplified method for sending literals/delta-literals is? Are you using context bits for literals at all? Are you doing the funny LZMA rep-lit exclusion thing?"
Thanks, I'm not doing anything fancy with literals/delta literals at all now. For literals/delta literals, v1.0 just uses two plain Huffman tables with no context, because the cost (in memory, init time, and decompression throughput predictability) was more than I was comfortable with. (I bit off more than I could chew with all those tables.) So it now only uses 8 total Huffman tables, vs. the previous ~134 (!):
quasi_adaptive_huffman_data_model m_lit_table; // was [64] in the alphas, 3 MSB's each from the prev. 2 chars
quasi_adaptive_huffman_data_model m_delta_lit_table; // was [64] in the alphas, 3 MSB's each from the prev. 2 chars
quasi_adaptive_huffman_data_model m_main_table;
quasi_adaptive_huffman_data_model m_rep_len_table[2]; // index: cur_state >= CLZDecompBase::cNumLitStates
quasi_adaptive_huffman_data_model m_large_len_table[2]; // index: cur_state >= CLZDecompBase::cNumLitStates
quasi_adaptive_huffman_data_model m_dist_lsb_table;
I also reduced the total # of arithmetic tables. m_is_match_model's context no longer includes any prev. character context bits. It's now just the current LZMA state index:
adaptive_bit_model m_is_match_model[CLZDecompBase::cNumStates];
adaptive_bit_model m_is_rep_model[CLZDecompBase::cNumStates];
adaptive_bit_model m_is_rep0_model[CLZDecompBase::cNumStates];
adaptive_bit_model m_is_rep0_single_byte_model[CLZDecompBase::cNumStates];
adaptive_bit_model m_is_rep1_model[CLZDecompBase::cNumStates];
adaptive_bit_model m_is_rep2_model[CLZDecompBase::cNumStates];
I did implement the approach of letting the user configure the total # of literal/delta_literal context bits, and also the bitmasks + shift offsets to apply to the prev. X characters to compute the context. So the user could choose no context bits, like v1.0, or 3+3 bits like the alphas, or some combination of 1-8 bits from the prev. char, or a mix of the prev. 2 chars, etc. (The idea was that this would allow the user to find the optimal settings for their data, just like LZMA lets you do with its lc/lp/pb settings, which I've found to be very useful.) All this was starting to get too complex, so the KISS principle won out.
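To make that rejected option concrete, here's a hedged sketch of what such a user-configurable literal context might have looked like (the struct and names are invented for illustration, not LZHAM's actual code):
// Which bits of the previous two bytes feed the literal/delta-literal context.
struct lit_context_cfg
{
    unsigned char prev0_mask, prev1_mask;    // bitmask applied to prev byte / byte before that
    unsigned      prev0_shift, prev1_shift;  // right shift applied after masking
    unsigned      prev0_bits;                // how many bits prev0 contributes to the index
};
// With masks of 0 this degenerates to v1.0's single no-context table; with 0xE0
// masks and shifts of 5 it reproduces the alphas' 3+3 MSB scheme (64 tables).
inline unsigned lit_context(const lit_context_cfg &c, unsigned char prev0, unsigned char prev1)
{
    unsigned ctx0 = (unsigned)(prev0 & c.prev0_mask) >> c.prev0_shift;
    unsigned ctx1 = (unsigned)(prev1 & c.prev1_mask) >> c.prev1_shift;
    return ctx0 | (ctx1 << c.prev0_bits);
}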
Not sure if I understand what you mean by LZMA rep-lit exclusion (I'll reread your notes on LZMA again).
Quote Originally Posted by cbloom: "As you correctly noted, people spend too much time on text and not enough on generic binary data. Part of the reason is there aren't good test sets for the type of binary data that we see. On this type of binary data, LZMA usually beats PAQ (and NanoZip beats LZMA). I've got some private collections of test data but haven't got the permission from clients to release it publicly."
I've encountered the same thing. I recently tried several PAQ based compressors on our Unity game data and LZMA was better. My current title is a ~166 MB mix of PVRTC or ETC textures, meshes, animations, MP3 or OGG music/sound effects, and tons of misc. binary serialized object data. The best open source codec I've found for our data is LZMA (counting only ratio).
Rights are a tricky subject - I'll poke around and see what we could publicly release.
When trying LZHAM_x64 on the App testset, the following error occurs:
D:\TESTSETS>lzhamtest -m4 -e -t4 -d29 c D:\TESTSETS\TEST_App\ app.lzham
Error: Too many filenames!
I also tried to put the input directory in quotes. What am I doing wrong here?
Quote Originally Posted by Stephan Busch: "When trying LZHAM_x64 on the App testset, the following error occurs:
D:\TESTSETS>lzhamtest -m4 -e -t4 -d29 c D:\TESTSETS\TEST_App\ app.lzham
Error: Too many filenames!
I also tried to put the input directory in quotes. What am I doing wrong here?"
Try to use "a" mode.
Quote Originally Posted by rgeldreich: "Not sure if I understand what you mean by LZMA rep-lit exclusion (I'll reread your notes on LZMA again)."
Well, is your "delta_lit" table just encoding the xor of the predicted symbol with the actual symbol?
Also, do you do independent updates of the 8 Huffman tables? I see you have those parameters for the Huffman update interval, but can the encoder detect that a table hasn't changed much and only update a partial subset of the tables? Have you done any work on optimizing where the update locations are?
Quote Originally Posted by Stephan Busch: "When trying LZHAM_x64 on the App testset, the following error occurs:
D:\TESTSETS>lzhamtest -m4 -e -t4 -d29 c D:\TESTSETS\TEST_App\ app.lzham
Error: Too many filenames!
I also tried to put the input directory in quotes. What am I doing wrong here?"
Me too, I get the same "Error: Too many filenames!" but using the "a" command:
a for folders
c for files
lzham -m4 -e -t8 -d29 a D:\TEST_64\* test_64.lzham
If I don't set the output file, it generates "__comp_temp_2920560304__.tmp".
So I tried adding it to FreeArc in arc.ini, and that works to compress folders:
[External compressor:lzham]
packcmd = lzham -m4 -d29 -t8 c $$arcdatafile$$.tmp $$arcpackedfile$$.tmp
unpackcmd = lzham d $$arcpackedfile$$.tmp $$arcdatafile$$.tmp
But I don't know how to use the delta option in lzham [-afilename (what file do I need?)]; this gives the same error.
Quote Originally Posted by cbloom: "Well, is your "delta_lit" table just encoding the xor of the predicted symbol with the actual symbol? Also, do you do independent updates of the 8 Huffman tables? I see you have those parameters for the Huffman update interval, but can the encoder detect that a table hasn't changed much and only update a partial subset of the tables? Have you done any work on optimizing where the update locations are?"
Yes, for delta_lits (literals immediately following a match) it encodes the xor of the predicted byte (what I call the "mismatch byte") with the actual byte (let's call this the delta byte).
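A tiny sketch of that idea, assuming (as in LZMA) that the prediction comes from the most recent match distance; the helper names here are invented, not LZHAM's internals:
#include <cstddef>
#include <cstdint>
// Encoder: the mismatch byte is the byte the just-finished match would have
// predicted next, i.e. rep0 positions back in the output; the xor is what gets
// coded with the delta-literal Huffman table.
inline uint8_t delta_lit_symbol(const uint8_t *out, std::size_t pos, std::size_t rep0, uint8_t actual)
{
    return (uint8_t)(actual ^ out[pos - rep0]);
}
// Decoder: xor the decoded delta byte back with the mismatch byte to recover the literal.
inline uint8_t delta_lit_restore(const uint8_t *out, std::size_t pos, std::size_t rep0, uint8_t coded)
{
    return (uint8_t)(coded ^ out[pos - rep0]);
}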
The match finder tries to be smart about deciding which matches of each length to return to the parser. There are typically a large # of possible matches it could return of a given length, so this gives the finder some freedom to be picky: When it encounters matches of equal length, it'll choose the one with the lowest match bucket. If the 2 matches fall into the same bucket, it then favors the match which has the lowest # of set bits in the delta byte. (The actual logic is a little more complex, but that's the gist of it. These rules only apply to matches >= 3 bytes. len2 matches are treated specially and I think the finder isn't as picky about them right now.)
Also, when choosing between two matches of equal length and match slot, the finder favors the match with the lowest value in the least significant 4 bits of the distance, because the 4 distance LSB's are separately coded into another Huffman table. I remember finding this to be a small win on some binary files, and it was cheap to add.
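Roughly, that tie-breaking order could be sketched like this (the struct and function are invented for illustration; the real finder is more involved and treats len2 matches separately):
#include <bitset>
#include <cstddef>
struct candidate
{
    unsigned      len;    // match length (these rules apply to matches >= 3 bytes)
    unsigned      dist;   // match distance
    unsigned      slot;   // match bucket / distance slot
    unsigned char delta;  // byte after the match xor the byte the match predicts there
};
inline bool better(const candidate &a, const candidate &b)
{
    if (a.len  != b.len)  return a.len  > b.len;    // longer match wins
    if (a.slot != b.slot) return a.slot < b.slot;   // lower bucket wins
    std::size_t pa = std::bitset<8>(a.delta).count();
    std::size_t pb = std::bitset<8>(b.delta).count();
    if (pa != pb)         return pa < pb;           // fewer set bits in the delta byte
    return (a.dist & 15) < (b.dist & 15);           // then smallest 4 distance LSBs
}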
Yes, the Huff tables are all independently updated, but the update schedules are the same for all tables. The user can tweak the max # of symbols between updates, and the rate at which the update interval grows over time. The huff tables only use 16-bit sym frequency counts so that ultimately limits how long a table can go between updates. The tables are always entirely updated (big hammer approach - nothing fancy).
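A hedged sketch of what such a per-table schedule could look like (the names and growth formula are assumptions, not the codec's actual parameters):
#include <algorithm>
struct huff_update_schedule
{
    unsigned symbols_until_update;  // countdown to the next full rebuild of the code lengths
    unsigned interval;              // current gap between rebuilds, in coded symbols
    unsigned max_interval;          // user-tunable cap on that gap
    unsigned growth_pct;            // user-tunable growth rate, e.g. 150 = 1.5x after each update
};
// Call once per symbol coded with the table; returns true when the table should be rebuilt.
inline bool should_update(huff_update_schedule &s)
{
    if (s.symbols_until_update > 1) { --s.symbols_until_update; return false; }
    s.interval = std::min(s.interval * s.growth_pct / 100u, s.max_interval);
    s.symbols_until_update = s.interval;
    return true;  // the whole table is rebuilt; 16-bit frequency counts bound the gap regardless
}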
Quote Originally Posted by GOZARCK: "Me too, I get the same "Error: Too many filenames!" but using the "a" command:
a for folders
c for files
lzham -m4 -e -t8 -d29 a D:\TEST_64\* test_64.lzham
If I don't set the output file, it generates "__comp_temp_2920560304__.tmp".
So I tried adding it to FreeArc in arc.ini, and that works to compress folders:
[External compressor:lzham]
packcmd = lzham -m4 -d29 -t8 c $$arcdatafile$$.tmp $$arcpackedfile$$.tmp
unpackcmd = lzham d $$arcpackedfile$$.tmp $$arcdatafile$$.tmp
But I don't know how to use the delta option in lzham [-afilename (what file do I need?)]"
Sorry about that, lzhamtest is really just a simple low-level testbed. I've integrated LZHAM into 7zip's 7za command line tool and GUI for higher level testing, so I should probably just release that. Or I could integrate LZHAM into somebody else's open source compression tool - but which one?
Anyhow, the "a" option just compresses a bunch of files from a directory (to *temporary* compressed files), and the "c" option just compresses a single input file to an output compressed file. The "d" option decompresses one file to another.
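To make that concrete, here are example invocations pieced together from commands already posted in this thread (the file names are placeholders; the -m/-e/-t/-d settings are just the ones used above):
lzhamtest -m4 -e -t4 -d29 c infile outfile.lzham
lzhamtest d outfile.lzham restored_file
The "a" mode takes a wildcard/directory, as in GOZARCK's command above, and only writes temporary compressed files.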
I think a 7-Zip plugin would be great
Quote Originally Posted by rgeldreich: "If the 2 matches fall into the same bucket, it then favors the match which has the lowest # of set bits in the delta byte. (The actual logic is a little more complex, but that's the gist of it. These rules only apply to matches >= 3 bytes. len2 matches are treated specially and I think the finder isn't as picky about them right now.) Also, when choosing between two matches of equal length and match slot, the finder favors the match with the lowest value in the least significant 4 bits of the distance, because the 4 distance LSB's are separately coded into another Huffman table. I remember finding this to be a small win on some binary files, and it was cheap to add."
Ah, yeah. Good ideas I hadn't thought of.
There's a huge amount of offset redundancy sometimes, so the encoder can use its freedom to choose which offset.
In theory you should exclude *every* literal that comes after that same match substring, not just the one that comes after your particular offset. That's too slow so instead you can exclude the *best* literal.
In fact the encoder can see the actual literal that occurs after the match, and could choose the offset such that the literal-after-match xor with the actual literal has the fewest bits set, or is coded in the lowest cost.
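A loose sketch of that encoder-side idea, assuming a hypothetical cost callback for coding a delta byte (nothing here is from the actual LZHAM source):
#include <cstddef>
#include <cstdint>
#include <vector>
// Among equal-length candidate offsets, pick the one whose predicted
// "literal after match" makes the following delta literal cheapest to code.
inline unsigned pick_offset(const std::vector<unsigned> &offsets,
                            const uint8_t *dict, std::size_t pos, unsigned len,
                            float (*delta_cost_bits)(uint8_t))
{
    const uint8_t actual_next = dict[pos + len];            // literal that will follow the match
    unsigned best = offsets[0];
    float best_cost = 1e30f;
    for (unsigned off : offsets) {
        const uint8_t predicted = dict[pos + len - off];    // byte just past the match source
        const float cost = delta_cost_bits((uint8_t)(actual_next ^ predicted));
        if (cost < best_cost) { best_cost = cost; best = off; }
    }
    return best;
}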
@rgeldreich
"I've integrated LZHAM into 7zip's 7za command line tool" - ???
Can you please explain this?
A 7za.exe which can produce a *.7z file and uses the LZHAM compression algorithm inside it?
Best regards