Results for ppmz2 v0.7 on the Calgary Corpus :
bib                  : 111261 ->  23873 = 1.717 bpc
book1                : 768771 -> 210952 = 2.195 bpc
book2                : 610856 -> 140932 = 1.846 bpc
geo (de-interleaved) : 102404 ->  52446 = 4.097 bpc
news                 : 377109 -> 103951 = 2.205 bpc
obj1                 :  21504 ->   9841 = 3.661 bpc
obj2                 : 246814 ->  69137 = 2.241 bpc
paper1               :  53161 ->  14711 = 2.214 bpc
paper2               :  82199 ->  22449 = 2.185 bpc
pic (transposed)     : 513216 ->  30814 = 0.480 bpc
progc                :  39611 ->  11178 = 2.258 bpc
progl                :  71646 ->  12938 = 1.445 bpc
progp                :  49379 ->   8948 = 1.450 bpc
trans                :  93695 ->  14224 = 1.214 bpc
total                : 726400
average              : 2.086 bpc
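For readers who want to check the figures: the bpc column is consistent with 8 * compressed bytes / original bytes, and the average appears to be the plain (unweighted) mean of the per-file bpc values. The short Python sketch below recomputes both from the sizes listed above. The function and variable names are mine, chosen for illustration; none of this is part of PPMZ.

```python
# Recompute bits-per-character (bpc) from the byte counts reported above.
# Names here (bits_per_char, SIZES) are illustrative, not part of PPMZ.

def bits_per_char(original_bytes, compressed_bytes):
    """Output bits spent per input byte."""
    return 8.0 * compressed_bytes / original_bytes

# (original size, compressed size) in bytes, copied from the results list.
SIZES = {
    "bib": (111261, 23873),
    "book1": (768771, 210952),
    "book2": (610856, 140932),
    "geo": (102404, 52446),
    "news": (377109, 103951),
    "obj1": (21504, 9841),
    "obj2": (246814, 69137),
    "paper1": (53161, 14711),
    "paper2": (82199, 22449),
    "pic": (513216, 30814),
    "progc": (39611, 11178),
    "progl": (71646, 12938),
    "progp": (49379, 8948),
    "trans": (93695, 14224),
}

bpc = {name: bits_per_char(orig, comp) for name, (orig, comp) in SIZES.items()}

# Unweighted mean over files: every file counts equally, whatever its size.
average_bpc = sum(bpc.values()) / len(bpc)

for name, value in bpc.items():
    print(f"{name:7s} {value:.3f} bpc")
print(f"average {average_bpc:.3f} bpc")
```

Because the mean is unweighted, small files such as obj1 pull the average just as hard as book1 does, which is worth keeping in mind when comparing coders by this number.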
On many of the "medium text-like" files (e.g. progl, trans, news) we even beat the best switching schemes of Volf!! This is at first quite an astounding result, because our computational complexity is orders of magnitude lower. However, we should perhaps not be surprised. Volf's best switching/mixing schemes mix between CTW and something like PPM* or LZ77. In both cases, he is simply trying to combine the good low-order performance of CTW with a good high-order coder. We, on the other hand, get such a hybrid automatically by using infinite-length PPM and LOE. If you like, the PPMDet and LOE scheme can be seen as a way of using the local character to switch algorithms (in my case, between PPMDet and finite-order PPMZ), instead of weighting all possible switches by their performance.
Note that on the very small files (obj1, progc) we're worse than the old ppmz, but we don't really care about that.
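To make the switch-versus-weight distinction concrete, here is a toy Python sketch. It is not PPMZ's LOE and not Volf's scheme: the two distributions are invented, and the confidence rule (pick the model whose most probable symbol has the highest probability) is only an assumed stand-in for a local decision. All names are hypothetical.

```python
import math

def switch_predict(dists):
    """Switching: use exactly one model, chosen by a local confidence.
    Assumed rule for illustration: highest probability of its top symbol."""
    return max(dists, key=lambda d: max(d.values()))

def mix_predict(dists, weights):
    """Weighting: blend all models, with weights that would normally
    come from each model's past coding performance."""
    total = sum(weights)
    symbols = set().union(*dists)
    return {
        s: sum(w * d.get(s, 0.0) for d, w in zip(dists, weights)) / total
        for s in symbols
    }

def code_length_bits(dist, symbol):
    """Ideal cost in bits of coding `symbol` under `dist`."""
    return -math.log2(dist[symbol])

# Invented predictions for one position in the input.
deterministic = {"e": 0.95, "a": 0.05}            # long, nearly deterministic context
finite_order = {"e": 0.40, "a": 0.35, "t": 0.25}  # short finite-order context

chosen = switch_predict([deterministic, finite_order])
mixed = mix_predict([deterministic, finite_order], [1.0, 1.0])

print("switch cost for 'e':", code_length_bits(chosen, "e"))
print("mix cost for 'e':   ", code_length_bits(mixed, "e"))
```

In this toy case the switch commits fully to the confident model, so it pays less when that model is right and much more when it is wrong; the mixture hedges between the two. The switch also only evaluates a selection rule per symbol rather than maintaining and combining weights for every model.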
Note that this means I can now send the corpus self-extracting (separate files, with a totally generic 1->1 coder) in about 759,000 bytes. Hey Leonid - you owe me fifty bucks :^)
Update 2004 - Malcolm's got a new semi-commercial company that's making a nice version of RKive : M Software
This code is covered by the Bloom Public License (you need to read it if and only if you're an industrialist).
To compile PPMZ you must download crblib, and create a directory in your INCLUDE path called crbinc, so that crbinc/inc.h, etc. can be found by PPMZ.
release notes
PPMZ v9.0 report of results
PPMZ v8.1 report of results
PPMZ v7.6 report of results
PPMZ v7.3 algorithm description
PPMZ v7.3 report of results
See release notes below for more info.
A report on the files of the Calgary Corpus
Of most immediate interest is Bill Teahan's PPMD+, which was the starting point for PPMZ.
The fact that something better was possible was suggested by Cleary and Teahan's paper on escape estimation.
PPMdet was inspired by the PPM* coder by Cleary, Witten, and Teahan (PPM* paper) and the LZ77 context index structures of Peter Fenwick.
Most recently, a weighting method has been inspired by the work of the CTW algorithm.
As always, a good summary of the original PPM algorithms can be found in:
T. Bell, J. Cleary, and I. Witten, Text Compression, Prentice Hall, 1990
Charles Bloom / cb at my domain Send Me Email