This is a small project to implement Philip Gage's Byte Pair Encoding compression for the purpose of learning C. In particular, it's written in C89 for fun as that's what would've been contemporary (though I really miss some nice C99 features). Disclaimer: To start the project, I used his provided C code as a reference, but decided early on to rewrite without looking at it. So there will be similarities to his original code in certain design decisions.
Here are details I'd maybe improve:
- I tested the programs on some small text and binary files (1 MB), but some bugs still might remain. Correctness is the most important here.
- A voluminous amount of debug info is generated which can't be turned off. It was helpful for debugging by writing it to a log file. My ideas were a macro to disable printf, or #ifdef sections to only print with a DEBUG flag defined.
- Compress should work with any input file, but expand doesn't do error checking on a compressed file. I'm not concerned with extreme cases like printf failing or running out of disk space. I didn't dynamically allocate anything so those are not an issue.
- I don't find my code easy to read, but I also don't read C much at all. Some functions and blocks may need to be re-organized. (I use Allman indentation style even though almost all C code online is in K&R style. In fact I couldn't find any large C codebases that used Allman style, but I assume it is reasonable.)
Implementation details: I picked BPE because I thought it was a simple compression algorithm. It is in theory, but fundamentally limited by the byte as a unit and number of unused byte values available. If a block happens to start with many different byte values, then the block will be short with little compression. The pair table RLE is fiddly (I didn't implement it exactly) and limited in size, while a Huffman code can be as large as needed. Encoding can't be fast either, because my program took dozens of passes over each block, while I believe Huffman coding can be done in one or two passes over a larger block.
compress.c
/* Byte Pair Encoding compression
Based on idea and code from Philip Gage
Pseudocode:
While not end of file
Read next block of data into buffer and
enter all pairs in hash table with counts of their occurrence
While compression possible
Find most frequent byte pair
Replace pair with an unused byte
If substitution deletes a pair from buffer,
decrease its count in the hash table
If substitution adds a new pair to the buffer,
increase its count in the hash table
Add pair to pair table
End while
Write pair table and packed data
End while
*/
#include <stdio.h>
#include <assert.h>
#define BLOCKSIZE 5000 /* Maximum block size */
#define MAXCHARS 200 /* Charset per block (leave some unused) */
#define MINPAIRS 3 /* Min pairs needed for compress */
typedef unsigned char uchar;
/* file-wide globals per-block */
uchar buffer[BLOCKSIZE];
uchar left[256], right[256]; /* pair table */
uchar count[256][256]; /* pair counts */
int size;
/* read block from file into pair count */
/* return true if not done reading file */
int readblock(FILE* infile)
{
int i, j, c;
int used = 0;
printf("*** READ BLOCK ***\n");
/* reset counts and pair table */
for (i = 0; i < 256; ++i)
{
for (j = 0; j < 256; ++j)
count[i][j] = 0;
left[i] = i;
right[i] = 0;
}
size = 0; /* block size */
/* C I/O, get one char at a time */
/* stopping at EOF, BLOCKSIZE limit, or MAXCHARS limit */
while (size < BLOCKSIZE && used < MAXCHARS)
{
/* only read now instead of in while condition */
c = getc(infile);
if (c == EOF) break;
if (size > 0)
{
uchar lastc = buffer[size-1];
/* count pairs without overflow */
if (count[lastc][c] < 255)
++count[lastc][c];
}
/* increase used count if new, mark in pair table as used */
if (right[c] == 0)
{
right[c] = 1;
++used;
}
buffer[size++] = c; /* push c to buffer */
}
printf("size: %d used: %d\n", size, used);
printf("buffer:\n");
for (i=0; i<size; ++i)
printf("%02x ", buffer[i]);
printf("\n(non-zero) count table:\n");
for (i=0; i<256; ++i)
for (j=0; j<256; ++j)
if (count[i][j])
printf("%02x%02x:%02x\t", i, j, count[i][j]);
printf("\n");
return (c != EOF);
}
/* for block, write pair and packed data */
void compress()
{
int pass, i, j, y;
printf("*** COMPRESS BLOCK ***\n");
/* while compression possible:
pick pairs until no unused bytes or no good pairs */
for (pass = 1; ; ++pass)
{
int r = 0, w = 0; /* read and write index */
uchar bestcount = 0;
uchar bestleft = 0, bestright = 0;
printf("COMPRESSION PASS %d\n", pass);
for (i=0; i<256; ++i)
{
for (j=0; j<256; ++j)
{
/* record best pair and count */
if (count[i][j] > bestcount)
{
bestcount = count[i][j];
bestleft = i; bestright = j;
}
}
}
printf("best pair %02x%02x:%d\n", bestleft, bestright, bestcount);
if (bestcount < MINPAIRS)
break;
/* find unused byte to use */
for (y=255; y>=0; --y)
if (left[y] == y && right[y] == 0)
break;
if (y < 0) break; /* no more unused */
printf("unused byte: %02x\n", y);
/* replace pairs with unused byte in-place in buffer */
while (r < size)
{
/* match best pair */
if (r+1 < size &&
buffer[r] == bestleft && buffer[r+1] == bestright)
{
buffer[w++] = y; /* write new byte */
r += 2; /* move read index past pair */
}
else
{
/* copy buffer[r] to buffer[w], increment indexes */
buffer[w++] = buffer[r++];
}
}
size = w; /* adjust written buffer size */
/* TODO: update counts during writing instead */
/* recreate count table */
for (i = 0; i < 256; ++i)
for (j = 0; j < 256; ++j)
count[i][j] = 0;
for (i=0; i<size; ++i)
{
if (i+1 < size)
{
uchar c = buffer[i];
uchar d = buffer[i+1];
if (count[c][d] < 255)
++count[c][d];
}
}
printf("new buffer(%d): ", size);
for (i=0; i<size; ++i)
printf("%02x ", buffer[i] );
printf("\n");
/* add pair in pair table */
left[y] = bestleft;
right[y] = bestright;
printf("used pair table:\n");
for (i=0; i<256; ++i)
{
if (i != left[i])
printf("%02x:%02x%02x\n", i, left[i], right[i]);
}
printf("\n");
}
printf("\n");
}
/* write pair table and compressed data */
void writeblock(FILE* outfile)
{
int c = 0;
signed char count = 0;
printf("*** WRITE BLOCK ***\n");
while (c < 256)
{
printf("c: %02x\n",c);
count = 0;
/* run of non-pairs */
if (c == left[c])
{
while (c == left[c] && c < 256 && count > -128)
{
++c;
--count;
}
/* output count as negative byte */
assert(count < 0);
putc(count, outfile);
printf("count:%d\n", count);
/* output single pair if not end of table */
if (c < 256)
{
putc(left[c], outfile);
putc(right[c], outfile);
printf("single pair %02x%02x\n", left[c], right[c]);
++c;
}
}
else
{
/* run of pairs */
int b = c; /* index of start of run */
while (c != left[c] && c < 256 && count < 127)
{
++c;
++count;
}
/* output count */
assert(count > 0);
putc(count, outfile);
printf("count:%d\n", count);
for (; b < c; ++b)
{
putc(left[b], outfile);
putc(right[b], outfile);
printf("%02x%02x\n", left[b], right[b]);
}
}
}
/* write compressed buffer size */
assert(size <= 0xFFFF);
putc(size >> 8, outfile);
putc(size & 0xFF, outfile);
printf("compressed size: %d (%04x)\n", size, size);
/* write compressed buffer */
fwrite(buffer, 1, size, outfile);
printf("write buffer\n");
}
int main(int argc, char* argv[])
{
int notdone;
FILE* infile, * outfile;
if (argc != 3)
{
printf("Usage: compress infile outfile\n");
return -1;
}
infile = fopen(argv[1], "r");
outfile = fopen(argv[2], "w");
if (infile == NULL)
{
printf("bad infile\n");
return -1;
}
if (outfile == NULL)
{
printf("bad outfile\n");
return -1;
}
notdone = 1;
while (notdone)
{
notdone = readblock(infile);
compress();
writeblock(outfile);
}
fclose(infile);
fclose(outfile);
return 0;
}
expand.c
/* BPE expand routine
pseudocode:
While not end of file
Read pair table from input
While more data in block
If stack empty, read byte from input
Else pop byte from stack
If byte in table, push pair on stack
Else write byte to output
End while
End while
*/
#include <stdio.h>
#include <assert.h>
unsigned char left[256], right[256];
unsigned char stack[256]; /* overflow? */
/* expand block */
/* return true if there are more blocks (doesn't end in EOF) */
int expand(FILE* infile, FILE* outfile)
{
int c, i, sp = 0;
signed char count;
int b = 0, usize, lsize, size;
/* reset pair table to defaults */
for (i = 0; i < 256; ++i)
{
left[i] = i;
right[i] = 0;
}
/* read compressed pair table */
while(b < 256) /* b = current table index */
{
c = getc(infile);
if (c == EOF) return 0; /* last block */
count = (signed char)c;
printf("b: %d Count: %d\n", b, count);
assert(count != 0);
/* negative count: skip forward by |count| then read a pair */
if (count < 0)
{
b += -count;
/* if not end table, read single pair */
if (b < 256)
{
/* doesn't handle if file unexpectedly ends */
left[b] = getc(infile);
right[b] = getc(infile);
printf("Read single pair %02x%02x\n", left[b], right[b]);
++b;
}
}
else /* positive count: read |count| pairs */
{
int b_end = b + count;
for (; b < b_end; ++b)
{
left[b] = getc(infile);
right[b] = getc(infile);
printf("Read pair %02x%02x\n", left[b], right[b]);
}
}
}
assert(b == 256); /* counts valid */
printf("Pair table:\n");
for (b = 0; b < 256; ++b)
{
printf("%02x:%02x%02x\t", b, left[b], right[b]);
}
printf("\n");
/* read compressed buffer size */
usize = getc(infile);
lsize = getc(infile);
size = (usize << 8) + lsize;
printf("size: %d(%02x%02x)\n", size, usize, lsize);
/* write output, pushing pairs to stack */
i = 0;
while (i < size || sp) /* more to read or stack non-empty */
{
if (sp == 0) /* stack empty */
{
c = getc(infile); /* read byte */
printf("read byte: %02x\n", c);
++i;
}
else
{
c = stack[--sp]; /* pop byte */
printf("pop byte: %02x\n", c);
}
if (c != left[c]) /* pair in table */
{
/* push pair */
stack[sp++] = right[c];
stack[sp++] = left[c];
printf("push pair %02x%02x\n", left[c], right[c]);
}
else /* pair not in table */
{
putc(c, outfile); /* write literal byte */
printf("write byte %02x\n", c);
}
}
return 1; /* try another block */
}
int main(int argc, char* argv[])
{
FILE* infile, * outfile;
int notdone;
if (argc != 3)
{
printf("Usage: expand infile outfile\n");
return -1;
}
infile = fopen(argv[1], "r");
outfile = fopen(argv[2], "w");
if (infile == NULL)
{
printf("bad infile\n");
return -1;
}
if (outfile == NULL)
{
printf("bad outfile\n");
return -1;
}
/* process blocks */
notdone = 1;
while (notdone)
notdone = expand(infile, outfile);
fclose(infile);
fclose(outfile);
return 0;
}
For fun, I made a simple Makefile to show how to compile.
all: compress expand
compress: compress.c
gcc -O2 -Wall -Wextra -std=c89 -pedantic -o compress compress.c
expand: expand.c
gcc -O2 -Wall -Wextra -std=c89 -pedantic -o expand expand.c
2 Answers 2
- Recommend putting filename inside the source code file(s).
- The link to the online doco should appear in the source code, too.
10,000m
Code looks clean at first glance.
Whitespace and structure look promising.
And, there is pseudocode in the source code, too!
Sticking to C89 standard is alright if that's something you want to try. When she was 3, my daughter would always hop with her right leg when crossing our driveway. Not everything has to make sense.
5,000m
Thirty-five lines of very similar code for main()
in both files.
There are many ways to write this functionality.
- Perhaps a single
main()
(~35 LoC) should be in its own source file that is compiled & linked with each of the two "operation" blocks of code each left in their source files. - Perhaps all functionality should be moved into one file so there is a single source file for entire project.
- Perhaps there's no need for a makefile. Perhaps the compile command line(s) could appear in the source code, too. Copy/paste/execute is easy-as-pie...
Using the preprocessor and a token-or-two, the single source file could still be used to generate two different executables, each suitable for its purpose.
While looking at main()
, recognise that the functionality could be achieved (perhaps better!) by designing the project to act as a 'filter':
cat file0.txt | compress | expand > file1.txt // perform both operations
// or...
expand < file2.bpe | grep "some text" // perform a "lookup" of something
This presumes that the "print debugging" statements, currently going to stdout
, have been sanitised/neutralised.
(Note, too, the use of ".bpe"
as the file extension.
Contents of a compressed file are mostly a meaningless collection of bytes until they are "expanded".)
There's a possible net reduction of 60+ lines of code. Unnecessary code (and its comments) provides more breeding grounds for unnecessary bugs.
2,000m
It's the coder's choice, therefore not wrong, but that "Allman style" makes it difficult for an experienced reader to scan. My experience is that placing too much emphasis on syntax and layout enables a sloppiness in "thinking about things". C is not the language for one who overly-relies on the compiler's syntax checking and the code's layout to "get it right". The most invisible bugs are those that hide in the prettiest code.
Careful review of the posted code shows that, in spite of good intentions, "Allman style" has not been applied in all cases where it should have been. Each individual is fully entitled to have their own opinion! I find this "spacious" style distracting, contributing to code bloat (ie. extended vertical space is detrimental to 'readability'.)
And, those C89 comments trigger both nostaligia and gratitude that //
can be used instead (with modern compilers).
Eliminating "Allman style" braces shrinks the code by almost 100 lines. These functions are made from lots of lines of code. Having fewer lines of significant code visible and needing to scroll is disadvantageous.
1,000m
With greater density, it becomes apparent: The original author strove for light-weight and fast. "Hashing" allowed that code to use a 4Kb table. This code uses a 64Kb table. An unfortunate consequence is that repeatedly populating and searching the OP's table requires 16x the number of full table traversals (on many, many iterations.) Removing that hashing functionality is, imo, a step backwards.
10m
Something is worrisome...
Removing the many print debug statements in both source files,
it becomes apparent that the OP has only superficially revised the copyrighted source code of the linked article.
Some loops have been bent (still functioning), but most of the variable names and the logic are exactly the same!!
The OP linked to the online version of an article that was published in the "The C Users Journal" in Feb. 1994. I have a PDF scanned copy of the article (that starts on page 23 of that issue.) That hard-copy version was timely! I used "my" version of Gage's original code in a project for a client in 1995-6... thus my interest in this.
The OP writes:
To start the project, I used his provided C code as a reference, but decided early on to rewrite without looking at it. So there will be similarities to his original code in certain design decisions.
// Phil Gage's code (1994):
buffer[size++] = c;
/* Use rightcode to flag data chars found */
if (!rightcode[c]) {
rightcode[c] = 1;
used++;
}
vs.
// OP's code (2024):
/* increase used count if new, mark in pair table as used */
if (right[c] == 0)
{
right[c] = 1;
++used;
}
buffer[size++] = c; /* push c to buffer */
It is a decidedly bad practice to overstate one's achievements!
Life (and the workplace) will smack you down to (or even below!) the deserved level when found-out.
NEVER make unjustified claims that overstate your work.
Mr. Gage wrote, 30 years ago:
For simplicity, error handling is minimal.
To write good code that deals with potentially mangled source data or users misusing your program, plugging the gaps by robustly handling errors is far more important than simply "re-arranging the furniture".
A "gift"
... voluminous amount of debug info is generated which can't be turned off.
Instead of re-publishing, here, a working function that allows code to tersely use varying degrees of verbosity for its many "print debugging" statements,
here is a link to another program for the OP to study.
Each debugging/tracing printf()
(or fprintf( stderr, ...)
) can have its own "reporting threshold" level.
Suggested exercises
Now that "byte pair encoding" has been thoroughly examined, fulfill the claim "without looking at it", by starting from scratch and writing your own source code that achieves Byte Pair Encoding.
Research the theory of "Huffman" encoding and set-about writing your own source code implementation. The theory and scheme are well documented in many places on the web. Implement both sides of a working Huffman program from scratch.
-
\$\begingroup\$ Thank you for your review, as you are clearly familiar with this topic. I agree the file handling can be replaced by just using stdin and stdout. I have been messing around with more shell tools recently, including jq which really does this filter concept well, and the shell pipe can handle some I/O. I can then put logging in stderr (sometimes I wish there was stdlog for this purpose). The single source file is interesting but I've tried to avoid using preprocessor tricks. In a revised version I'm working on, I put if (DEBUG) blocks wherever there's debug printing, which is okayish. \$\endgroup\$qwr– qwr2024年08月17日 04:30:10 +00:00Commented Aug 17, 2024 at 4:30
-
1\$\begingroup\$ @qwr Yes, and I am not suggesting any plagiarism. And, that there are only so many ways to skin a cat... Thirty years ago, I adapted Gage's code for my purposes and was (silently) grateful for all of it. In that adaptation I renamed the two primary functions
encode()
/decode()
in keeping with Byte Pair ENCODING... (Pretty sure I also combinedleftcode[]
andrightcode[]
intocode[256][2]
...) However, because my version was shipped to a client, paramount was adding error detection/recovery for corrupt data (that did not make the edit for the magazine article.) Coding is details. \$\endgroup\$user272752– user2727522024年08月17日 07:17:41 +00:00Commented Aug 17, 2024 at 7:17 -
1\$\begingroup\$ @qwr "preprocessor tricks" One man's trick is another's technique... If one did "factor out"
main()
to its own source file (~60 LoC ==> ~30 LoC), one would need, too, one (or two) tiny app header files... OTOH, "pooling" into a single file (allowingstatic
qualifiers, too) could mean that >>everything<< is in one package... I like to use tricks that consolidate and reduce clutter. (But, maybe that's just me...) Cheers!:-)
\$\endgroup\$user272752– user2727522024年08月17日 07:34:02 +00:00Commented Aug 17, 2024 at 7:34 -
1\$\begingroup\$ Again, Welcome to C. You can decide. With thanks to Gage for publishing his code 30 years ago, I suggest you consider putting this project on the shelf (for a while), and go on to coding a "binary search tree" or whatever to expand your familiarity with the language and its libraries. With its admitted flaws, there's this challenge (for instance). Careful about thinking that one project gives a broad overview of a language. Both frustration and delight await when you tackle many, diverse challenges. Cheers! \$\endgroup\$user272752– user2727522024年08月30日 22:04:36 +00:00Commented Aug 30, 2024 at 22:04
-
1\$\begingroup\$ Again, I acknowledge your liberty to use this (1994) public domain code in any way that interests you (in 2024). I again suggest you press pause on this project and broaden the scope of applications you address in order to learn how to use C based 'approaches' and idioms when solving coding problems. (Back in 95-96 my task involved exchanging data (9600 baud) with a handheld scanner with VERY limited RAM. It is a mistake to assess things from the past based on modern norms.) Again, best wishes on your journey... Maybe there's another "Quake Inverse Square" algorithm waiting to be uncovered... \$\endgroup\$user272752– user2727522024年09月01日 01:27:12 +00:00Commented Sep 1, 2024 at 1:27
Move away from globals
A key to efficient code writing is code re-use. Plan for the future.
Rather than global, pass in a pointer to this data. This allows easier integration of these routines into future code.
typdef struct {
uchar buffer[BLOCKSIZE];
uchar left[256], right[256]; /* pair table */
uchar count[256][256]; /* pair counts */
int size;
} pair_data;
// int expand(FILE* infile, FILE* outfile)
int expand(pair_data *pd, FILE* infile, FILE* outfile)
Write errors to stderr
// printf("bad outfile\n");
fprintf(stderr, "bad outfile\n");
Use sentences and sentence case
fprintf(stderr, "Bad outfile.\n");
IMO, show the bad filename and with sentinels.
fprintf(stderr, "Bad outfile \"%s\".\n", argv[2]);
Watch for errors
usize = getc(infile); lsize = getc(infile);
and the return value of many other getc()
calls do not check if EOF
was returned. Error resilient code would do so.
Code assumes int
is wider than 16-bit
Portable C89 code would handle 16-bit int/unsigned
. This impacts code like size = (usize << 8) + lsize;
, assert(size <= 0xFFFF);
,
putc(size >> 8, outfile);
, ...
Small Stuff
Avoid printing signed types with unsigned specifiers
Change format, change type or use casts.
// int b = 0, usize, lsize, size;
unsigned b = 0, usize, lsize, size;
Put your toys away
The following code does not fclose(outfile)
if infile
is NULL
. That makes little difference in main()
here, yet it is good practice to fclose()
whatever is successfully fopen()
.
infile = fopen(argv[1], "r");
if (infile == NULL) {
printf("bad infile\n");
return -1;
}
outfile = fopen(argv[2], "w");
if (outfile == NULL) {
fclose(infile);
printf("bad outfile\n");
return -1;
}
Avoid naked magic numbers
Rather than 256, use UCHAR_MAX + 1
or the like.
#define UCHAR_N (UCHAR_MAX + 1)
Avoid bracketless bodies
This is a style issue, but, IMO, a preferred one.
// while (notdone)
// notdone = expand(infile, outfile);
while (notdone) {
notdone = expand(infile, outfile);
}
Questionable type choice
Code uses signed char
instead of int
for some reason. int
is certainly more performant.
// signed char count = 0;
int count = 0;
Pedantic: non-2's complement
Following does not well tear apart size
.
// Fails with non-2's complement
putc(size >> 8, outfile);
putc(size & 0xFF, outfile);
Instead use an unsigned type for the size:
putc(usize >> 8, outfile);
putc(usize & 0xFF, outfile);
-
\$\begingroup\$ Thank you for the answer. I'm going through it. I don't see the difference with the pointer vs globals for this small program. I can see it mattering for bigger programs. \$\endgroup\$qwr– qwr2024年08月13日 22:34:21 +00:00Commented Aug 13, 2024 at 22:34
-
\$\begingroup\$ I used signed char for count because I want to interpret the value I got from getc as a signed byte. To my knowledge, if getc doesn't fail then it will be an unsigned value that fits in an unsigned char. \$\endgroup\$qwr– qwr2024年08月14日 04:42:43 +00:00Commented Aug 14, 2024 at 4:42
-
\$\begingroup\$ What is the issue with
size = (usize << 8) + lsize;
? is it the need to have unsigned int if it's 16 bit? \$\endgroup\$qwr– qwr2024年08月14日 04:44:14 +00:00Commented Aug 14, 2024 at 4:44 -
\$\begingroup\$ @qwr Yes. Shifting a 1 into the sign bit is UB. Best to use various unsigned types for sizing anyways. \$\endgroup\$chux– chux2024年08月14日 05:08:03 +00:00Commented Aug 14, 2024 at 5:08
-
1\$\begingroup\$ @qwr: pointers vs global matters on small disposable programs the most, because those are 100% the ones that end up a mission-critical application deployed to production. \$\endgroup\$Bryan B– Bryan B2024年08月15日 22:51:26 +00:00Commented Aug 15, 2024 at 22:51
Explore related questions
See similar questions with these tags.
256
instead of through a macro? \$\endgroup\$UCHAR_MAX +1
instead of 256. \$\endgroup\$int/unsigned
was 16 bit, do you want to handle that potentiality? \$\endgroup\$