Byte Pair Encoding Compression in C89

Question 1

This is a small project to implement Philip Gage's Byte Pair Encoding compression for the purpose of learning C. In particular, it's written in C89 for fun as that's what would've been contemporary (though I really miss some nice C99 features). Disclaimer: To start the project, I used his provided C code as a reference, but decided early on to rewrite without looking at it. So there will be similarities to his original code in certain design decisions.

Here are details I'd maybe improve:

I tested the programs on some small text and binary files (1 MB), but some bugs still might remain. Correctness is the most important here.
A voluminous amount of debug info is generated which can't be turned off. It was helpful for debugging by writing it to a log file. My ideas were a macro to disable printf, or #ifdef sections to only print with a DEBUG flag defined.
Compress should work with any input file, but expand doesn't do error checking on a compressed file. I'm not concerned with extreme cases like printf failing or running out of disk space. I didn't dynamically allocate anything so those are not an issue.
I don't find my code easy to read, but I also don't read C much at all. Some functions and blocks may need to be re-organized. (I use Allman indentation style even though almost all C code online is in K&R style. In fact I couldn't find any large C codebases that used Allman style, but I assume it is reasonable.)

Implementation details: I picked BPE because I thought it was a simple compression algorithm. It is in theory, but fundamentally limited by the byte as a unit and number of unused byte values available. If a block happens to start with many different byte values, then the block will be short with little compression. The pair table RLE is fiddly (I didn't implement it exactly) and limited in size, while a Huffman code can be as large as needed. Encoding can't be fast either, because my program took dozens of passes over each block, while I believe Huffman coding can be done in one or two passes over a larger block.

compress.c

/* Byte Pair Encoding compression 
 Based on idea and code from Philip Gage 
 Pseudocode: 
 While not end of file
 Read next block of data into buffer and
 enter all pairs in hash table with counts of their occurrence
 While compression possible
 Find most frequent byte pair
 Replace pair with an unused byte
 If substitution deletes a pair from buffer,
 decrease its count in the hash table
 If substitution adds a new pair to the buffer,
 increase its count in the hash table
 Add pair to pair table
 End while
 Write pair table and packed data
 End while
*/
#include <stdio.h>
#include <assert.h>
#define BLOCKSIZE 5000 /* Maximum block size */
#define MAXCHARS 200 /* Charset per block (leave some unused) */
#define MINPAIRS 3 /* Min pairs needed for compress */
typedef unsigned char uchar;
/* file-wide globals per-block */
uchar buffer[BLOCKSIZE];
uchar left[256], right[256]; /* pair table */
uchar count[256][256]; /* pair counts */ 
int size; 
/* read block from file into pair count */
/* return true if not done reading file */
int readblock(FILE* infile)
{
 int i, j, c;
 int used = 0;
 printf("*** READ BLOCK ***\n");
 /* reset counts and pair table */
 for (i = 0; i < 256; ++i) 
 {
 for (j = 0; j < 256; ++j) 
 count[i][j] = 0;
 
 left[i] = i;
 right[i] = 0;
 }
 size = 0; /* block size */
 /* C I/O, get one char at a time */
 /* stopping at EOF, BLOCKSIZE limit, or MAXCHARS limit */
 while (size < BLOCKSIZE && used < MAXCHARS) 
 { 
 /* only read now instead of in while condition */
 c = getc(infile);
 if (c == EOF) break;
 if (size > 0) 
 {
 uchar lastc = buffer[size-1];
 /* count pairs without overflow */
 if (count[lastc][c] < 255) 
 ++count[lastc][c];
 }
 /* increase used count if new, mark in pair table as used */
 if (right[c] == 0) 
 {
 right[c] = 1;
 ++used;
 }
 buffer[size++] = c; /* push c to buffer */
 }
 printf("size: %d used: %d\n", size, used);
 printf("buffer:\n");
 for (i=0; i<size; ++i)
 printf("%02x ", buffer[i]);
 printf("\n(non-zero) count table:\n");
 for (i=0; i<256; ++i) 
 for (j=0; j<256; ++j) 
 if (count[i][j]) 
 printf("%02x%02x:%02x\t", i, j, count[i][j]);
 
 printf("\n");
 return (c != EOF);
}
/* for block, write pair and packed data */
void compress() 
{
 int pass, i, j, y;
 printf("*** COMPRESS BLOCK ***\n");
 /* while compression possible:
 pick pairs until no unused bytes or no good pairs */
 for (pass = 1; ; ++pass) 
 {
 int r = 0, w = 0; /* read and write index */
 uchar bestcount = 0;
 uchar bestleft = 0, bestright = 0;
 printf("COMPRESSION PASS %d\n", pass);
 
 for (i=0; i<256; ++i) 
 {
 for (j=0; j<256; ++j) 
 {
 /* record best pair and count */
 if (count[i][j] > bestcount) 
 {
 bestcount = count[i][j];
 bestleft = i; bestright = j; 
 }
 }
 }
 printf("best pair %02x%02x:%d\n", bestleft, bestright, bestcount);
 if (bestcount < MINPAIRS) 
 break;
 /* find unused byte to use */
 for (y=255; y>=0; --y)
 if (left[y] == y && right[y] == 0) 
 break;
 
 if (y < 0) break; /* no more unused */
 printf("unused byte: %02x\n", y);
 
 
 /* replace pairs with unused byte in-place in buffer */
 while (r < size) 
 {
 /* match best pair */
 if (r+1 < size && 
 buffer[r] == bestleft && buffer[r+1] == bestright) 
 {
 buffer[w++] = y; /* write new byte */
 r += 2; /* move read index past pair */
 } 
 else 
 {
 /* copy buffer[r] to buffer[w], increment indexes */
 buffer[w++] = buffer[r++]; 
 }
 }
 size = w; /* adjust written buffer size */
 
 /* TODO: update counts during writing instead */
 /* recreate count table */
 for (i = 0; i < 256; ++i)
 for (j = 0; j < 256; ++j)
 count[i][j] = 0;
 
 for (i=0; i<size; ++i) 
 {
 if (i+1 < size) 
 {
 uchar c = buffer[i];
 uchar d = buffer[i+1];
 if (count[c][d] < 255) 
 ++count[c][d];
 }
 }
 printf("new buffer(%d): ", size);
 for (i=0; i<size; ++i) 
 printf("%02x ", buffer[i] );
 
 printf("\n");
 /* add pair in pair table */
 left[y] = bestleft;
 right[y] = bestright;
 
 printf("used pair table:\n");
 for (i=0; i<256; ++i) 
 {
 if (i != left[i]) 
 printf("%02x:%02x%02x\n", i, left[i], right[i]);
 }
 printf("\n");
 }
 printf("\n");
}
/* write pair table and compressed data */
void writeblock(FILE* outfile) 
{
 int c = 0;
 signed char count = 0;
 printf("*** WRITE BLOCK ***\n");
 while (c < 256) 
 {
 printf("c: %02x\n",c);
 
 count = 0;
 /* run of non-pairs */
 if (c == left[c]) 
 {
 while (c == left[c] && c < 256 && count > -128) 
 {
 ++c;
 --count;
 }
 /* output count as negative byte */
 assert(count < 0);
 putc(count, outfile);
 printf("count:%d\n", count);
 /* output single pair if not end of table */
 if (c < 256) 
 {
 putc(left[c], outfile);
 putc(right[c], outfile); 
 printf("single pair %02x%02x\n", left[c], right[c]);
 ++c;
 }
 
 } 
 else 
 {
 /* run of pairs */
 int b = c; /* index of start of run */
 while (c != left[c] && c < 256 && count < 127) 
 {
 ++c;
 ++count;
 }
 /* output count */
 assert(count > 0);
 putc(count, outfile);
 printf("count:%d\n", count);
 for (; b < c; ++b) 
 {
 putc(left[b], outfile);
 putc(right[b], outfile);
 printf("%02x%02x\n", left[b], right[b]);
 }
 
 }
 }
 /* write compressed buffer size */
 assert(size <= 0xFFFF);
 putc(size >> 8, outfile);
 putc(size & 0xFF, outfile);
 printf("compressed size: %d (%04x)\n", size, size);
 
 /* write compressed buffer */
 fwrite(buffer, 1, size, outfile);
 printf("write buffer\n");
 
}
int main(int argc, char* argv[]) 
{
 int notdone;
 FILE* infile, * outfile;
 if (argc != 3) 
 {
 printf("Usage: compress infile outfile\n");
 return -1;
 }
 infile = fopen(argv[1], "r");
 outfile = fopen(argv[2], "w"); 
 if (infile == NULL) 
 {
 printf("bad infile\n");
 return -1;
 }
 if (outfile == NULL) 
 {
 printf("bad outfile\n");
 return -1;
 }
 notdone = 1;
 while (notdone)
 {
 notdone = readblock(infile);
 compress();
 writeblock(outfile);
 }
 
 fclose(infile);
 fclose(outfile);
 return 0;
}

expand.c

/* BPE expand routine
 pseudocode:
 While not end of file
 Read pair table from input
 While more data in block
 If stack empty, read byte from input
 Else pop byte from stack
 If byte in table, push pair on stack
 Else write byte to output
 End while
 End while
*/
#include <stdio.h>
#include <assert.h>
unsigned char left[256], right[256];
unsigned char stack[256]; /* overflow? */
/* expand block */
/* return true if there are more blocks (doesn't end in EOF) */
int expand(FILE* infile, FILE* outfile) 
{ 
 int c, i, sp = 0;
 signed char count;
 int b = 0, usize, lsize, size;
 /* reset pair table to defaults */
 for (i = 0; i < 256; ++i)
 {
 left[i] = i;
 right[i] = 0;
 }
 /* read compressed pair table */
 while(b < 256) /* b = current table index */
 {
 c = getc(infile);
 if (c == EOF) return 0; /* last block */
 count = (signed char)c; 
 printf("b: %d Count: %d\n", b, count);
 assert(count != 0);
 /* negative count: skip forward by |count| then read a pair */
 if (count < 0)
 {
 b += -count;
 /* if not end table, read single pair */
 if (b < 256)
 {
 /* doesn't handle if file unexpectedly ends */
 left[b] = getc(infile);
 right[b] = getc(infile);
 printf("Read single pair %02x%02x\n", left[b], right[b]);
 ++b;
 }
 }
 else /* positive count: read |count| pairs */
 {
 int b_end = b + count;
 for (; b < b_end; ++b)
 {
 left[b] = getc(infile);
 right[b] = getc(infile);
 printf("Read pair %02x%02x\n", left[b], right[b]);
 }
 }
 }
 
 assert(b == 256); /* counts valid */
 printf("Pair table:\n");
 for (b = 0; b < 256; ++b) 
 {
 printf("%02x:%02x%02x\t", b, left[b], right[b]);
 }
 printf("\n");
 /* read compressed buffer size */
 usize = getc(infile);
 lsize = getc(infile);
 size = (usize << 8) + lsize;
 printf("size: %d(%02x%02x)\n", size, usize, lsize);
 /* write output, pushing pairs to stack */
 i = 0; 
 while (i < size || sp) /* more to read or stack non-empty */
 {
 if (sp == 0) /* stack empty */
 {
 c = getc(infile); /* read byte */
 printf("read byte: %02x\n", c);
 ++i; 
 }
 else
 {
 c = stack[--sp]; /* pop byte */
 printf("pop byte: %02x\n", c);
 }
 if (c != left[c]) /* pair in table */ 
 {
 /* push pair */
 stack[sp++] = right[c];
 stack[sp++] = left[c];
 printf("push pair %02x%02x\n", left[c], right[c]);
 }
 else /* pair not in table */
 {
 putc(c, outfile); /* write literal byte */
 printf("write byte %02x\n", c);
 }
 }
 
 return 1; /* try another block */
}
int main(int argc, char* argv[]) 
{
 FILE* infile, * outfile;
 int notdone;
 if (argc != 3) 
 {
 printf("Usage: expand infile outfile\n");
 return -1;
 }
 infile = fopen(argv[1], "r");
 outfile = fopen(argv[2], "w"); 
 if (infile == NULL) 
 {
 printf("bad infile\n");
 return -1;
 }
 if (outfile == NULL) 
 {
 printf("bad outfile\n");
 return -1;
 }
 /* process blocks */
 notdone = 1;
 while (notdone)
 notdone = expand(infile, outfile);
 fclose(infile);
 fclose(outfile); 
 return 0;
}

For fun, I made a simple Makefile to show how to compile.

all: compress expand
compress: compress.c
 gcc -O2 -Wall -Wextra -std=c89 -pedantic -o compress compress.c 
expand: expand.c
 gcc -O2 -Wall -Wextra -std=c89 -pedantic -o expand expand.c

Question 2

Why does code use naked magic number like 256 instead of through a macro?

Question 3

@chux-ReinstateMonica I didn't know what to name it. I would've said like UCHAR_MAX but that is 255.

Question 4

Could have use UCHAR_MAX +1 instead of 256.

Question 5

If int/unsigned was 16 bit, do you want to handle that potentiality?

Question 6

I think all values fit into 16-bit ints based on BLOCKSIZE, but I haven't tested.

Question 7

Recommend putting filename inside the source code file(s).
The link to the online doco should appear in the source code, too.

10,000m

Code looks clean at first glance.
Whitespace and structure look promising.
And, there is pseudocode in the source code, too!

Sticking to C89 standard is alright if that's something you want to try. When she was 3, my daughter would always hop with her right leg when crossing our driveway. Not everything has to make sense.

5,000m

Thirty-five lines of very similar code for main() in both files.
There are many ways to write this functionality.

Perhaps a single main() (~35 LoC) should be in its own source file that is compiled & linked with each of the two "operation" blocks of code each left in their source files.
Perhaps all functionality should be moved into one file so there is a single source file for entire project.
Perhaps there's no need for a makefile. Perhaps the compile command line(s) could appear in the source code, too. Copy/paste/execute is easy-as-pie...

Using the preprocessor and a token-or-two, the single source file could still be used to generate two different executables, each suitable for its purpose.

While looking at main(), recognise that the functionality could be achieved (perhaps better!) by designing the project to act as a 'filter':

cat file0.txt | compress | expand > file1.txt // perform both operations
// or...
expand < file2.bpe | grep "some text" // perform a "lookup" of something

This presumes that the "print debugging" statements, currently going to stdout, have been sanitised/neutralised. (Note, too, the use of ".bpe" as the file extension. Contents of a compressed file are mostly a meaningless collection of bytes until they are "expanded".)

There's a possible net reduction of 60+ lines of code. Unnecessary code (and its comments) provides more breeding grounds for unnecessary bugs.

2,000m

It's the coder's choice, therefore not wrong, but that "Allman style" makes it difficult for an experienced reader to scan. My experience is that placing too much emphasis on syntax and layout enables a sloppiness in "thinking about things". C is not the language for one who overly-relies on the compiler's syntax checking and the code's layout to "get it right". The most invisible bugs are those that hide in the prettiest code.

Careful review of the posted code shows that, in spite of good intentions, "Allman style" has not been applied in all cases where it should have been. Each individual is fully entitled to have their own opinion! I find this "spacious" style distracting, contributing to code bloat (ie. extended vertical space is detrimental to 'readability'.)

And, those C89 comments trigger both nostaligia and gratitude that // can be used instead (with modern compilers).

Eliminating "Allman style" braces shrinks the code by almost 100 lines. These functions are made from lots of lines of code. Having fewer lines of significant code visible and needing to scroll is disadvantageous.

1,000m

With greater density, it becomes apparent: The original author strove for light-weight and fast. "Hashing" allowed that code to use a 4Kb table. This code uses a 64Kb table. An unfortunate consequence is that repeatedly populating and searching the OP's table requires 16x the number of full table traversals (on many, many iterations.) Removing that hashing functionality is, imo, a step backwards.

10m

Something is worrisome...
Removing the many print debug statements in both source files, it becomes apparent that the OP has only superficially revised the copyrighted source code of the linked article. Some loops have been bent (still functioning), but most of the variable names and the logic are exactly the same!!

The OP linked to the online version of an article that was published in the "The C Users Journal" in Feb. 1994. I have a PDF scanned copy of the article (that starts on page 23 of that issue.) That hard-copy version was timely! I used "my" version of Gage's original code in a project for a client in 1995-6... thus my interest in this.

The OP writes:

To start the project, I used his provided C code as a reference, but decided early on to rewrite without looking at it. So there will be similarities to his original code in certain design decisions.

// Phil Gage's code (1994):
 buffer[size++] = c;
 /* Use rightcode to flag data chars found */
 if (!rightcode[c]) {
 rightcode[c] = 1;
 used++;
 }

vs.

// OP's code (2024):
 /* increase used count if new, mark in pair table as used */
 if (right[c] == 0) 
 {
 right[c] = 1;
 ++used;
 }
 buffer[size++] = c; /* push c to buffer */

It is a decidedly bad practice to overstate one's achievements!
Life (and the workplace) will smack you down to (or even below!) the deserved level when found-out.
NEVER make unjustified claims that overstate your work.

Mr. Gage wrote, 30 years ago:

For simplicity, error handling is minimal.

To write good code that deals with potentially mangled source data or users misusing your program, plugging the gaps by robustly handling errors is far more important than simply "re-arranging the furniture".

A "gift"

... voluminous amount of debug info is generated which can't be turned off.

Instead of re-publishing, here, a working function that allows code to tersely use varying degrees of verbosity for its many "print debugging" statements, here is a link to another program for the OP to study. Each debugging/tracing printf() (or fprintf( stderr, ...)) can have its own "reporting threshold" level.

Suggested exercises

Now that "byte pair encoding" has been thoroughly examined, fulfill the claim "without looking at it", by starting from scratch and writing your own source code that achieves Byte Pair Encoding.
Research the theory of "Huffman" encoding and set-about writing your own source code implementation. The theory and scheme are well documented in many places on the web. Implement both sides of a working Huffman program from scratch.

Question 8

Thank you for your review, as you are clearly familiar with this topic. I agree the file handling can be replaced by just using stdin and stdout. I have been messing around with more shell tools recently, including jq which really does this filter concept well, and the shell pipe can handle some I/O. I can then put logging in stderr (sometimes I wish there was stdlog for this purpose). The single source file is interesting but I've tried to avoid using preprocessor tricks. In a revised version I'm working on, I put if (DEBUG) blocks wherever there's debug printing, which is okayish.

Question 9

@qwr Yes, and I am not suggesting any plagiarism. And, that there are only so many ways to skin a cat... Thirty years ago, I adapted Gage's code for my purposes and was (silently) grateful for all of it. In that adaptation I renamed the two primary functions encode()/decode() in keeping with Byte Pair ENCODING... (Pretty sure I also combined leftcode[] and rightcode[] into code[256][2]...) However, because my version was shipped to a client, paramount was adding error detection/recovery for corrupt data (that did not make the edit for the magazine article.) Coding is details.

Question 10

@qwr "preprocessor tricks" One man's trick is another's technique... If one did "factor out" main() to its own source file (~60 LoC ==> ~30 LoC), one would need, too, one (or two) tiny app header files... OTOH, "pooling" into a single file (allowing static qualifiers, too) could mean that >>everything<< is in one package... I like to use tricks that consolidate and reduce clutter. (But, maybe that's just me...) Cheers! :-)

Question 11

Again, Welcome to C. You can decide. With thanks to Gage for publishing his code 30 years ago, I suggest you consider putting this project on the shelf (for a while), and go on to coding a "binary search tree" or whatever to expand your familiarity with the language and its libraries. With its admitted flaws, there's this challenge (for instance). Careful about thinking that one project gives a broad overview of a language. Both frustration and delight await when you tackle many, diverse challenges. Cheers!

Question 12

Again, I acknowledge your liberty to use this (1994) public domain code in any way that interests you (in 2024). I again suggest you press pause on this project and broaden the scope of applications you address in order to learn how to use C based 'approaches' and idioms when solving coding problems. (Back in 95-96 my task involved exchanging data (9600 baud) with a handheld scanner with VERY limited RAM. It is a mistake to assess things from the past based on modern norms.) Again, best wishes on your journey... Maybe there's another "Quake Inverse Square" algorithm waiting to be uncovered...

Question 13

Move away from globals

A key to efficient code writing is code re-use. Plan for the future.

Rather than global, pass in a pointer to this data. This allows easier integration of these routines into future code.

typdef struct {
 uchar buffer[BLOCKSIZE];
 uchar left[256], right[256]; /* pair table */
 uchar count[256][256]; /* pair counts */ 
 int size; 
} pair_data;
// int expand(FILE* infile, FILE* outfile) 
int expand(pair_data *pd, FILE* infile, FILE* outfile)

Write errors to stderr

// printf("bad outfile\n");
fprintf(stderr, "bad outfile\n");

Use sentences and sentence case

fprintf(stderr, "Bad outfile.\n");

IMO, show the bad filename and with sentinels.

fprintf(stderr, "Bad outfile \"%s\".\n", argv[2]);

Watch for errors

usize = getc(infile); lsize = getc(infile); and the return value of many other getc() calls do not check if EOF was returned. Error resilient code would do so.

Code assumes int is wider than 16-bit

Portable C89 code would handle 16-bit int/unsigned. This impacts code like size = (usize << 8) + lsize;, assert(size <= 0xFFFF);, putc(size >> 8, outfile);, ...

Small Stuff

Avoid printing signed types with unsigned specifiers

Change format, change type or use casts.

// int b = 0, usize, lsize, size;
unsigned b = 0, usize, lsize, size;

Put your toys away

The following code does not fclose(outfile) if infile is NULL. That makes little difference in main() here, yet it is good practice to fclose() whatever is successfully fopen().

 infile = fopen(argv[1], "r");
 if (infile == NULL) {
 printf("bad infile\n");
 return -1;
 }
 outfile = fopen(argv[2], "w"); 
 if (outfile == NULL) {
 fclose(infile);
 printf("bad outfile\n");
 return -1;
 }

Avoid naked magic numbers

Rather than 256, use UCHAR_MAX + 1 or the like.

#define UCHAR_N (UCHAR_MAX + 1)

Avoid bracketless bodies

This is a style issue, but, IMO, a preferred one.

// while (notdone)
// notdone = expand(infile, outfile);
 while (notdone) {
 notdone = expand(infile, outfile);
 }

Questionable type choice

Code uses signed char instead of int for some reason. int is certainly more performant.

// signed char count = 0;
int count = 0;

Pedantic: non-2's complement

Following does not well tear apart size.

// Fails with non-2's complement
putc(size >> 8, outfile);
putc(size & 0xFF, outfile);

Instead use an unsigned type for the size:

putc(usize >> 8, outfile);
putc(usize & 0xFF, outfile);

Question 14

Thank you for the answer. I'm going through it. I don't see the difference with the pointer vs globals for this small program. I can see it mattering for bigger programs.

Question 15

I used signed char for count because I want to interpret the value I got from getc as a signed byte. To my knowledge, if getc doesn't fail then it will be an unsigned value that fits in an unsigned char.

Question 16

What is the issue with size = (usize << 8) + lsize;? is it the need to have unsigned int if it's 16 bit?

Question 17

@qwr Yes. Shifting a 1 into the sign bit is UB. Best to use various unsigned types for sizing anyways.

Question 18

@qwr: pointers vs global matters on small disposable programs the most, because those are 100% the ones that end up a mission-critical application deployed to production.

user272752user272752 · Accepted Answer · 2024-08-17 02:13:35Z

Recommend putting filename inside the source code file(s).
The link to the online doco should appear in the source code, too.

10,000m

Code looks clean at first glance.
Whitespace and structure look promising.
And, there is pseudocode in the source code, too!

Sticking to C89 standard is alright if that's something you want to try. When she was 3, my daughter would always hop with her right leg when crossing our driveway. Not everything has to make sense.

5,000m

Thirty-five lines of very similar code for main() in both files.
There are many ways to write this functionality.

Perhaps a single main() (~35 LoC) should be in its own source file that is compiled & linked with each of the two "operation" blocks of code each left in their source files.
Perhaps all functionality should be moved into one file so there is a single source file for entire project.
Perhaps there's no need for a makefile. Perhaps the compile command line(s) could appear in the source code, too. Copy/paste/execute is easy-as-pie...

Using the preprocessor and a token-or-two, the single source file could still be used to generate two different executables, each suitable for its purpose.

While looking at main(), recognise that the functionality could be achieved (perhaps better!) by designing the project to act as a 'filter':

cat file0.txt | compress | expand > file1.txt // perform both operations
// or...
expand < file2.bpe | grep "some text" // perform a "lookup" of something

This presumes that the "print debugging" statements, currently going to stdout, have been sanitised/neutralised. (Note, too, the use of ".bpe" as the file extension. Contents of a compressed file are mostly a meaningless collection of bytes until they are "expanded".)

There's a possible net reduction of 60+ lines of code. Unnecessary code (and its comments) provides more breeding grounds for unnecessary bugs.

2,000m

It's the coder's choice, therefore not wrong, but that "Allman style" makes it difficult for an experienced reader to scan. My experience is that placing too much emphasis on syntax and layout enables a sloppiness in "thinking about things". C is not the language for one who overly-relies on the compiler's syntax checking and the code's layout to "get it right". The most invisible bugs are those that hide in the prettiest code.

Careful review of the posted code shows that, in spite of good intentions, "Allman style" has not been applied in all cases where it should have been. Each individual is fully entitled to have their own opinion! I find this "spacious" style distracting, contributing to code bloat (ie. extended vertical space is detrimental to 'readability'.)

And, those C89 comments trigger both nostaligia and gratitude that // can be used instead (with modern compilers).

Eliminating "Allman style" braces shrinks the code by almost 100 lines. These functions are made from lots of lines of code. Having fewer lines of significant code visible and needing to scroll is disadvantageous.

1,000m

With greater density, it becomes apparent: The original author strove for light-weight and fast. "Hashing" allowed that code to use a 4Kb table. This code uses a 64Kb table. An unfortunate consequence is that repeatedly populating and searching the OP's table requires 16x the number of full table traversals (on many, many iterations.) Removing that hashing functionality is, imo, a step backwards.

10m

Something is worrisome...
Removing the many print debug statements in both source files, it becomes apparent that the OP has only superficially revised the copyrighted source code of the linked article. Some loops have been bent (still functioning), but most of the variable names and the logic are exactly the same!!

The OP linked to the online version of an article that was published in the "The C Users Journal" in Feb. 1994. I have a PDF scanned copy of the article (that starts on page 23 of that issue.) That hard-copy version was timely! I used "my" version of Gage's original code in a project for a client in 1995-6... thus my interest in this.

The OP writes:

To start the project, I used his provided C code as a reference, but decided early on to rewrite without looking at it. So there will be similarities to his original code in certain design decisions.

// Phil Gage's code (1994):
 buffer[size++] = c;
 /* Use rightcode to flag data chars found */
 if (!rightcode[c]) {
 rightcode[c] = 1;
 used++;
 }

vs.

// OP's code (2024):
 /* increase used count if new, mark in pair table as used */
 if (right[c] == 0) 
 {
 right[c] = 1;
 ++used;
 }
 buffer[size++] = c; /* push c to buffer */

It is a decidedly bad practice to overstate one's achievements!
Life (and the workplace) will smack you down to (or even below!) the deserved level when found-out.
NEVER make unjustified claims that overstate your work.

Mr. Gage wrote, 30 years ago:

For simplicity, error handling is minimal.

To write good code that deals with potentially mangled source data or users misusing your program, plugging the gaps by robustly handling errors is far more important than simply "re-arranging the furniture".

A "gift"

... voluminous amount of debug info is generated which can't be turned off.

Instead of re-publishing, here, a working function that allows code to tersely use varying degrees of verbosity for its many "print debugging" statements, here is a link to another program for the OP to study. Each debugging/tracing printf() (or fprintf( stderr, ...)) can have its own "reporting threshold" level.

Suggested exercises

Now that "byte pair encoding" has been thoroughly examined, fulfill the claim "without looking at it", by starting from scratch and writing your own source code that achieves Byte Pair Encoding.
Research the theory of "Huffman" encoding and set-about writing your own source code implementation. The theory and scheme are well documented in many places on the web. Implement both sides of a working Huffman program from scratch.

Thank you for your review, as you are clearly familiar with this topic. I agree the file handling can be replaced by just using stdin and stdout. I have been messing around with more shell tools recently, including jq which really does this filter concept well, and the shell pipe can handle some I/O. I can then put logging in stderr (sometimes I wish there was stdlog for this purpose). The single source file is interesting but I've tried to avoid using preprocessor tricks. In a revised version I'm working on, I put if (DEBUG) blocks wherever there's debug printing, which is okayish.
@qwr Yes, and I am not suggesting any plagiarism. And, that there are only so many ways to skin a cat... Thirty years ago, I adapted Gage's code for my purposes and was (silently) grateful for all of it. In that adaptation I renamed the two primary functions encode()/decode() in keeping with Byte Pair ENCODING... (Pretty sure I also combined leftcode[] and rightcode[] into code[256][2]...) However, because my version was shipped to a client, paramount was adding error detection/recovery for corrupt data (that did not make the edit for the magazine article.) Coding is details.
@qwr "preprocessor tricks" One man's trick is another's technique... If one did "factor out" main() to its own source file (~60 LoC ==> ~30 LoC), one would need, too, one (or two) tiny app header files... OTOH, "pooling" into a single file (allowing static qualifiers, too) could mean that >>everything<< is in one package... I like to use tricks that consolidate and reduce clutter. (But, maybe that's just me...) Cheers! :-)
Again, Welcome to C. You can decide. With thanks to Gage for publishing his code 30 years ago, I suggest you consider putting this project on the shelf (for a while), and go on to coding a "binary search tree" or whatever to expand your familiarity with the language and its libraries. With its admitted flaws, there's this challenge (for instance). Careful about thinking that one project gives a broad overview of a language. Both frustration and delight await when you tackle many, diverse challenges. Cheers!
Again, I acknowledge your liberty to use this (1994) public domain code in any way that interests you (in 2024). I again suggest you press pause on this project and broaden the scope of applications you address in order to learn how to use C based 'approaches' and idioms when solving coding problems. (Back in 95-96 my task involved exchanging data (9600 baud) with a handheld scanner with VERY limited RAM. It is a mistake to assess things from the past based on modern norms.) Again, best wishes on your journey... Maybe there's another "Quake Inverse Square" algorithm waiting to be uncovered...

Stack Exchange Network

Byte Pair Encoding Compression in C89

2 Answers 2

10,000m

5,000m

2,000m

1,000m

10m

A "gift"

Suggested exercises

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Byte Pair Encoding Compression in C89

2 Answers 2

10,000m

5,000m

2,000m

1,000m

10m

A "gift"

Suggested exercises

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions