Simple compression on steroids - now with decompression

Question 1

A third follow up to the simple compression implementation. Adopted most of @holroy's suggestions. Main changes:

Compression now uses high-bit to indicate that additional length bytes are still following
Added decompression support
Should no longer depend on CHAR_BITS == 8 which increases portability
Use of unsigned long long since exact type size is not really important
Added additional options for printing command line help and decompression
Options are no longer global, instead an options object is being passed around

And here comes the code:

#include <stdio.h>
#include <stdbool.h>
#include <stdlib.h>
#include <strings.h>
#include <limits.h>
#define BIT_SHIFT (CHAR_BIT - 1)
const int HIGH_BIT = 1 << BIT_SHIFT;
const int LOWER_BITS = ~(~0 << BIT_SHIFT);
typedef struct
{
 bool print_hex : 1;
 bool decompress : 1;
 bool print_help : 1;
} options;
void write_char(int c, options *opts)
{
 if (printf(opts->print_hex ? "0x%x " : "%c", c) < 0)
 {
 perror("Error printing to stdout.");
 exit(EXIT_FAILURE);
 }
}
void write_count(unsigned long long count, options *opts)
{
 int lower_bits = count & LOWER_BITS;
 while (lower_bits != count || count > LOWER_BITS)
 {
 write_char(lower_bits | HIGH_BIT, opts);
 count >>= BIT_SHIFT;
 lower_bits = count & LOWER_BITS;
 }
 write_char(lower_bits, opts);
}
bool is_one_of(const char *value, const char *short_option, const char *long_option)
{
 return (strcasecmp(value, short_option) == 0 || strcasecmp(value, long_option) == 0);
}
options parse_args(int argc, char **argv)
{
 options opts = { 0 };
 for (int i = 1; i < argc; ++i)
 {
 if (is_one_of(argv[i], "-x", "--hex"))
 {
 opts.print_hex = true;
 }
 else if (is_one_of(argv[i], "-h", "--help"))
 {
 opts.print_help = true;
 }
 else if (is_one_of(argv[i], "-e", "--decompress"))
 {
 opts.decompress = true;
 }
 else
 {
 fprintf(stderr, "Invalid option %s\n", argv[i]);
 opts.print_help = true;
 }
 }
 return opts;
}
void compress(options *opts)
{
 int current_char = 0;
 int previous_char = 0;
 unsigned long long current_char_count = 0;
 while (EOF != (current_char = getchar()))
 {
 if (current_char_count > 0 && current_char_count < ULLONG_MAX && current_char == previous_char)
 {
 current_char_count += 1;
 } 
 else
 {
 if (current_char_count > 0)
 {
 write_count(current_char_count, opts);
 }
 write_char(current_char, opts);
 current_char_count = 1;
 }
 previous_char = current_char;
 }
 // write remainder
 if (current_char_count > 0)
 {
 write_count(current_char_count, opts);
 } 
}
unsigned long long read_length()
{
 unsigned long long length = 0;
 bool expect_more_length_bytes = true;
 while (expect_more_length_bytes)
 {
 int c = getchar();
 if (c == EOF)
 {
 fprintf(stderr, "unexpected end of stream, expected more length bytes\n");
 exit(EXIT_FAILURE);
 }
 expect_more_length_bytes = (c & HIGH_BIT) != 0;
 length <<= BIT_SHIFT;
 length |= (c & LOWER_BITS);
 }
 return length;
}
void decompress(options *opts)
{
 int current_char = 0;
 while (EOF != (current_char = getchar()))
 {
 unsigned long long length = read_length();
 for (unsigned long long i = 0; i < length; ++i)
 {
 write_char(current_char, opts);
 }
 }
}
void print_help(char *program_name)
{
 printf("Usage: %s [options]\n", program_name);
 puts("Options:");
 puts(" -h | --help\t\tprint this help");
 puts(" -x | --hex\t\tprint all characters in hexadecimal form separated by spaces");
 puts(" -e | --decompress\tdecompress the input");
 puts("by default the input from stdin will be compresssed and written to stdout");
}
int main(int argc, char** argv)
{
 options opts = parse_args(argc, argv);
 if (opts.print_help)
 {
 print_help(argv[0]);
 } 
 else if (opts.decompress)
 {
 decompress(&opts);
 }
 else 
 {
 compress(&opts);
 }
}

As usual any suggestions welcome (portability, standard conformity, best practices, etc.).

Question 2

When I run the source code through the compiled program, the result is almost twice the size of the original. That does not meet my definition of compression!

Question 3

@Edward: Yeah, compression is probably not quite the right word - more like run length encoding. This program was originally inspired just by some other code review question. And believe it or not - I have an application where the input is suited to this and compresses quite nicely without the need of having to add a dependency on some external ZIP library.

Question 4

I see a number of things that may help you improve your code. First, nice job on error handling! Too many programs posted for review here don't have any.

Consider portability

The code is currently using strcasecmp which is a POSIX rather than a C standard library. That's not necessarily a problem unless you want to compile and run the code in a non-POSIX environment. If you'd like to use only standard stuff, you could use strcmp in <string.h> instead. If you are only intending to run this on POSIX-like environments, you might want to use getopt instead of rolling your own option parser. Also consider whether you really want -e and -E to be the same thing. I'd recommend against that.

Avoid bitfields

Let the compiler handle the packing of bools within your struct. It's not size critical in this application and it may well be faster for the compiler to handle it some other way other than as a bitfield.

Consider using a better algorithm

As I mentioned in the comment, this program actually expands rather than compressing in many cases. We can do better! The current scheme stores this way:

[count][byte]

However, unless there at least three repetitions of the byte in the original file, this will actually blow up rather than compress. Consider the following alternative scheme:

[marker][count][byte]

OR

[byte]

Choose a specific MARKER byte that occurs very infrequently in the input. For compression, count occurrences as you currently do. Then when either there are at least 4 repetitions (making it worthwhile to encode this way) or the input byte was the MARKER character itself, encode using the first scheme. Otherwise, just write the original input byte. We can add just these two globals:

const char MARKER = '`';
const unsigned MIN_COMPRESS_BYTES = 3;

I've added this new routine to simplify the compression code:

void emit_token(options *opts, int current_char, unsigned long long current_char_count)
{
 if (current_char_count > MIN_COMPRESS_BYTES || current_char == MARKER)
 {
 write_char(MARKER, opts);
 write_count(current_char_count, opts);
 current_char_count = 1;
 }
 for ( ; current_char_count; --current_char_count) {
 write_char(current_char, opts);
 }
}

There are only minor changes to the compress routine:

void compress(options *opts)
{
 int current_char = 0;
 int previous_char = 0;
 unsigned long long current_char_count = 0;
 while (EOF != (current_char = getchar()))
 {
 if (current_char_count > 0 && current_char_count < ULLONG_MAX && current_char == previous_char)
 {
 current_char_count += 1;
 } 
 else
 {
 emit_token(opts, previous_char, current_char_count);
 current_char_count = 1;
 }
 previous_char = current_char;
 }
 // write remainder
 if (current_char_count)
 emit_token(opts, previous_char, current_char_count);
}

Even fewer changes were required for the decompress routine:

void decompress(options *opts)
{
 int current_char = 0;
 while (EOF != (current_char = getchar()))
 {
 if (current_char != MARKER) {
 write_char(current_char, opts);
 } else {
 unsigned long long length = read_length();
 for (current_char = getchar(); length; --length) {
 write_char(current_char, opts);
 }
 }
 }
}

Results

Now instead of making the file almost twice the size, when I run the program on its own source code (4269 bytes on my machine), the resulting file is 3867 bytes long. For data on which the original code worked well, this modification should still work well. For data that the original code expanded, the worst case is considerably better.

Question 5

#include <strings.h>

This header doesn't exist under non-POSIX implementations. It should be #include <string.h>.

strcasecmp(value, short_option)

strcasecmp is not in the C Standard Library. It is only defined in the POSIX implementations. For portability, you should implement your own wrapper function that calls the strcasecmp function or your own equivalent function depending on if we're linking to POSIX.

int mystrcasecmp(const char *a, const char *b)
{
 #ifdef POSIX
 return strcasecmp(a, b);
 #else
 return mycustomstrcasecmp(a, b);
 #endif
}
typedef struct
{
 bool print_hex : 1;
 bool decompress : 1;
 bool print_help : 1;
} options;

I personally dislike bitfields for many of the reasons listed here. I would just use a nice and clear implementation like this:

typedef struct 
{
 bool print_hex;
 bool decompress;
 bool print_help;
} options;

If you really feel like twiddling with bits, use an unsigned char and assign a bit to each option. Then mask for each of them.

typedef struct
{
 // 1 is print_hex (mask of 0xFE)
 // 2 is decompress (mask of 0xFD)
 // 4 is print_help (mask of 0xFC)
 unsigned char ops;
} options;

Question 6

As stated in the help output - compressing the input is the default behaviour so that is not bug - it's a feature :).

Question 7

Actually I should probably get rid of the case insensitive comparison - options are generally treated as case sensitive (at least on Unix like systems)

Edward Edward 67.2k4 gold badges120 silver badges284 bronze badges · Accepted Answer · 2016-01-05 22:46:53Z

I see a number of things that may help you improve your code. First, nice job on error handling! Too many programs posted for review here don't have any.

Consider portability

The code is currently using strcasecmp which is a POSIX rather than a C standard library. That's not necessarily a problem unless you want to compile and run the code in a non-POSIX environment. If you'd like to use only standard stuff, you could use strcmp in <string.h> instead. If you are only intending to run this on POSIX-like environments, you might want to use getopt instead of rolling your own option parser. Also consider whether you really want -e and -E to be the same thing. I'd recommend against that.

Avoid bitfields

Let the compiler handle the packing of bools within your struct. It's not size critical in this application and it may well be faster for the compiler to handle it some other way other than as a bitfield.

Consider using a better algorithm

As I mentioned in the comment, this program actually expands rather than compressing in many cases. We can do better! The current scheme stores this way:

[count][byte]

However, unless there at least three repetitions of the byte in the original file, this will actually blow up rather than compress. Consider the following alternative scheme:

[marker][count][byte]

OR

[byte]

Choose a specific MARKER byte that occurs very infrequently in the input. For compression, count occurrences as you currently do. Then when either there are at least 4 repetitions (making it worthwhile to encode this way) or the input byte was the MARKER character itself, encode using the first scheme. Otherwise, just write the original input byte. We can add just these two globals:

const char MARKER = '`';
const unsigned MIN_COMPRESS_BYTES = 3;

I've added this new routine to simplify the compression code:

void emit_token(options *opts, int current_char, unsigned long long current_char_count)
{
 if (current_char_count > MIN_COMPRESS_BYTES || current_char == MARKER)
 {
 write_char(MARKER, opts);
 write_count(current_char_count, opts);
 current_char_count = 1;
 }
 for ( ; current_char_count; --current_char_count) {
 write_char(current_char, opts);
 }
}

There are only minor changes to the compress routine:

void compress(options *opts)
{
 int current_char = 0;
 int previous_char = 0;
 unsigned long long current_char_count = 0;
 while (EOF != (current_char = getchar()))
 {
 if (current_char_count > 0 && current_char_count < ULLONG_MAX && current_char == previous_char)
 {
 current_char_count += 1;
 } 
 else
 {
 emit_token(opts, previous_char, current_char_count);
 current_char_count = 1;
 }
 previous_char = current_char;
 }
 // write remainder
 if (current_char_count)
 emit_token(opts, previous_char, current_char_count);
}

Even fewer changes were required for the decompress routine:

void decompress(options *opts)
{
 int current_char = 0;
 while (EOF != (current_char = getchar()))
 {
 if (current_char != MARKER) {
 write_char(current_char, opts);
 } else {
 unsigned long long length = read_length();
 for (current_char = getchar(); length; --length) {
 write_char(current_char, opts);
 }
 }
 }
}

Results

Now instead of making the file almost twice the size, when I run the program on its own source code (4269 bytes on my machine), the resulting file is 3867 bytes long. For data on which the original code worked well, this modification should still work well. For data that the original code expanded, the worst case is considerably better.

Stack Exchange Network

Simple compression on steroids - now with decompression

2 Answers 2

Consider portability

Avoid bitfields

Consider using a better algorithm

Results

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Simple compression on steroids - now with decompression

2 Answers 2

Consider portability

Avoid bitfields

Consider using a better algorithm

Results

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions