Simple compression reloaded++

Question 1

As a follow up to my previous question I've improved the code + algorithm. The compression now works the following way:

Each character is followed by a length byte. The top 3 bits of that byte denote the number of additional length bytes which encode the count. The count is stored in "little endian order" (least significant byte first) with the first 5 bits being encoded in the initial length byte.

Some examples:

Count == 1 yields in length byte 00000001
Count == 31 yields in length byte 00011111
Count == 32 yields in length bytes 00100000 followed by 0000001
Count == 33 yields in length bytes 00100001 followed by 0000001

The program supports a command line option for printing the output in hex for easier reading and debugging. I tried to use a const int for the NUM_LENGTH_BITS as well but then the compiler claimed that the initializer element for MAX_COUNT isn't constant. Not sure if that can be worked around.

I've limited the max count to 61 bits to leave 3 bits for the length specifier and have a maximum of 8 bytes for the length output.

All types of feedback appreciated.

#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#define NUM_LENGTH_BITS 3
static const uint64_t MAX_COUNT = (~0ULL) >> NUM_LENGTH_BITS;
static bool print_hex = false;
static int most_significant_bit(uint64_t value)
{
 int i = 0;
 while (value) 
 {
 ++i;
 value >>= 1;
 }
 return i;
}
static int num_bytes_required(uint64_t value)
{
 int msb = most_significant_bit(value);
 int bytes = msb / 8;
 int remainder = msb % 8;
 if (remainder)
 {
 bytes += (remainder + NUM_LENGTH_BITS) > 8 ? 2 : 1;
 }
 return bytes;
}
void write_char(int c)
{
 if (print_hex)
 {
 if (printf("0x%x ", c) < 0)
 {
 perror("error printing to stdout");
 exit(EXIT_FAILURE);
 }
 }
 else 
 {
 if (EOF == putchar(c) && ferror(stdout))
 {
 perror("error writing to stdout");
 exit(EXIT_FAILURE);
 }
 }
}
void write_count(uint64_t count)
{
 int additional_length_bytes = num_bytes_required(count) - 1;
 assert(additional_length_bytes >= 0 && additional_length_bytes < 8);
 int first = ((additional_length_bytes << 5) | (count & 0x1F)) & 0xFF;
 write_char(first);
 count >>= 5;
 while (count)
 {
 write_char(count & 0xFF);
 count >>= 8;
 }
}
static void parse_args(int argc, char **argv)
{
 if (argc == 2)
 {
 print_hex = (stricmp(argv[1], "-h") == 0 || stricmp(argv[1], "--hex") == 0);
 }
 else
 {
 printf("Invalid number of arguments passed: %d\n", argc - 1);
 exit(EXIT_FAILURE);
 }
}
int main(int argc, char** argv)
{
 int current_char = 0;
 int previous_char = 0;
 uint64_t current_char_count = 0;
 parse_args(argc, argv);
 while (true)
 {
 current_char = getchar();
 if (current_char_count == 0 || current_char_count == MAX_COUNT || previous_char != current_char)
 {
 if (current_char_count > 0)
 {
 write_count(current_char_count);
 }
 if (EOF != current_char)
 {
 write_char(current_char);
 current_char_count = 1;
 previous_char = current_char;
 }
 else 
 {
 break;
 }
 }
 else
 {
 current_char_count += 1;
 }
 }
}

Question 2

Follow up question

Question 3

Although your code looks rather nice, there always alternate ways of doing stuff with its pros and cons. I'll comment on some of this alternate solutions to your implementation:

Simplify write_count() – Instead of counting all bits used, and splitting out 5 bits and then 8 bits at a time, I would opt for a simpler method to write the count, whilst still maintaining a variable byte count to write larger numbers. Before presenting that, I'm not sure if you really need to count the actual bits, as you always write bytes, so your implementation could most probably be simplified as well.

However if you use the 8th bit to indicate that more bits follows, there is no need to do all the precalculations, and you can do the following:
```
const int BIT_SHIFT = 7;
const int LOWER_BITS = 0x7f;
const int HIGH_BIT = 0x80;
void write_count(uint64_t count)
{
 int lower_bits = count & LOWER_BITS;
 while (lower_bits != count || count > LOWER_BITS) {
 write_char(lower_bits | HIGH_BIT);
 count >>= BIT_SHIFT;
 lower_bits = count & LOWER_BITS;
 }
 write_char(lower_bits);
}
```
This change will allow up to 127 repetitions to be represented using 2 bytes, instead of only 31 repetitions as in your original code. When using 3 bytes this implementation allows for 16384 repetitions (14 bits), whilst your allows 8192 repetitions (13 bits). After that your implementation is slightly more space efficient, but still slightly harder as you need to precalculate it.

Here is an example run showing the transition from 127 to 128 repetitions:
```
echo -n "eaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbc" | ./cr115765_rle -x 
0x65 0x1 0x61 0x7f 0x62 0x80 0x1 0x63 0x1 
```
That is 1 e, 127 (0x7f) a's, 128 b's (0x80 0x1), and 1 c.
Avoid magic numbers – In your version of write_count() you use the magic numbers of 5 and 8, whilst you've defined the NUM_LENGTH_BITS. Be consistent, and try to avoid most use of magic numbers. If you changed your NUM_LENGTH_BITS the rest of your code would not follow that change.
Avoid repeating your self – In write_char() the perror() & exit() part is repeated. If you change to using only printf() this can be avoided, and simplified into:
```
void write_char(int c)
{
 if (printf(G_print_hex ? "0x%x " : "%c", c) < 0) {
 perror("Error printing to stdout.");
 exit(EXIT_FAILURE);
 }
}
```
Here I use a ternary to choose the proper format string, and I've also added a space to the end of the hex output to make it a little easier to read.
Be vary about globals – Global variables are in general to be avoided, but stuff like the print_hex is awkward to pass as parameters everywhere. I tend to either uppercase them, or add G_ in front of them so that it becomes G_print_hex. The general advice is though to make some sort of distinction so that you easily detect where you are using globals.
Improve argument handling – I would look into alternate ways of handling your parameters, as the current implementation is somewhat handicapped due to the following reasons:
- stricmp is limited to Windows – stricmp is Windows specific (see here), so please use strcasecmp instead which is compliant with C99
- If used without parameters it prints error message – As it stands it prints an error message if you don't use any parameters.
- If you add more parameters the argc == 2 fails – For test purposes I tried adding a -v parameter, but this turned out to be cumbersome as this required some rewriting so I ended up hard-coding it (and removing it at the end).
  
  A proper parameter handling would require a for loop, and should most likely at least include the -x / --hex parameters, as well as a -h / --help parameters. In addition to allowing no parameters.
Consider simplifying main() – A common pattern is to have as simple main method as possible, that is to do parameter parsing and then calling the appopriate function to be executed. This allows for a simple entry point, and also easier reuse of functions if you extend your program.

I.e. if you renamed your current main() to encode(), you could easily add the decode() to the mix based on parameters. Having both of these within main() would not look tidy. This would also be a clearer segregation of duty, and better single responsibility design.
Try to reduce nesting of if's – This a little based on taste and personal opinions, but I tend to avoid using while (true) and break if possible, and try to keep the nesting of if's to a minimum.

In your case, this can be done using the following:
```
while ((current_char = getchar()) != EOF) {
 if (current_char == previous_char) {
 current_char_count += 1;
 } else {
 if (current_char_count > 0) {
 write_count(current_char_count);
 }
 write_char(current_char);
 current_char_count = 1;
 }
 previous_char = current_char;
}
if (current_char_count > 0) {
 write_count(current_char_count);
} 
```
To me this this clearer indicates the main purpose of reading characters until end of file, and by switching the if's around, I also clearer indicates that the main thingy is counting equal characters. And it is a clearer connection between the if and the else so that it is easier to see why we entered the else clause.

I've removed the reference related to MAX_COUNT as I consider it rather esoteric if you run this code on something with 2^64 repetitions of a single character. Another change is that I need to finish of the writing of the last count in an additional write_count() at the end. Still I think this reads somewhat easier than your implementation.
Optionally change brace style – This is totally a personal preference, and the main point is to keep bracing consistent, which you do! But I prefer having the opening brace on the same line in C, as I feel it make the code somewhat easier to read and slightly more compact.

Do however note that I still keep braces around one-line blocks, as you do.

Refactored code

Here is the complete refactored code (using opening braces on previous line):

#include <stdio.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
const int BIT_SHIFT = 7;
const int LOWER_BITS = 0x7f;
const int HIGH_BIT = 0x80;
bool G_print_hex = false;
void write_char(int c)
{
 if (printf(G_print_hex ? "0x%x " : "%c", c) < 0) {
 perror("Error printing to stdout.");
 exit(EXIT_FAILURE);
 }
}
void write_count(uint64_t count)
{
 int lower_bits = count & LOWER_BITS;
 while (lower_bits != count || count > LOWER_BITS) {
 write_char(lower_bits | HIGH_BIT);
 count >>= BIT_SHIFT;
 lower_bits = count & LOWER_BITS;
 }
 write_char(lower_bits);
}
void parse_args(int argc, char **argv)
{
 if (argc == 2) {
 G_print_hex = (strcasecmp(argv[1], "-x") == 0 || strcasecmp(argv[1], "--hex") == 0);
 } else if (argc != 1) {
 printf("Invalid number of arguments passed: %d\n", argc - 1);
 exit(EXIT_FAILURE);
 }
}
int main(int argc, char** argv)
{
 int current_char = 0;
 int previous_char = -1; // A non-legal char, so it triggers a change on first character begin chr(0)
 uint64_t current_char_count = 0;
 parse_args(argc, argv);
 while ((current_char = getchar()) != EOF) {
 if (current_char == previous_char) {
 current_char_count += 1;
 } else {
 if (current_char_count > 0) {
 write_count(current_char_count);
 }
 write_char(current_char);
 current_char_count = 1;
 }
 previous_char = current_char;
 }
 if (current_char_count > 0) {
 write_count(current_char_count);
 } 
}

I've also used two blank lines in between functions to help them stand out a little more, and I've been very lazy in writing comments.

_{PS! Using previous_char=-1 one avoids a bug when testing this with dd if=/dev/zero bs=1 count=65, as getchar() only returns unsigned char's or EOF. But previous_char is declared as an int, and as such using -1 as a start value, is safer than the original 0.}

Question 4

Yeah, the command line thing I tacked on as a quick afterthought - I guess it shows. Good catch on the stricmp and the argument count checking.

Question 5

A few minor notes (looks pretty good for the most part!):

Your most_significant_bit() function loop should be condensed into a for loop. The main difference between the for's and the while's is a matter of pragmatics: we usually use for when there is a known number of iterations (which might not seem like the case here, but it's true), and use while constructs when the number of iterations in not known in advance.
I'm not a fan of you using -h as the command line argument equivalent to --hex. Whenever I think of -h, I think of it as the shorthand of --help. A better shorthand would be -x in my opinion.
Why do you do current_char_count += 1? Why not current_char_count++/++current_char_count (both do the same thing in the same amount of time in this case)?

Question 6

The for loop I disagree with - I find the while loop in this case semantically more appropriate (if I explain the algorithm to myself in plain english it goes something like "while there are still bits set, shift value to right by one and increment the counter" - therefore I use a while loop).

Question 7

Good point about the -h. Last one: depends what kind of code I've been reading in the most recent past. Currently I prefer var += 1 if it's a standalone statement. Technically the pre- instead of the post-fix increment should be used if any since the post-fix operator technically requires that a copy should be made (since the value before the operation has to be returned) - admittedly in this case the compiler will very likely optimize this away but still.

Question 8

A post fix operator translates to a copy and increment while a prefix operator translates to an increment and copy. So they are both equally inexpensive (and in this case there is not even something to copy to).

Question 9

@ChrisWue I guess it's all about how you understand the code then. For me, for loops are super easy to read, even in plain English. Plus the counter variable (which is usually just i, but could vary) is more obvious to me and reduced in scope.

Question 10

@syb0rg: sure, except in this case the loop counter would need to live beyond the score of the loop so it can be returned (unless I misunderstand the way you intended to use the four loop)

holroy holroyholroy 11.7k1 gold badge27 silver badges59 bronze badges · Accepted Answer · 2016-01-04 12:31:28Z

Although your code looks rather nice, there always alternate ways of doing stuff with its pros and cons. I'll comment on some of this alternate solutions to your implementation:

Simplify write_count() – Instead of counting all bits used, and splitting out 5 bits and then 8 bits at a time, I would opt for a simpler method to write the count, whilst still maintaining a variable byte count to write larger numbers. Before presenting that, I'm not sure if you really need to count the actual bits, as you always write bytes, so your implementation could most probably be simplified as well.

However if you use the 8th bit to indicate that more bits follows, there is no need to do all the precalculations, and you can do the following:
```
const int BIT_SHIFT = 7;
const int LOWER_BITS = 0x7f;
const int HIGH_BIT = 0x80;
void write_count(uint64_t count)
{
 int lower_bits = count & LOWER_BITS;
 while (lower_bits != count || count > LOWER_BITS) {
 write_char(lower_bits | HIGH_BIT);
 count >>= BIT_SHIFT;
 lower_bits = count & LOWER_BITS;
 }
 write_char(lower_bits);
}
```
This change will allow up to 127 repetitions to be represented using 2 bytes, instead of only 31 repetitions as in your original code. When using 3 bytes this implementation allows for 16384 repetitions (14 bits), whilst your allows 8192 repetitions (13 bits). After that your implementation is slightly more space efficient, but still slightly harder as you need to precalculate it.

Here is an example run showing the transition from 127 to 128 repetitions:
```
echo -n "eaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbc" | ./cr115765_rle -x 
0x65 0x1 0x61 0x7f 0x62 0x80 0x1 0x63 0x1 
```
That is 1 e, 127 (0x7f) a's, 128 b's (0x80 0x1), and 1 c.
Avoid magic numbers – In your version of write_count() you use the magic numbers of 5 and 8, whilst you've defined the NUM_LENGTH_BITS. Be consistent, and try to avoid most use of magic numbers. If you changed your NUM_LENGTH_BITS the rest of your code would not follow that change.
Avoid repeating your self – In write_char() the perror() & exit() part is repeated. If you change to using only printf() this can be avoided, and simplified into:
```
void write_char(int c)
{
 if (printf(G_print_hex ? "0x%x " : "%c", c) < 0) {
 perror("Error printing to stdout.");
 exit(EXIT_FAILURE);
 }
}
```
Here I use a ternary to choose the proper format string, and I've also added a space to the end of the hex output to make it a little easier to read.
Be vary about globals – Global variables are in general to be avoided, but stuff like the print_hex is awkward to pass as parameters everywhere. I tend to either uppercase them, or add G_ in front of them so that it becomes G_print_hex. The general advice is though to make some sort of distinction so that you easily detect where you are using globals.
Improve argument handling – I would look into alternate ways of handling your parameters, as the current implementation is somewhat handicapped due to the following reasons:
- stricmp is limited to Windows – stricmp is Windows specific (see here), so please use strcasecmp instead which is compliant with C99
- If used without parameters it prints error message – As it stands it prints an error message if you don't use any parameters.
- If you add more parameters the argc == 2 fails – For test purposes I tried adding a -v parameter, but this turned out to be cumbersome as this required some rewriting so I ended up hard-coding it (and removing it at the end).
  
  A proper parameter handling would require a for loop, and should most likely at least include the -x / --hex parameters, as well as a -h / --help parameters. In addition to allowing no parameters.
Consider simplifying main() – A common pattern is to have as simple main method as possible, that is to do parameter parsing and then calling the appopriate function to be executed. This allows for a simple entry point, and also easier reuse of functions if you extend your program.

I.e. if you renamed your current main() to encode(), you could easily add the decode() to the mix based on parameters. Having both of these within main() would not look tidy. This would also be a clearer segregation of duty, and better single responsibility design.
Try to reduce nesting of if's – This a little based on taste and personal opinions, but I tend to avoid using while (true) and break if possible, and try to keep the nesting of if's to a minimum.

In your case, this can be done using the following:
```
while ((current_char = getchar()) != EOF) {
 if (current_char == previous_char) {
 current_char_count += 1;
 } else {
 if (current_char_count > 0) {
 write_count(current_char_count);
 }
 write_char(current_char);
 current_char_count = 1;
 }
 previous_char = current_char;
}
if (current_char_count > 0) {
 write_count(current_char_count);
} 
```
To me this this clearer indicates the main purpose of reading characters until end of file, and by switching the if's around, I also clearer indicates that the main thingy is counting equal characters. And it is a clearer connection between the if and the else so that it is easier to see why we entered the else clause.

I've removed the reference related to MAX_COUNT as I consider it rather esoteric if you run this code on something with 2^64 repetitions of a single character. Another change is that I need to finish of the writing of the last count in an additional write_count() at the end. Still I think this reads somewhat easier than your implementation.
Optionally change brace style – This is totally a personal preference, and the main point is to keep bracing consistent, which you do! But I prefer having the opening brace on the same line in C, as I feel it make the code somewhat easier to read and slightly more compact.

Do however note that I still keep braces around one-line blocks, as you do.

Refactored code

Here is the complete refactored code (using opening braces on previous line):

#include <stdio.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
const int BIT_SHIFT = 7;
const int LOWER_BITS = 0x7f;
const int HIGH_BIT = 0x80;
bool G_print_hex = false;
void write_char(int c)
{
 if (printf(G_print_hex ? "0x%x " : "%c", c) < 0) {
 perror("Error printing to stdout.");
 exit(EXIT_FAILURE);
 }
}
void write_count(uint64_t count)
{
 int lower_bits = count & LOWER_BITS;
 while (lower_bits != count || count > LOWER_BITS) {
 write_char(lower_bits | HIGH_BIT);
 count >>= BIT_SHIFT;
 lower_bits = count & LOWER_BITS;
 }
 write_char(lower_bits);
}
void parse_args(int argc, char **argv)
{
 if (argc == 2) {
 G_print_hex = (strcasecmp(argv[1], "-x") == 0 || strcasecmp(argv[1], "--hex") == 0);
 } else if (argc != 1) {
 printf("Invalid number of arguments passed: %d\n", argc - 1);
 exit(EXIT_FAILURE);
 }
}
int main(int argc, char** argv)
{
 int current_char = 0;
 int previous_char = -1; // A non-legal char, so it triggers a change on first character begin chr(0)
 uint64_t current_char_count = 0;
 parse_args(argc, argv);
 while ((current_char = getchar()) != EOF) {
 if (current_char == previous_char) {
 current_char_count += 1;
 } else {
 if (current_char_count > 0) {
 write_count(current_char_count);
 }
 write_char(current_char);
 current_char_count = 1;
 }
 previous_char = current_char;
 }
 if (current_char_count > 0) {
 write_count(current_char_count);
 } 
}

I've also used two blank lines in between functions to help them stand out a little more, and I've been very lazy in writing comments.

_{PS! Using previous_char=-1 one avoids a bug when testing this with dd if=/dev/zero bs=1 count=65, as getchar() only returns unsigned char's or EOF. But previous_char is declared as an int, and as such using -1 as a start value, is safer than the original 0.}

Yeah, the command line thing I tacked on as a quick afterthought - I guess it shows. Good catch on the stricmp and the argument count checking.

Stack Exchange Network

Simple compression reloaded++

2 Answers 2

Refactored code

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Simple compression reloaded++

2 Answers 2

Refactored code

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions