Simple compression algorithm

Question 1

An implementation of a simple compression algorithm that's featured in a programming practice book.

My goals:

Robust: All error conditions must be handled properly. The specification indicated in the self-documenting comment must be met exactly.
Secure: No opportunities for buffer overflow in the program.
Efficient: The implementation must be at most O(n) in time, be as efficient as possible in time, and result in using as little memory as possible.
Compatible: The implementation should be able to compile and run on any desktop, workstation, or server with a C99 compiler.
Well documented: The implementation should be clearly documented and re-usable.

Yes, this algorithm would probably never be used production. It was a test of my ability to write production quality C code. Is this efficient and secure C? Please provide any and all criticisms, including the way in which I documented the code.

a3compress.h:

#ifndef _A3COMPRESS_H_
#define _A3COMPRESS_H_
/* Compress an input string in the following manner:
 For each contiguous range of repeated chars, print the char followed by a
 decimal number indicating the number of repetitions.
 NOTE: 
 This function calls malloc, it returns NULL if malloc failed.
 In any other case the return value must be freed.
 Example:
 input: aaabbbbbccccaaa
 output: a3b5c4a3
 If the string contains a digit (0 - 9), the string will not be compressed,
 because the resulting compressed string would be ambiguous. A copy of the
 string is returned that must be freed.
 If the compressed string would not be of shorter length than the original
 string, a copy of the string is returned that must be freed. */
char * const a3compress (const char * const str);
#endif /* _A3COMPRESS_H_ */

a3compress.c:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "a3compress.h"
inline unsigned int count_digits (unsigned int n) {
 if (n == 0) {return 1;}
 unsigned int test = 1;
 unsigned int digits = 0;
 while (n >= test) {
 test *= 10;
 ++digits;
 } 
 return digits;
}
/* Compress an input string in the following manner:
 For each contiguous range of repeated chars, print the char followed by a
 decimal number indicating the number of repetitions.
 NOTE: 
 This function calls malloc, it returns NULL if malloc failed.
 In any other case the return value must be freed.
 Example:
 input: aaabbbbbccccaaa
 output: a3b5c4a3
 If the string contains a digit (0 - 9), the string will not be compressed,
 because the resulting compressed string would be ambiguous. A copy of the
 string is returned that must be freed.
 If the compressed string would not be of shorter length than the original
 string, a copy of the string is returned that must be freed. */
char * const a3compress (const char * const str) {
 size_t len = strlen(str);
 char * compressed;
 /* Determine the size of the compressed string */
 char curr = str[0];
 size_t cnt = 1;
 size_t idx = 0;
 for (size_t i = 1; i < len; ++i) {
 /* If the current char is a decimal digit, abort. The compressed string
 would be ambiguous */
 if (str[i] >= '0' && str[i] <= '9') {goto abort;}
 /* If the current char is the same as the previous char, and this
 isn't the last char in the input string, increase the count */
 if ((str[i] == curr) && (i != len - 1)) {
 ++cnt;
 } else {
 /* If this is the last char in the input string, increase the
 count */
 if (i == len - 1) {++cnt;}
 ++idx;
 if (idx > len - 1) {goto abort;} 
 /* Add the number of digits in the current count */
 size_t digits = count_digits(cnt);
 idx += digits;
 if (idx > len - 1) {goto abort;} 
 /* Begin counting repeats of the newly seen char */
 curr = str[i];
 cnt = 1;
 }
 }
 compressed = (char *)malloc((idx + 1) + 1);
 if (compressed == NULL) {return compressed;}
 curr = str[0];
 cnt = 1;
 idx = 0;
 for (size_t i = 1; i < len; ++i) {
 /* If the current char is the same as the previous char, and this
 isn't the last char in the input string, increase the count */
 if ((str[i] == curr) && (i != len - 1)) {
 ++cnt;
 } else {
 /* If this is the last char in the input string, increase the
 count */
 if (i == len - 1) {++cnt;}
 /* Add the current char to the output string */
 compressed[idx++] = curr;
 /* Add the current count to the output string */
 size_t digits = count_digits(cnt);
 char digit_s[digits + 1];
 size_t digit_len = snprintf(digit_s, digits + 1, "%d", 
 (unsigned int)cnt);
 idx += strlcpy(&compressed[idx], digit_s, digit_len + 1);
 /* Begin counting repeats of the newly seen char */
 curr = str[i];
 cnt = 1;
 }
 }
 return compressed;
abort:
 compressed = (char *)malloc(len + 1);
 if (compressed == NULL) {return compressed;}
 strlcpy(compressed, str, len + 1);
 return compressed;
}

test.c

#include <stdio.h>
#include <stdlib.h>
#include "a3compress.h"
int main (const int argc, char * const argv[]) {
 char * line = NULL;
 size_t linecap = 0;
 ssize_t linelen;
 linelen = getline(&line, &linecap, stdin);
 if (linelen <= 0) {perror(NULL), exit(1);}
 if (line[linelen - 1] == '\n') {line[linelen - 1] = '0円';}
 char * const compressed = a3compress(line);
 if (compressed == NULL) {perror(NULL), exit(1);}
 free(line);
 printf("%s\n", compressed);
 free(compressed);
}

Question 2

This seems to be a variant of RLE. Is there a reason to re-invent the wheel? Generally, trying to roll your own compression, hash or encryption algorithm is a bad idea since chances are very slim that you can improve on the existing stuff unless you're an expert in that field. But chances are high you're making something inefficient/insecure.

Question 3

Oh, never mind. I've overlooked your "It was a test of my ability to write production quality C code." statement.

Question 4

Better algorithm <char Sequence => '<char><count>'+ Where <count> is an actual number (not the text version of a number), remember that a char is just a very small integer (8 bits). Because you are using the text representation of a number you are using 8bits to represent 4 1/2 bits so you are wasting a lot of bits. There is a limit since a char can hold 0-256 this encoding breaks into sequence that are a max of 256 characters long: But 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA' => 'AA' (A is the 65 character so that). This also allows number to be encoded.

Question 5

Note: Compressed text is usually not human readable so where <count> is a non printable character it does not really matter.

Question 6

Comments

The count_digits function could need an explanation (what is its purpose?).

A lot of your comments describe what the code does. This is redundant, I can read the code to see what your code is doing. Instead, a comment should describe why something is done or why something is done in this specific way.

For example, the comment "If this is the last char in the input string, increase the count" is not very helpful. I can see that the code is doing that, but the question is: why is count increased only for the last character?

malloc

Don't cast the malloc return type..

Also, why malloc((idx + 1) + 1) ? Why not +2, and what's the reason for adding 2 in the first place? It needs a comment.

Logic

You're not handling an empty string correctly. You'll allocate 3 bytes instead of 1 byte and won't assign the 0 termination, which might lead to crashes or at least funny garbage when printing.

Return value

In case something is wrong (like invalid input string 123) you return a copy of the original string. But this is likely to be invalid or nonsense input for your decompression. So you can't distinguish between a valid and invalid return value. I would expect to have NULL returned on error and errno set to a reasonable value like EINVAL.

Standard conformance

strlcpy is a BSD function and probably not available everywhere. Since you've explicitly listed "compability" and only mentioned C99, you might want to use strncpy instead. But be careful to correctly append the 0 termination (see also the problem with the empty string).

Question 7

Thank you for your suggestions. It was interesting to read Ulrich Drepper's opinion on strlcpy

Question 8

strlcpy and strncpy only have comparable behavior by happenstance when the buffer is exactly one element longer than the source-string. So no, don't use them as replacements for each other.

Question 9

A few notes:

You use size_t everywhere else in your program besides in count_digits() where you use unsigned int. I would switch over and use size_t for consistency and to get rid of the compiler warnings. If you have a reason not to, you should leave some sort of comment stating why.
You do not modify the parameter passed to count_digits, so it should be declared as const.
If you want to, you could "simplify" your while loops into a for loop.
```
while (n >= test) {
 test *= 10;
 ++digits;
}
```
How I might do it (up to your discretion on whether or not you want to do this or not):
```
for (; n >= test; digits++)
{
 test *= 10;
}
```

Cast cnt into a size_t for consistency:

size_t digit_len = snprintf(digit_s, digits + 1, "%zu", (size_t) cnt);

You are inconsistent on where you put your const
```
int main (const int argc, char * const argv[])
```
Either always put it before you declare the type, or after.
You should always initialize your variables when you declare them. I noticed specifically the variable compressed.

TL;DR:

enter image description here

Question 10

Thanks for your suggestions. By the way, const char * const argv[] is not the same as char * const argv[]. The former means that both the pointers and their contents must not be modified, the latter means that only the pointers must not be modified. When I wrote char * const argv[], it means that the contents of arguments may be modified, but the pointers must not, this was my intention.

Question 11

@OregonTrail Agreed, but like I stated: consistency is key. If you deviate from your consistency than you need to state why in a comment.

Question 12

I respectfully disagree that appropriate use of const to indicate whether a pointer is const or its content is const is a matter of consistency. It has self-documenting semantics.

Question 13

@OregonTrail To a beginner, I don't think they would be able to read that self-documentation. You should always leave code in a highly understandable and maintainable state as possible. If you were to leave this code in the hands of a programmer still getting their grips on the language, I would highly doubt they would know the difference between const indicating if the pointer or the content is const. It's up to you to make the changes, but that's my opinion on it.

Question 14

This would mean adding an explanation above every signature for main that I write, which I don't intend to adopt. A comment explaining the syntax of the language is unnecessary, like writing the comment x = ++i; /* increments i, then stores the result in x */

Question 15

A few comments.

Your function does more or less the same thing twice. If you want to do that, then I would seperate the two halves into functions. But actually I would instead just allocate memory of the same length as the input string and encode into it. You can then call realloc at the end to recover wasted memory, if necessary.
You have duplicated the description of the function. Either put it in the header or in the C file. If you put it in both the descriptions will diverge over time.
I don't see why the function must return a const string.
Your count_digits should be static not inline. Whether to inline can be decided better by the compiler.
Use isdigit to detect numeric chars
Your comments are rather noisy - conveying little of use. If you use functions the code becomes self documenting - eg if your code that does the "Add the current count to the output string" was extracted to a function named add_count with the string and number as parameters it would be obvious what it was doing without a comment.

Here is how I might approach the function (not tested)

char* a3compress (const char* const s)
{
 const size_t len = strlen(s);
 char *res = calloc(len + 1, 1);
 if (res) {
 char *comp = res;
 for (int span = 1, n = 0; n < len; n += span) {
 if (isdigit(s[n])) { /* digits confuse the string encoding */
 memcpy(res, s, len + 1);
 break;
 }
 *comp++ = s[n];
 span = char_span(s, s[n]);
 if (span > 1) {
 comp += write_char_count(comp, span);
 }
 }
 assert(comp <= res + len);
 }
 return res;
}

The two functions do the obvious things. The assert is a sanity check -the logic of the function doesn't allow the assertion to fail, but the coding might be wrong.

Question 16

count_digit() has trouble with n near UINT_MAX as test *= 10 can overflow. Further, a special if (n == 0) is not needed.

#define MAXUPOWER10 1000000000u
#define MAXUPOWER10LOG 9
inline unsigned count_digits(unsigned n) {
 if (n >= MAXUPOWER10) 
 return MAXUPOWER10LOG + 1;
 unsigned test = 1;
 for (unsigned digits = 1; ; digits++) { // or do loop
 test *= 10;
 if (n < test) break; 
 }
 return digits;
}

Certain other algorithmic simplifications possible.

DarkDust DarkDustDarkDust 4053 silver badges8 bronze badges · Answer 1 · 2014-09-04 09:49:11Z

Comments

The count_digits function could need an explanation (what is its purpose?).

A lot of your comments describe what the code does. This is redundant, I can read the code to see what your code is doing. Instead, a comment should describe why something is done or why something is done in this specific way.

For example, the comment "If this is the last char in the input string, increase the count" is not very helpful. I can see that the code is doing that, but the question is: why is count increased only for the last character?

malloc

Don't cast the malloc return type..

Also, why malloc((idx + 1) + 1) ? Why not +2, and what's the reason for adding 2 in the first place? It needs a comment.

Logic

You're not handling an empty string correctly. You'll allocate 3 bytes instead of 1 byte and won't assign the 0 termination, which might lead to crashes or at least funny garbage when printing.

Return value

In case something is wrong (like invalid input string 123) you return a copy of the original string. But this is likely to be invalid or nonsense input for your decompression. So you can't distinguish between a valid and invalid return value. I would expect to have NULL returned on error and errno set to a reasonable value like EINVAL.

Standard conformance

strlcpy is a BSD function and probably not available everywhere. Since you've explicitly listed "compability" and only mentioned C99, you might want to use strncpy instead. But be careful to correctly append the 0 termination (see also the problem with the empty string).

Thank you for your suggestions. It was interesting to read Ulrich Drepper's opinion on strlcpy
strlcpy and strncpy only have comparable behavior by happenstance when the buffer is exactly one element longer than the source-string. So no, don't use them as replacements for each other.

syb0rg syb0rgsyb0rg 21.9k10 gold badges113 silver badges192 bronze badges · Answer 2 · 2014-09-04 19:13:15Z

6

\$\begingroup\$

A few notes:

You use size_t everywhere else in your program besides in count_digits() where you use unsigned int. I would switch over and use size_t for consistency and to get rid of the compiler warnings. If you have a reason not to, you should leave some sort of comment stating why.
You do not modify the parameter passed to count_digits, so it should be declared as const.
If you want to, you could "simplify" your while loops into a for loop.
```
while (n >= test) {
 test *= 10;
 ++digits;
}
```
How I might do it (up to your discretion on whether or not you want to do this or not):
```
for (; n >= test; digits++)
{
 test *= 10;
}
```

Cast cnt into a size_t for consistency:

size_t digit_len = snprintf(digit_s, digits + 1, "%zu", (size_t) cnt);

You are inconsistent on where you put your const
```
int main (const int argc, char * const argv[])
```
Either always put it before you declare the type, or after.
You should always initialize your variables when you declare them. I noticed specifically the variable compressed.

TL;DR:

enter image description here

Share

answered Sep 4, 2014 at 19:13

syb0rg's user avatar

syb0rg syb0rgsyb0rg

21.9k10 gold badges113 silver badges192 bronze badges

\$\endgroup\$

8

\$\begingroup\$ Thanks for your suggestions. By the way, const char * const argv[] is not the same as char * const argv[]. The former means that both the pointers and their contents must not be modified, the latter means that only the pointers must not be modified. When I wrote char * const argv[], it means that the contents of arguments may be modified, but the pointers must not, this was my intention. \$\endgroup\$

OregonTrail
– OregonTrail

2014年09月04日 19:28:24 +00:00
Commented Sep 4, 2014 at 19:28
\$\begingroup\$ @OregonTrail Agreed, but like I stated: consistency is key. If you deviate from your consistency than you need to state why in a comment. \$\endgroup\$

syb0rg
– syb0rg

2014年09月04日 19:29:31 +00:00
Commented Sep 4, 2014 at 19:29
\$\begingroup\$ I respectfully disagree that appropriate use of const to indicate whether a pointer is const or its content is const is a matter of consistency. It has self-documenting semantics. \$\endgroup\$

OregonTrail
– OregonTrail

2014年09月04日 19:32:11 +00:00
Commented Sep 4, 2014 at 19:32
\$\begingroup\$ @OregonTrail To a beginner, I don't think they would be able to read that self-documentation. You should always leave code in a highly understandable and maintainable state as possible. If you were to leave this code in the hands of a programmer still getting their grips on the language, I would highly doubt they would know the difference between const indicating if the pointer or the content is const. It's up to you to make the changes, but that's my opinion on it. \$\endgroup\$

syb0rg
– syb0rg

2014年09月04日 19:35:22 +00:00
Commented Sep 4, 2014 at 19:35
\$\begingroup\$ This would mean adding an explanation above every signature for main that I write, which I don't intend to adopt. A comment explaining the syntax of the language is unnecessary, like writing the comment x = ++i; /* increments i, then stores the result in x */ \$\endgroup\$

OregonTrail
– OregonTrail

2014年09月04日 19:38:05 +00:00
Commented Sep 4, 2014 at 19:38

| Show 3 more comments

score 3 · Answer 3 · 2014-09-07 22:45:40Z

A few comments.

Your function does more or less the same thing twice. If you want to do that, then I would seperate the two halves into functions. But actually I would instead just allocate memory of the same length as the input string and encode into it. You can then call realloc at the end to recover wasted memory, if necessary.
You have duplicated the description of the function. Either put it in the header or in the C file. If you put it in both the descriptions will diverge over time.
I don't see why the function must return a const string.
Your count_digits should be static not inline. Whether to inline can be decided better by the compiler.
Use isdigit to detect numeric chars
Your comments are rather noisy - conveying little of use. If you use functions the code becomes self documenting - eg if your code that does the "Add the current count to the output string" was extracted to a function named add_count with the string and number as parameters it would be obvious what it was doing without a comment.

Here is how I might approach the function (not tested)

char* a3compress (const char* const s)
{
 const size_t len = strlen(s);
 char *res = calloc(len + 1, 1);
 if (res) {
 char *comp = res;
 for (int span = 1, n = 0; n < len; n += span) {
 if (isdigit(s[n])) { /* digits confuse the string encoding */
 memcpy(res, s, len + 1);
 break;
 }
 *comp++ = s[n];
 span = char_span(s, s[n]);
 if (span > 1) {
 comp += write_char_count(comp, span);
 }
 }
 assert(comp <= res + len);
 }
 return res;
}

The two functions do the obvious things. The assert is a sanity check -the logic of the function doesn't allow the assertion to fail, but the coding might be wrong.

chux chuxchux 36.4k2 gold badges43 silver badges96 bronze badges · Answer 4 · 2014-09-09 01:34:13Z

count_digit() has trouble with n near UINT_MAX as test *= 10 can overflow. Further, a special if (n == 0) is not needed.

#define MAXUPOWER10 1000000000u
#define MAXUPOWER10LOG 9
inline unsigned count_digits(unsigned n) {
 if (n >= MAXUPOWER10) 
 return MAXUPOWER10LOG + 1;
 unsigned test = 1;
 for (unsigned digits = 1; ; digits++) { // or do loop
 test *= 10;
 if (n < test) break; 
 }
 return digits;
}

Certain other algorithmic simplifications possible.

Stack Exchange Network

Simple compression algorithm

4 Answers 4

Comments

malloc

Logic

Return value

Standard conformance

TL;DR:

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Simple compression algorithm

4 Answers 4

Comments

malloc

Logic

Return value

Standard conformance

TL;DR:

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions