An implementation of a simple compression algorithm that's featured in a programming practice book.
My goals:
- Robust: All error conditions must be handled properly. The specification indicated in the self-documenting comment must be met exactly.
- Secure: No opportunities for buffer overflow in the program.
- Efficient: The implementation must be at most O(n) in time, be as efficient as possible in time, and result in using as little memory as possible.
- Compatible: The implementation should be able to compile and run on any desktop, workstation, or server with a C99 compiler.
- Well documented: The implementation should be clearly documented and re-usable.
Yes, this algorithm would probably never be used production. It was a test of my ability to write production quality C code. Is this efficient and secure C? Please provide any and all criticisms, including the way in which I documented the code.
a3compress.h:
#ifndef _A3COMPRESS_H_
#define _A3COMPRESS_H_
/* Compress an input string in the following manner:
For each contiguous range of repeated chars, print the char followed by a
decimal number indicating the number of repetitions.
NOTE:
This function calls malloc, it returns NULL if malloc failed.
In any other case the return value must be freed.
Example:
input: aaabbbbbccccaaa
output: a3b5c4a3
If the string contains a digit (0 - 9), the string will not be compressed,
because the resulting compressed string would be ambiguous. A copy of the
string is returned that must be freed.
If the compressed string would not be of shorter length than the original
string, a copy of the string is returned that must be freed. */
char * const a3compress (const char * const str);
#endif /* _A3COMPRESS_H_ */
a3compress.c:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "a3compress.h"
inline unsigned int count_digits (unsigned int n) {
if (n == 0) {return 1;}
unsigned int test = 1;
unsigned int digits = 0;
while (n >= test) {
test *= 10;
++digits;
}
return digits;
}
/* Compress an input string in the following manner:
For each contiguous range of repeated chars, print the char followed by a
decimal number indicating the number of repetitions.
NOTE:
This function calls malloc, it returns NULL if malloc failed.
In any other case the return value must be freed.
Example:
input: aaabbbbbccccaaa
output: a3b5c4a3
If the string contains a digit (0 - 9), the string will not be compressed,
because the resulting compressed string would be ambiguous. A copy of the
string is returned that must be freed.
If the compressed string would not be of shorter length than the original
string, a copy of the string is returned that must be freed. */
char * const a3compress (const char * const str) {
size_t len = strlen(str);
char * compressed;
/* Determine the size of the compressed string */
char curr = str[0];
size_t cnt = 1;
size_t idx = 0;
for (size_t i = 1; i < len; ++i) {
/* If the current char is a decimal digit, abort. The compressed string
would be ambiguous */
if (str[i] >= '0' && str[i] <= '9') {goto abort;}
/* If the current char is the same as the previous char, and this
isn't the last char in the input string, increase the count */
if ((str[i] == curr) && (i != len - 1)) {
++cnt;
} else {
/* If this is the last char in the input string, increase the
count */
if (i == len - 1) {++cnt;}
++idx;
if (idx > len - 1) {goto abort;}
/* Add the number of digits in the current count */
size_t digits = count_digits(cnt);
idx += digits;
if (idx > len - 1) {goto abort;}
/* Begin counting repeats of the newly seen char */
curr = str[i];
cnt = 1;
}
}
compressed = (char *)malloc((idx + 1) + 1);
if (compressed == NULL) {return compressed;}
curr = str[0];
cnt = 1;
idx = 0;
for (size_t i = 1; i < len; ++i) {
/* If the current char is the same as the previous char, and this
isn't the last char in the input string, increase the count */
if ((str[i] == curr) && (i != len - 1)) {
++cnt;
} else {
/* If this is the last char in the input string, increase the
count */
if (i == len - 1) {++cnt;}
/* Add the current char to the output string */
compressed[idx++] = curr;
/* Add the current count to the output string */
size_t digits = count_digits(cnt);
char digit_s[digits + 1];
size_t digit_len = snprintf(digit_s, digits + 1, "%d",
(unsigned int)cnt);
idx += strlcpy(&compressed[idx], digit_s, digit_len + 1);
/* Begin counting repeats of the newly seen char */
curr = str[i];
cnt = 1;
}
}
return compressed;
abort:
compressed = (char *)malloc(len + 1);
if (compressed == NULL) {return compressed;}
strlcpy(compressed, str, len + 1);
return compressed;
}
test.c
#include <stdio.h>
#include <stdlib.h>
#include "a3compress.h"
int main (const int argc, char * const argv[]) {
char * line = NULL;
size_t linecap = 0;
ssize_t linelen;
linelen = getline(&line, &linecap, stdin);
if (linelen <= 0) {perror(NULL), exit(1);}
if (line[linelen - 1] == '\n') {line[linelen - 1] = '0円';}
char * const compressed = a3compress(line);
if (compressed == NULL) {perror(NULL), exit(1);}
free(line);
printf("%s\n", compressed);
free(compressed);
}
4 Answers 4
Comments
The count_digits
function could need an explanation (what is its purpose?).
A lot of your comments describe what the code does. This is redundant, I can read the code to see what your code is doing. Instead, a comment should describe why something is done or why something is done in this specific way.
For example, the comment "If this is the last char in the input string, increase the count" is not very helpful. I can see that the code is doing that, but the question is: why is count increased only for the last character?
malloc
Don't cast the malloc
return type..
Also, why malloc((idx + 1) + 1)
? Why not +2
, and what's the reason for adding 2 in the first place? It needs a comment.
Logic
You're not handling an empty string correctly. You'll allocate 3 bytes instead of 1 byte and won't assign the 0 termination, which might lead to crashes or at least funny garbage when printing.
Return value
In case something is wrong (like invalid input string 123
) you return a copy of the original string. But this is likely to be invalid or nonsense input for your decompression. So you can't distinguish between a valid and invalid return value. I would expect to have NULL
returned on error and errno
set to a reasonable value like EINVAL
.
Standard conformance
strlcpy
is a BSD function and probably not available everywhere. Since you've explicitly listed "compability" and only mentioned C99, you might want to use strncpy
instead. But be careful to correctly append the 0 termination (see also the problem with the empty string).
-
\$\begingroup\$ Thank you for your suggestions. It was interesting to read Ulrich Drepper's opinion on
strlcpy
\$\endgroup\$OregonTrail– OregonTrail2014年09月04日 19:22:50 +00:00Commented Sep 4, 2014 at 19:22 -
\$\begingroup\$
strlcpy
andstrncpy
only have comparable behavior by happenstance when the buffer is exactly one element longer than the source-string. So no, don't use them as replacements for each other. \$\endgroup\$Deduplicator– Deduplicator2017年07月21日 13:40:58 +00:00Commented Jul 21, 2017 at 13:40
A few notes:
You use
size_t
everywhere else in your program besides incount_digits()
where you useunsigned int
. I would switch over and usesize_t
for consistency and to get rid of the compiler warnings. If you have a reason not to, you should leave some sort of comment stating why.You do not modify the parameter passed to
count_digits
, so it should be declared asconst
.If you want to, you could "simplify" your
while
loops into afor
loop.while (n >= test) { test *= 10; ++digits; }
How I might do it (up to your discretion on whether or not you want to do this or not):
for (; n >= test; digits++) { test *= 10; }
Cast
cnt
into asize_t
for consistency:size_t digit_len = snprintf(digit_s, digits + 1, "%zu", (size_t) cnt);
You are inconsistent on where you put your
const
int main (const int argc, char * const argv[])
Either always put it before you declare the type, or after.
You should always initialize your variables when you declare them. I noticed specifically the variable
compressed
.
TL;DR:
enter image description here
-
\$\begingroup\$ Thanks for your suggestions. By the way,
const char * const argv[]
is not the same aschar * const argv[]
. The former means that both the pointers and their contents must not be modified, the latter means that only the pointers must not be modified. When I wrotechar * const argv[]
, it means that the contents of arguments may be modified, but the pointers must not, this was my intention. \$\endgroup\$OregonTrail– OregonTrail2014年09月04日 19:28:24 +00:00Commented Sep 4, 2014 at 19:28 -
\$\begingroup\$ @OregonTrail Agreed, but like I stated: consistency is key. If you deviate from your consistency than you need to state why in a comment. \$\endgroup\$syb0rg– syb0rg2014年09月04日 19:29:31 +00:00Commented Sep 4, 2014 at 19:29
-
\$\begingroup\$ I respectfully disagree that appropriate use of
const
to indicate whether a pointer isconst
or its content isconst
is a matter of consistency. It has self-documenting semantics. \$\endgroup\$OregonTrail– OregonTrail2014年09月04日 19:32:11 +00:00Commented Sep 4, 2014 at 19:32 -
\$\begingroup\$ @OregonTrail To a beginner, I don't think they would be able to read that self-documentation. You should always leave code in a highly understandable and maintainable state as possible. If you were to leave this code in the hands of a programmer still getting their grips on the language, I would highly doubt they would know the difference between
const
indicating if the pointer or the content isconst
. It's up to you to make the changes, but that's my opinion on it. \$\endgroup\$syb0rg– syb0rg2014年09月04日 19:35:22 +00:00Commented Sep 4, 2014 at 19:35 -
\$\begingroup\$ This would mean adding an explanation above every signature for
main
that I write, which I don't intend to adopt. A comment explaining the syntax of the language is unnecessary, like writing the commentx = ++i; /* increments i, then stores the result in x */
\$\endgroup\$OregonTrail– OregonTrail2014年09月04日 19:38:05 +00:00Commented Sep 4, 2014 at 19:38
A few comments.
Your function does more or less the same thing twice. If you want to do that, then I would seperate the two halves into functions. But actually I would instead just allocate memory of the same length as the input string and encode into it. You can then call
realloc
at the end to recover wasted memory, if necessary.You have duplicated the description of the function. Either put it in the header or in the C file. If you put it in both the descriptions will diverge over time.
I don't see why the function must return a
const
string.Your
count_digits
should bestatic
notinline
. Whether toinline
can be decided better by the compiler.Use
isdigit
to detect numeric charsYour comments are rather noisy - conveying little of use. If you use functions the code becomes self documenting - eg if your code that does the "Add the current count to the output string" was extracted to a function named
add_count
with the string and number as parameters it would be obvious what it was doing without a comment.
Here is how I might approach the function (not tested)
char* a3compress (const char* const s)
{
const size_t len = strlen(s);
char *res = calloc(len + 1, 1);
if (res) {
char *comp = res;
for (int span = 1, n = 0; n < len; n += span) {
if (isdigit(s[n])) { /* digits confuse the string encoding */
memcpy(res, s, len + 1);
break;
}
*comp++ = s[n];
span = char_span(s, s[n]);
if (span > 1) {
comp += write_char_count(comp, span);
}
}
assert(comp <= res + len);
}
return res;
}
The two functions do the obvious things. The assert
is a sanity check -the logic of the function doesn't allow the assertion to fail, but the coding might be wrong.
count_digit()
has trouble with n
near UINT_MAX
as test *= 10
can overflow. Further, a special if (n == 0)
is not needed.
#define MAXUPOWER10 1000000000u
#define MAXUPOWER10LOG 9
inline unsigned count_digits(unsigned n) {
if (n >= MAXUPOWER10)
return MAXUPOWER10LOG + 1;
unsigned test = 1;
for (unsigned digits = 1; ; digits++) { // or do loop
test *= 10;
if (n < test) break;
}
return digits;
}
Certain other algorithmic simplifications possible.
Explore related questions
See similar questions with these tags.
<char Sequence => '<char><count>'+
Where<count>
is an actual number (not the text version of a number), remember that a char is just a very small integer (8 bits). Because you are using the text representation of a number you are using 8bits to represent 4 1/2 bits so you are wasting a lot of bits. There is a limit since a char can hold 0-256 this encoding breaks into sequence that are a max of 256 characters long: But 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA' => 'AA' (A is the 65 character so that). This also allows number to be encoded. \$\endgroup\$<count>
is a non printable character it does not really matter. \$\endgroup\$