I implemented basic string compression algorithm that uses the counts of repeated characters. For example: the string aabcccccaaa would become a2b1c5a3. What do you think about this, is there a better way to do this?
#include<stdio.h>
#include<stdlib.h>
#include<string.h>
char* compressString(char* arr, int size);
int main() {
char arr[] = "aabcccccaa";
char *str;
printf("Before compression: %s\n", arr);
str = compressString(arr, strlen(arr));
printf("After compression: %s\n", str);
// free allocated memory
free(str);
return 0;
}
char* compressString(char* str, int len) {
char last = str[0];
char *buf = (char *)malloc(len * sizeof(char));
int count = 1;
int j = 0;
for (int i = 1; i < len; i++) {
if (last == str[i]) {
count++;
} else {
buf[j++] = last;
buf[j++] += count + '0';
last = str[i];
count = 1;
}
}
buf[j++] = last;
buf[j] += count + '0';
buf[j] = '0円';
return buf;
}
2 Answers 2
A couple problems
You don't allocate enough space for the return buffer. You need to allocate
2*len + 1
bytes to handle the worst cast scenario. The+1
is for the null terminating byte.If the count goes above 9, you will output a non-digit character instead of a digit. If the count goes above 256, the digit will wrap around back to
'1'
and your compression will have failed to encode the original string.
-
4\$\begingroup\$ @CodeCrack: perhaps you should get a rough overview of existing compression algorithms before reinventing the wheel in a particularly non-circular shape. For one thing this would show you all the ways that have been invented for storing counts efficiently (variable-length encodings based on bytes, nibbles or bits, Huffman-encoding of counts, arithmetic coding of counts, and so on). Run-length encodings like yours are often found in bitmaps (unsurprisingly called RLE bitmaps); looking at those might give you interesting ideas. Most of your questions have already been answered ages ago... \$\endgroup\$DarthGizka– DarthGizka2016年01月08日 07:01:47 +00:00Commented Jan 8, 2016 at 7:01
-
\$\begingroup\$ I don't know what will be put in
buf
following a character from the input -buf[j++] += count + '0';
will add to whatever has been there before. I do have an inkling the last count will be overwritten, anyway. \$\endgroup\$greybeard– greybeard2020年12月22日 15:19:09 +00:00Commented Dec 22, 2020 at 15:19
Failing to document a non-NULL
return value will need to be free()
d is an even bigger foul than not checking the value from malloc()
- actually freeing in spite of an immediately following return from main()
is proper.
The string literal in main differs from your example.
The code presented does not work as specified: the string returned lacks the last count.
The handling of len
0 is insufficient.
As you are aware, "adding the digit to buf[j]
" is wrong.
count
(why single character/digit?), look into PackBits to see how to avoid encoding a length "for every source character change". \$\endgroup\$