String compression implementation in C

Question 1

I implemented a basic string compression algorithm that uses the counts of repeated characters. For example, the string aabcccccaaa would become a2b1c5a3. What do you think about this? Is there a better way to do it?

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

char* compressString(char* arr, int size);

int main() {
    char arr[] = "aabcccccaa";
    char *str;
    printf("Before compression: %s\n", arr);
    str = compressString(arr, strlen(arr));
    printf("After compression: %s\n", str);
    // free allocated memory
    free(str);
    return 0;
}

char* compressString(char* str, int len) {
    char last = str[0];
    char *buf = (char *)malloc(len * sizeof(char));
    int count = 1;
    int j = 0;
    for (int i = 1; i < len; i++) {
        if (last == str[i]) {
            count++;
        } else {
            buf[j++] = last;
            buf[j++] += count + '0';
            last = str[i];
            count = 1;
        }
    }
    buf[j++] = last;
    buf[j] += count + '0';
    buf[j] = '\0';
    return buf;
}

Question 2

Thoughts: const correctness, comment on encoding of count (why single character/digit?), look into PackBits to see how to avoid encoding a length "for every source character change".
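To illustrate the PackBits idea mentioned above, here is a minimal sketch of a PackBits-style encoder. The function name packbits_encode and the requirement that the caller supplies an output buffer of at least 2*n bytes are assumptions made for this example. Each packet starts with a header byte: values 0..127 mean "copy the next header+1 bytes literally", and values 129..255 mean "repeat the next byte 257-header times", so stretches of non-repeating characters cost one header byte per packet rather than one count per character.

```c
#include <stddef.h>

/*
 * PackBits-style encoder (illustrative sketch).
 * dst must have room for at least 2 * n bytes (a loose upper bound).
 * Returns the number of bytes written to dst.
 */
size_t packbits_encode(const unsigned char *src, size_t n, unsigned char *dst)
{
    size_t i = 0, out = 0;

    while (i < n) {
        /* Measure the run starting at i, capped at 128 bytes. */
        size_t run = 1;
        while (i + run < n && run < 128 && src[i + run] == src[i])
            run++;

        if (run >= 2) {
            /* Repeat packet: header 257 - run, then the repeated byte. */
            dst[out++] = (unsigned char)(257 - run);
            dst[out++] = src[i];
            i += run;
        } else {
            /* Literal packet: collect bytes until a run of two or more
               begins, the input ends, or 128 bytes have been gathered. */
            size_t start = i, lit = 0;
            while (i < n && lit < 128) {
                if (i + 1 < n && src[i + 1] == src[i])
                    break;
                i++;
                lit++;
            }
            dst[out++] = (unsigned char)(lit - 1);
            for (size_t k = 0; k < lit; k++)
                dst[out++] = src[start + k];
        }
    }
    return out;
}
```

With this sketch, the 6-byte input abcdef becomes 7 bytes (one header plus six literals), whereas the character/count scheme from the question would need 12 characters.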

Question 3

Please don't update your code with changes after you've received answers; see "What to do after receiving answers" for more information.

Question 4

Seeing your edit to revision 5: it is detrimental to have the code that encodes a run in more than one place.

Question 8

Failing to document that a non-NULL return value needs to be free()d is an even bigger foul than not checking the value returned by malloc(). Actually freeing it, in spite of an immediately following return from main(), is proper.
The string literal in main differs from your example.
The code presented does not work as specified: the string returned lacks the last count.
The handling of len 0 is insufficient.
As you are aware, "adding the digit to buf[j]" is wrong.
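The points above could be addressed along the following lines. This is only a sketch: the name rle_compress and the decision to split runs longer than 9 into several pairs (so that every count stays a single digit) are assumptions for illustration, not part of the original code.

```c
#include <stdlib.h>

/*
 * Run-length encode src (len bytes) as <char><digit> pairs.
 * Returns a NUL-terminated buffer obtained from malloc(); the caller
 * owns it and must free() it. Returns NULL if the allocation fails.
 * Assumption for this sketch: runs longer than 9 are split ("a9a3").
 */
char *rle_compress(const char *src, size_t len)
{
    /* Worst case: every character forms its own pair, plus the NUL. */
    char *buf = malloc(2 * len + 1);
    if (buf == NULL)
        return NULL;

    size_t i = 0, j = 0;
    while (i < len) {
        /* Count how often src[i] repeats, up to 9 times. */
        size_t count = 1;
        while (i + count < len && count < 9 && src[i + count] == src[i])
            count++;

        buf[j++] = src[i];
        buf[j++] = (char)('0' + count);  /* plain assignment, not += */
        i += count;
    }
    buf[j] = '\0';  /* for len == 0 this yields an empty string */
    return buf;
}
```

For the input aabcccccaaa this returns a2b1c5a3, and an empty input returns an empty string rather than reading str[0].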

JS1 · Accepted Answer · 2016-01-08 05:50:22Z

A couple problems

You don't allocate enough space for the return buffer. You need to allocate 2*len + 1 bytes to handle the worst-case scenario, in which no character repeats and every input character becomes a character/count pair. The +1 is for the terminating null byte.
If the count goes above 9, you will output a non-digit character instead of a digit. If the count goes above 256, the digit will wrap around back to '1' and your compression will have failed to encode the original string.

answered Jan 8, 2016 at 5:50 · edited Jun 10, 2020 at 13:24 (Community Bot)
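One way to handle counts above 9, sketched here under stated assumptions, is to write each count in decimal with as many digits as it needs. The name rle_compress_decimal is hypothetical. The 2*len + 1 allocation still suffices, because a run of r characters produces 1 + digits(r) bytes, which never exceeds 2*r.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical variant: counts are written in decimal, using as many
 * digits as needed. The caller must free() the result; NULL on failure.
 */
char *rle_compress_decimal(const char *src, size_t len)
{
    size_t cap = 2 * len + 1;  /* worst case plus the terminating NUL */
    char *buf = malloc(cap);
    if (buf == NULL)
        return NULL;

    size_t i = 0, j = 0;
    while (i < len) {
        /* Measure the full run starting at i. */
        size_t count = 1;
        while (i + count < len && src[i + count] == src[i])
            count++;

        buf[j++] = src[i];
        /* snprintf appends the digits and a NUL; the NUL is overwritten
           by the next pair or kept as the final terminator. */
        j += (size_t)snprintf(buf + j, cap - j, "%zu", count);
        i += count;
    }
    buf[j] = '\0';
    return buf;
}
```

Twelve consecutive a characters come out as a12. Note that this format becomes ambiguous to decode if the input itself may contain digits, so a real design would need a rule for that case.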

@CodeCrack: perhaps you should get a rough overview of existing compression algorithms before reinventing the wheel in a particularly non-circular shape. For one thing this would show you all the ways that have been invented for storing counts efficiently (variable-length encodings based on bytes, nibbles or bits, Huffman-encoding of counts, arithmetic coding of counts, and so on). Run-length encodings like yours are often found in bitmaps (unsurprisingly called RLE bitmaps); looking at those might give you interesting ideas. Most of your questions have already been answered ages ago...
– DarthGizka, 2016-01-08 07:01:47 +00:00
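As one concrete instance of the byte-based variable-length encodings mentioned in this comment, the sketch below stores a count in base-128 groups, least significant group first, with the high bit of each byte marking "more bytes follow". The function names varint_encode and varint_decode are assumptions for illustration, and the decoder assumes well-formed input.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Write value as a base-128 varint: 7 payload bits per byte, with the
 * high bit set on every byte except the last.
 * Returns the number of bytes written (1 to 5 for a 32-bit value).
 */
size_t varint_encode(uint32_t value, unsigned char *out)
{
    size_t n = 0;
    while (value >= 0x80) {
        out[n++] = (unsigned char)((value & 0x7F) | 0x80);
        value >>= 7;
    }
    out[n++] = (unsigned char)value;
    return n;
}

/*
 * Read a varint produced by varint_encode (input assumed well-formed).
 * Stores the result in *value and returns the number of bytes consumed.
 */
size_t varint_decode(const unsigned char *in, uint32_t *value)
{
    uint32_t result = 0;
    unsigned shift = 0;
    size_t n = 0;

    while (in[n] & 0x80) {
        result |= (uint32_t)(in[n] & 0x7F) << shift;
        shift += 7;
        n++;
    }
    result |= (uint32_t)in[n] << shift;
    *value = result;
    return n + 1;
}
```

Counts up to 127 take a single byte, and a count of 300 takes two bytes (0xAC 0x02), so short runs stay cheap while long runs remain representable.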

I don't know what will be put in buf following a character from the input - buf[j++] += count + '0'; will add to whatever has been there before. I do have an inkling the last count will be overwritten, anyway.
– greybeard, 2020-12-22 15:19:09 +00:00
