Algorithm for Run Length Encoding - String Compression

Question 1

I attempted a problem from the Cracking the Coding Interview book. The following input: aabcccaaa should give the following output: a2b1c3a3. Any advice is much appreciated!

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
char* compress(char*);
int main(int argc, char** argv)
{ 
 if(argc==2)
 {
 printf("The compression of %s is %s \n", argv[1],
 compress(argv[1]));
 }
 else
 printf("Not correct number of inputs.");
}
char* compress(char* str)
{
 const size_t len = strlen(str);
 char result[len];
 //initialized to 0, so length check can be done later
 memset(result, 0, len); 
 int counter;
 int j = 0;
 if(str==NULL || len == 0)
 return 0; 
 result[0] = str[0];
 for(size_t i = 0; i < len; i++)
 { 
 counter = 1;//reset counter
 //to make sure no array out of bounds exception occurs
 if(result[j+1]==0)
 result[++j] = counter + '0';
 else
 return str;
 while(str[i] == str[i+1])
 {
 //to store number in char array, add the asci value 
 // of 0.
 result[j] = (++counter + '0'); 
 i++;
 }
 if(result[j+1]==0)
 result[++j] = str[i+1];
 else 
 return str;
 }
 result[++j] = '0';
 //to avoid "function returns address of local variable" error
 //to also avoid altering original argument
 char* str1 = (char*)malloc(sizeof(result));
 strcpy(str1, result);
 return str1;
}

Question 2

This is called Run Length Encoding

Question 3

result should be len+1 bytes long (to include the terminating null character)

Question 4

Please do not add, remove, or edit code in a question after you've received an answer. The site policy is explained in What to do when someone answers.

Question 5

Oh sorry about that.

Question 6

No problem. Feel free to ask a follow-up question with the new code.

Question 7

You can eliminate that function prototype by defining main() last, but it's up to you.
It's more common to first check the command line usage and then exit with a non-zero value if it's invalid. You should also print all errors to stderr, not stdout (the latter is always used with printf()).
```
if (argc != 2)
{
 fprintf(stderr, "Not correct number of inputs.");
 return -1;
}
```
You can probably check str right at the start, especially with the len assignment. If there is an error, then it may be clearer to return NULL instead of 0. At least that's what many library functions return specifically.

The while loop can instead be a for loop:

for (; str[i] == str[i+1]; i++)
{
 //to store number in char array, add the asci value 
 // of 0.
 result[j] = (++counter + '0');
}

The NULL character for result is wrong; it should be '0円'.

There's actually no need to cast the return type of malloc() in C. Although malloc() does return a void*, it's not required because any pointer type can be assigned to a void* and is done "automatically" here.

Moreover, you can avoid that aforementioned error by allocating result after making sure that the original string is valid. This will also avoid the need to call strcpy() at the end and risk other bugs. Since result will not be local this way, you can return it from the function without losing the data.

char* compress(char* str)
{
 // check to make sure str is valid...
 // assuming len is also valid
 char* result = malloc(len);
 // OR, use calloc() instead to get the memory zeroed
 // thus no need for memset() or anything else
 char* result = calloc(len, sizeof(char));
 // probably also good to check for sufficient memory
 // if insufficient, print error and abort
 if (!result)
 {
 perror("malloc"); // or "calloc"
 abort();
 }
 // do the compression...
 // return it right away; no need to copy anything
 return result;
}

You never call free() on str1, so this function leaks memory. You would have to do this in main() with the returned string, and it may be necessary to assign compress() to a const char* to display before freeing afterward. This also means that compress() should return a const char*, not a char*, preventing it from being modified further.

It may work to have something like this in main():
```
int main(int argc, char** argv)
{
 if (argc != 2)
 {
 // let the user know of the correct input
 fprintf(stderr, "usage: %s string\n", argv[0]);
 return -1;
 }
 const char* original = argv[1];
 const char* compressed = compress(original);
 printf("The compression of %s is %s \n", original, compressed);
 free(compressed);
}
```

Question 8

Yes after you pointed out it leaks memory, i immediately did that. Thank you. I've added in the following lines of code.

char* result; 	if(argc==2) 	{ 		result = compress(argv[1]); 		free(compress(argv[1])); 		printf("The compression of %s is %s \n", argv[1], 			result); 	}

Is that better?

Question 9

@JonathanSmit: That would be ugly anyway. You can always create two separate const char*s for the original and compressed strings respectively. It would still be better than a memory leak. I can add something like that to my answer.

Question 10

What's the advantage of rewriting while (cond) {stmt; incr;} as for (; cond; incr) {stmt;}? The former seems much more natural and much clearer.

Question 11

@DavidRicherby Jamal likely has his own reasons for this, but considering I recently posted similar advice on one of Jamal's answers, I figure I might go ahead and add my own reasons. 1) It makes it trivial to glance at the loop and understand the flow of it (i.e. you can immediately tell how and when it increments -- at least assuming no shenanigans inside the loop). 2) It keeps the loop control around the loop instead of inside of it. 3) It keeps the same noisy ++i; boilerplate line out of 80% of loops.

Question 12

(And yeah, these do essentially all come down to "it looks better to me" and "my brain parses it more easily" just as the while version looks more natural and clear to you... :/)

Question 13

Bug: you assume that the result will never be longer than the original string while in fact it could be twice as long. For example abcd would yield a1b1c1d1.
Instead of using memset to initialize result you can also initialize it to 0 like this:
```
char result[len] = { 0 };
```
You don't have to assign the counter on every iteration to the result. Simply increment the counter while the letter doesn't change and then add the count afterwards.
Another bug: the way you convert the counter into a string will stop working for any number larger than 9. You want sprintf.
Also strlen returns the length of the string excluding the terminating NULL character so your result is short by 1 in any case.
Given the fairly severe bugs you really ought to test your code better.

Question 14

Hello, for the bug, I believe I had an else statement, that will catch that and return the original string if the compressed string is actually longer. Does the code not do that?

Question 15

@JonathanSmit You have some if else statements to return the original string if result is not 0 but that's just wrong since it implies an out of bound access which invokes undefined behavior. It may just happen to occasionally work but it's still broken.

Question 16

@ChrisWue: Might as well make it snprintf() (add the size protection). As for point #2, I'm not sure it always works that way with C, or maybe that's just me. At least memset() can be something to fall back on.

Question 17

@ChrisWue I added if((j+1) < len condition before each time i increment j and deference result. I also removed all the out of bounds access. Does that work better? Or should I be going about this differently?

Question 18

@JonathanSmit: Well, you should fix all the bugs, run some test cases - preferably with valgrind and then resubmit your code for review in a follow up question. I updated the question with another bug I found.

Question 19

Don't speak of one

I know it is against what the problem expects, but an improvement in your compression algorithm would be to not print any number if there is only one occurrence of the letter.

This way, it would impossible to encounter that first bug mentioned in ChrisWue's answer. And, in files that don't have that many consecutive letters, they will have less added bytes from the compression.

However, as noted in two comments, if the digits are numbers, you can run into some issues. However, after reading the next section, you can see that, if you have consecutive numbers, they will be ASCII numbers, where the values accompanying the characters will not be ASCII.

ASCII numbers

This is also against the challenge output.

Most compression services don't leave an intentionally readable output file in the end. In that case, there is not much need to have the numbers accompanying the letters be ASCII. That way, when it comes to large repetitions of characters (at least > 10), you will not be spending an extra byte or two.

Now, you may think: if the number got high enough, it might interfere with the file content. That is true; but your code will do this now too. However, when decompressing, it can be kept in mind that the compression will result in two byte pairs: one byte for the character, one byte the character occurrences.

UTF

Right now, your compression algorithm won't do so well when it comes to UTF files. Since some UTF characters actually exceed 1 byte, this algorithm would probably end up not compressing anything at all.

Now, this may be out of the scope of your problem, but it would be a nice addition. If you look at the UTF-8 wiki page, you can see that it is somewhat possible to detect if a character is UTF, judging by the binary digits at the beginning of the digit.

This, of course, would raise complications as UTF-8 characters are variable length and you would have to determine the length in bytes of the character by the first binary digits.

To ease things up, you could have the user specify whether or not the file is UTF encoded (or, you might be able to find out by reading the file).

Misc

Always use braces.

Question 20

Hello, I am just doing this problem to practice coding interview questions, but thank you for all the useful information!

Question 21

If you "don't speak of 1", then the string "23" is ambiguous. Does it mean three twos or a two and a three?

Question 22

Note that not including the repeat when it is 1 makes it harder to decode accurately, especially if the input contains digits: b2211z etc.

Question 23

@DavidRicherby and JonathanLeffler These are good points. I've edited my post.

Question 24

Bug. What happens if your input contains more than nine consecutive instances of the same character? What if it contains a lot more than nine? You don't do any bounds checking before assigning ++counter + '0' into a char variable and then printf-ing it back in main.

Jamal 35.2k13 gold badges134 silver badges238 bronze badges · Accepted Answer · 2015-12-31 21:12:41Z

You can eliminate that function prototype by defining main() last, but it's up to you.
It's more common to first check the command line usage and then exit with a non-zero value if it's invalid. You should also print all errors to stderr, not stdout (the latter is always used with printf()).
```
if (argc != 2)
{
 fprintf(stderr, "Not correct number of inputs.");
 return -1;
}
```
You can probably check str right at the start, especially with the len assignment. If there is an error, then it may be clearer to return NULL instead of 0. At least that's what many library functions return specifically.

The while loop can instead be a for loop:

for (; str[i] == str[i+1]; i++)
{
 //to store number in char array, add the asci value 
 // of 0.
 result[j] = (++counter + '0');
}

The NULL character for result is wrong; it should be '0円'.

There's actually no need to cast the return type of malloc() in C. Although malloc() does return a void*, it's not required because any pointer type can be assigned to a void* and is done "automatically" here.

Moreover, you can avoid that aforementioned error by allocating result after making sure that the original string is valid. This will also avoid the need to call strcpy() at the end and risk other bugs. Since result will not be local this way, you can return it from the function without losing the data.

char* compress(char* str)
{
 // check to make sure str is valid...
 // assuming len is also valid
 char* result = malloc(len);
 // OR, use calloc() instead to get the memory zeroed
 // thus no need for memset() or anything else
 char* result = calloc(len, sizeof(char));
 // probably also good to check for sufficient memory
 // if insufficient, print error and abort
 if (!result)
 {
 perror("malloc"); // or "calloc"
 abort();
 }
 // do the compression...
 // return it right away; no need to copy anything
 return result;
}

You never call free() on str1, so this function leaks memory. You would have to do this in main() with the returned string, and it may be necessary to assign compress() to a const char* to display before freeing afterward. This also means that compress() should return a const char*, not a char*, preventing it from being modified further.

It may work to have something like this in main():
```
int main(int argc, char** argv)
{
 if (argc != 2)
 {
 // let the user know of the correct input
 fprintf(stderr, "usage: %s string\n", argv[0]);
 return -1;
 }
 const char* original = argv[1];
 const char* compressed = compress(original);
 printf("The compression of %s is %s \n", original, compressed);
 free(compressed);
}
```

Yes after you pointed out it leaks memory, i immediately did that. Thank you. I've added in the following lines of code. char* result; if(argc==2) { result = compress(argv[1]); free(compress(argv[1])); printf("The compression of %s is %s \n", argv[1], result); } Is that better?
@JonathanSmit: That would be ugly anyway. You can always create two separate const char*s for the original and compressed strings respectively. It would still be better than a memory leak. I can add something like that to my answer.
What's the advantage of rewriting while (cond) {stmt; incr;} as for (; cond; incr) {stmt;}? The former seems much more natural and much clearer.
@DavidRicherby Jamal likely has his own reasons for this, but considering I recently posted similar advice on one of Jamal's answers, I figure I might go ahead and add my own reasons. 1) It makes it trivial to glance at the loop and understand the flow of it (i.e. you can immediately tell how and when it increments -- at least assuming no shenanigans inside the loop). 2) It keeps the loop control around the loop instead of inside of it. 3) It keeps the same noisy ++i; boilerplate line out of 80% of loops.
(And yeah, these do essentially all come down to "it looks better to me" and "my brain parses it more easily" just as the while version looks more natural and clear to you... :/)

Stack Exchange Network

Algorithm for Run Length Encoding - String Compression

4 Answers 4

Don't speak of one

ASCII numbers

UTF

Misc

You must log in to answer this question.

Linked

Hot Network Questions

Algorithm for Run Length Encoding - String Compression

4 Answers 4

Don't speak of one

ASCII numbers

UTF

Misc

You must log in to answer this question.

Linked

Related

Hot Network Questions