I attempted a problem from the Cracking the Coding Interview book. The following input: aabcccaaa
should give the following output: a2b1c3a3
. Any advice is much appreciated!
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
char* compress(char*);
int main(int argc, char** argv)
{
if(argc==2)
{
printf("The compression of %s is %s \n", argv[1],
compress(argv[1]));
}
else
printf("Not correct number of inputs.");
}
char* compress(char* str)
{
const size_t len = strlen(str);
char result[len];
//initialized to 0, so length check can be done later
memset(result, 0, len);
int counter;
int j = 0;
if(str==NULL || len == 0)
return 0;
result[0] = str[0];
for(size_t i = 0; i < len; i++)
{
counter = 1;//reset counter
//to make sure no array out of bounds exception occurs
if(result[j+1]==0)
result[++j] = counter + '0';
else
return str;
while(str[i] == str[i+1])
{
//to store number in char array, add the asci value
// of 0.
result[j] = (++counter + '0');
i++;
}
if(result[j+1]==0)
result[++j] = str[i+1];
else
return str;
}
result[++j] = '0';
//to avoid "function returns address of local variable" error
//to also avoid altering original argument
char* str1 = (char*)malloc(sizeof(result));
strcpy(str1, result);
return str1;
}
4 Answers 4
You can eliminate that function prototype by defining
main()
last, but it's up to you.It's more common to first check the command line usage and then exit with a non-zero value if it's invalid. You should also print all errors to
stderr
, notstdout
(the latter is always used withprintf()
).if (argc != 2) { fprintf(stderr, "Not correct number of inputs."); return -1; }
You can probably check
str
right at the start, especially with thelen
assignment. If there is an error, then it may be clearer to returnNULL
instead of0
. At least that's what many library functions return specifically.The
while
loop can instead be afor
loop:for (; str[i] == str[i+1]; i++) { //to store number in char array, add the asci value // of 0. result[j] = (++counter + '0'); }
The NULL character for
result
is wrong; it should be'0円'
.There's actually no need to cast the return type of
malloc()
in C. Althoughmalloc()
does return avoid*
, it's not required because any pointer type can be assigned to avoid*
and is done "automatically" here.Moreover, you can avoid that aforementioned error by allocating
result
after making sure that the original string is valid. This will also avoid the need to callstrcpy()
at the end and risk other bugs. Sinceresult
will not be local this way, you can return it from the function without losing the data.char* compress(char* str) { // check to make sure str is valid... // assuming len is also valid char* result = malloc(len); // OR, use calloc() instead to get the memory zeroed // thus no need for memset() or anything else char* result = calloc(len, sizeof(char)); // probably also good to check for sufficient memory // if insufficient, print error and abort if (!result) { perror("malloc"); // or "calloc" abort(); } // do the compression... // return it right away; no need to copy anything return result; }
You never call
free()
onstr1
, so this function leaks memory. You would have to do this inmain()
with the returned string, and it may be necessary to assigncompress()
to aconst char*
to display beforefree
ing afterward. This also means thatcompress()
should return aconst char*
, not achar*
, preventing it from being modified further.It may work to have something like this in
main()
:int main(int argc, char** argv) { if (argc != 2) { // let the user know of the correct input fprintf(stderr, "usage: %s string\n", argv[0]); return -1; } const char* original = argv[1]; const char* compressed = compress(original); printf("The compression of %s is %s \n", original, compressed); free(compressed); }
-
\$\begingroup\$ Yes after you pointed out it leaks memory, i immediately did that. Thank you. I've added in the following lines of code.
char* result; if(argc==2) { result = compress(argv[1]); free(compress(argv[1])); printf("The compression of %s is %s \n", argv[1], result); }
Is that better? \$\endgroup\$user85591– user855912015年12月31日 21:30:36 +00:00Commented Dec 31, 2015 at 21:30 -
1\$\begingroup\$ @JonathanSmit: That would be ugly anyway. You can always create two separate
const char*
s for the original and compressed strings respectively. It would still be better than a memory leak. I can add something like that to my answer. \$\endgroup\$Jamal– Jamal2015年12月31日 21:35:56 +00:00Commented Dec 31, 2015 at 21:35 -
1\$\begingroup\$ What's the advantage of rewriting
while (cond) {stmt; incr;}
asfor (; cond; incr) {stmt;}
? The former seems much more natural and much clearer. \$\endgroup\$David Richerby– David Richerby2016年01月01日 12:57:20 +00:00Commented Jan 1, 2016 at 12:57 -
1\$\begingroup\$ @DavidRicherby Jamal likely has his own reasons for this, but considering I recently posted similar advice on one of Jamal's answers, I figure I might go ahead and add my own reasons. 1) It makes it trivial to glance at the loop and understand the flow of it (i.e. you can immediately tell how and when it increments -- at least assuming no shenanigans inside the loop). 2) It keeps the loop control around the loop instead of inside of it. 3) It keeps the same noisy
++i;
boilerplate line out of 80% of loops. \$\endgroup\$Corbin– Corbin2016年01月02日 01:53:54 +00:00Commented Jan 2, 2016 at 1:53 -
1\$\begingroup\$ (And yeah, these do essentially all come down to "it looks better to me" and "my brain parses it more easily" just as the
while
version looks more natural and clear to you... :/) \$\endgroup\$Corbin– Corbin2016年01月02日 01:54:27 +00:00Commented Jan 2, 2016 at 1:54
Bug: you assume that the result will never be longer than the original string while in fact it could be twice as long. For example
abcd
would yielda1b1c1d1
.Instead of using
memset
to initializeresult
you can also initialize it to 0 like this:char result[len] = { 0 };
You don't have to assign the counter on every iteration to the result. Simply increment the counter while the letter doesn't change and then add the count afterwards.
Another bug: the way you convert the counter into a string will stop working for any number larger than 9. You want
sprintf
.Also
strlen
returns the length of the string excluding the terminatingNULL
character so yourresult
is short by 1 in any case.Given the fairly severe bugs you really ought to test your code better.
-
\$\begingroup\$ Hello, for the bug, I believe I had an else statement, that will catch that and return the original string if the compressed string is actually longer. Does the code not do that? \$\endgroup\$user85591– user855912016年01月01日 03:07:08 +00:00Commented Jan 1, 2016 at 3:07
-
1\$\begingroup\$ @JonathanSmit You have some if else statements to return the original string if
result
is not 0 but that's just wrong since it implies an out of bound access which invokes undefined behavior. It may just happen to occasionally work but it's still broken. \$\endgroup\$ChrisWue– ChrisWue2016年01月01日 03:19:31 +00:00Commented Jan 1, 2016 at 3:19 -
1\$\begingroup\$ @ChrisWue: Might as well make it
snprintf()
(add the size protection). As for point #2, I'm not sure it always works that way with C, or maybe that's just me. At leastmemset()
can be something to fall back on. \$\endgroup\$Jamal– Jamal2016年01月01日 04:31:06 +00:00Commented Jan 1, 2016 at 4:31 -
\$\begingroup\$ @ChrisWue I added
if((j+1) < len
condition before each time i incrementj
and deference result. I also removed all the out of bounds access. Does that work better? Or should I be going about this differently? \$\endgroup\$user85591– user855912016年01月01日 04:35:47 +00:00Commented Jan 1, 2016 at 4:35 -
\$\begingroup\$ @JonathanSmit: Well, you should fix all the bugs, run some test cases - preferably with valgrind and then resubmit your code for review in a follow up question. I updated the question with another bug I found. \$\endgroup\$ChrisWue– ChrisWue2016年01月01日 05:05:36 +00:00Commented Jan 1, 2016 at 5:05
Don't speak of one
I know it is against what the problem expects, but an improvement in your compression algorithm would be to not print any number if there is only one occurrence of the letter.
This way, it would impossible to encounter that first bug mentioned in ChrisWue's answer. And, in files that don't have that many consecutive letters, they will have less added bytes from the compression.
However, as noted in two comments, if the digits are numbers, you can run into some issues. However, after reading the next section, you can see that, if you have consecutive numbers, they will be ASCII numbers, where the values accompanying the characters will not be ASCII.
ASCII numbers
This is also against the challenge output.
Most compression services don't leave an intentionally readable output file in the end. In that case, there is not much need to have the numbers accompanying the letters be ASCII. That way, when it comes to large repetitions of characters (at least > 10), you will not be spending an extra byte or two.
Now, you may think: if the number got high enough, it might interfere with the file content. That is true; but your code will do this now too. However, when decompressing, it can be kept in mind that the compression will result in two byte pairs: one byte for the character, one byte the character occurrences.
UTF
Right now, your compression algorithm won't do so well when it comes to UTF files. Since some UTF characters actually exceed 1 byte, this algorithm would probably end up not compressing anything at all.
Now, this may be out of the scope of your problem, but it would be a nice addition. If you look at the UTF-8 wiki page, you can see that it is somewhat possible to detect if a character is UTF, judging by the binary digits at the beginning of the digit.
This, of course, would raise complications as UTF-8 characters are variable length and you would have to determine the length in bytes of the character by the first binary digits.
To ease things up, you could have the user specify whether or not the file is UTF encoded (or, you might be able to find out by reading the file).
Misc
- Always use braces.
-
\$\begingroup\$ Hello, I am just doing this problem to practice coding interview questions, but thank you for all the useful information! \$\endgroup\$user85591– user855912016年01月01日 06:39:03 +00:00Commented Jan 1, 2016 at 6:39
-
\$\begingroup\$ If you "don't speak of 1", then the string "23" is ambiguous. Does it mean three twos or a two and a three? \$\endgroup\$David Richerby– David Richerby2016年01月01日 13:01:26 +00:00Commented Jan 1, 2016 at 13:01
-
\$\begingroup\$ Note that not including the repeat when it is 1 makes it harder to decode accurately, especially if the input contains digits:
b2211z
etc. \$\endgroup\$Jonathan Leffler– Jonathan Leffler2016年01月01日 15:48:47 +00:00Commented Jan 1, 2016 at 15:48 -
\$\begingroup\$ @DavidRicherby and JonathanLeffler These are good points. I've edited my post. \$\endgroup\$SirPython– SirPython2016年01月01日 22:20:45 +00:00Commented Jan 1, 2016 at 22:20
Bug. What happens if your input contains more than nine consecutive instances of the same character? What if it contains a lot more than nine? You don't do any bounds checking before assigning ++counter + '0'
into a char
variable and then printf
-ing it back in main
.
result
should belen+1
bytes long (to include the terminating null character) \$\endgroup\$