As a follow up to my previous question I've improved the code + algorithm. The compression now works the following way:
Each character is followed by a length byte. The top 3 bits of that byte denote the number of additional length bytes which encode the count. The count is stored in "little endian order" (least significant byte first) with the first 5 bits being encoded in the initial length byte.
Some examples:
Count == 1 yields in length byte 00000001 Count == 31 yields in length byte 00011111 Count == 32 yields in length bytes 00100000 followed by 0000001 Count == 33 yields in length bytes 00100001 followed by 0000001
The program supports a command line option for printing the output in hex for easier reading and debugging. I tried to use a const int
for the NUM_LENGTH_BITS
as well but then the compiler claimed that the initializer element for MAX_COUNT
isn't constant. Not sure if that can be worked around.
I've limited the max count to 61 bits to leave 3 bits for the length specifier and have a maximum of 8 bytes for the length output.
All types of feedback appreciated.
#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#define NUM_LENGTH_BITS 3
static const uint64_t MAX_COUNT = (~0ULL) >> NUM_LENGTH_BITS;
static bool print_hex = false;
static int most_significant_bit(uint64_t value)
{
int i = 0;
while (value)
{
++i;
value >>= 1;
}
return i;
}
static int num_bytes_required(uint64_t value)
{
int msb = most_significant_bit(value);
int bytes = msb / 8;
int remainder = msb % 8;
if (remainder)
{
bytes += (remainder + NUM_LENGTH_BITS) > 8 ? 2 : 1;
}
return bytes;
}
void write_char(int c)
{
if (print_hex)
{
if (printf("0x%x ", c) < 0)
{
perror("error printing to stdout");
exit(EXIT_FAILURE);
}
}
else
{
if (EOF == putchar(c) && ferror(stdout))
{
perror("error writing to stdout");
exit(EXIT_FAILURE);
}
}
}
void write_count(uint64_t count)
{
int additional_length_bytes = num_bytes_required(count) - 1;
assert(additional_length_bytes >= 0 && additional_length_bytes < 8);
int first = ((additional_length_bytes << 5) | (count & 0x1F)) & 0xFF;
write_char(first);
count >>= 5;
while (count)
{
write_char(count & 0xFF);
count >>= 8;
}
}
static void parse_args(int argc, char **argv)
{
if (argc == 2)
{
print_hex = (stricmp(argv[1], "-h") == 0 || stricmp(argv[1], "--hex") == 0);
}
else
{
printf("Invalid number of arguments passed: %d\n", argc - 1);
exit(EXIT_FAILURE);
}
}
int main(int argc, char** argv)
{
int current_char = 0;
int previous_char = 0;
uint64_t current_char_count = 0;
parse_args(argc, argv);
while (true)
{
current_char = getchar();
if (current_char_count == 0 || current_char_count == MAX_COUNT || previous_char != current_char)
{
if (current_char_count > 0)
{
write_count(current_char_count);
}
if (EOF != current_char)
{
write_char(current_char);
current_char_count = 1;
previous_char = current_char;
}
else
{
break;
}
}
else
{
current_char_count += 1;
}
}
}
-
\$\begingroup\$ Follow up question \$\endgroup\$ChrisWue– ChrisWue2016年01月05日 19:56:50 +00:00Commented Jan 5, 2016 at 19:56
2 Answers 2
Although your code looks rather nice, there always alternate ways of doing stuff with its pros and cons. I'll comment on some of this alternate solutions to your implementation:
Simplify
write_count()
– Instead of counting all bits used, and splitting out 5 bits and then 8 bits at a time, I would opt for a simpler method to write the count, whilst still maintaining a variable byte count to write larger numbers. Before presenting that, I'm not sure if you really need to count the actual bits, as you always write bytes, so your implementation could most probably be simplified as well.However if you use the 8th bit to indicate that more bits follows, there is no need to do all the precalculations, and you can do the following:
const int BIT_SHIFT = 7; const int LOWER_BITS = 0x7f; const int HIGH_BIT = 0x80; void write_count(uint64_t count) { int lower_bits = count & LOWER_BITS; while (lower_bits != count || count > LOWER_BITS) { write_char(lower_bits | HIGH_BIT); count >>= BIT_SHIFT; lower_bits = count & LOWER_BITS; } write_char(lower_bits); }
This change will allow up to 127 repetitions to be represented using 2 bytes, instead of only 31 repetitions as in your original code. When using 3 bytes this implementation allows for 16384 repetitions (14 bits), whilst your allows 8192 repetitions (13 bits). After that your implementation is slightly more space efficient, but still slightly harder as you need to precalculate it.
Here is an example run showing the transition from 127 to 128 repetitions:
echo -n "eaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbc" | ./cr115765_rle -x 0x65 0x1 0x61 0x7f 0x62 0x80 0x1 0x63 0x1
That is 1
e
, 127 (0x7f)a
's, 128b
's (0x80 0x1), and 1c
.Avoid magic numbers – In your version of
write_count()
you use the magic numbers of5
and8
, whilst you've defined theNUM_LENGTH_BITS
. Be consistent, and try to avoid most use of magic numbers. If you changed yourNUM_LENGTH_BITS
the rest of your code would not follow that change.Avoid repeating your self – In
write_char()
theperror()
&exit()
part is repeated. If you change to using onlyprintf()
this can be avoided, and simplified into:void write_char(int c) { if (printf(G_print_hex ? "0x%x " : "%c", c) < 0) { perror("Error printing to stdout."); exit(EXIT_FAILURE); } }
Here I use a ternary to choose the proper format string, and I've also added a space to the end of the hex output to make it a little easier to read.
Be vary about globals – Global variables are in general to be avoided, but stuff like the
print_hex
is awkward to pass as parameters everywhere. I tend to either uppercase them, or addG_
in front of them so that it becomesG_print_hex
. The general advice is though to make some sort of distinction so that you easily detect where you are using globals.Improve argument handling – I would look into alternate ways of handling your parameters, as the current implementation is somewhat handicapped due to the following reasons:
stricmp
is limited to Windows –stricmp
is Windows specific (see here), so please usestrcasecmp
instead which is compliant with C99- If used without parameters it prints error message – As it stands it prints an error message if you don't use any parameters.
If you add more parameters the
argc == 2
fails – For test purposes I tried adding a-v
parameter, but this turned out to be cumbersome as this required some rewriting so I ended up hard-coding it (and removing it at the end).A proper parameter handling would require a
for
loop, and should most likely at least include the-x
/--hex
parameters, as well as a-h
/--help
parameters. In addition to allowing no parameters.
Consider simplifying
main()
– A common pattern is to have as simple main method as possible, that is to do parameter parsing and then calling the appopriate function to be executed. This allows for a simple entry point, and also easier reuse of functions if you extend your program.I.e. if you renamed your current
main()
toencode()
, you could easily add thedecode()
to the mix based on parameters. Having both of these withinmain()
would not look tidy. This would also be a clearer segregation of duty, and better single responsibility design.Try to reduce nesting of
if
's – This a little based on taste and personal opinions, but I tend to avoid usingwhile (true)
andbreak
if possible, and try to keep the nesting ofif
's to a minimum.In your case, this can be done using the following:
while ((current_char = getchar()) != EOF) { if (current_char == previous_char) { current_char_count += 1; } else { if (current_char_count > 0) { write_count(current_char_count); } write_char(current_char); current_char_count = 1; } previous_char = current_char; } if (current_char_count > 0) { write_count(current_char_count); }
To me this this clearer indicates the main purpose of reading characters until end of file, and by switching the
if
's around, I also clearer indicates that the main thingy is counting equal characters. And it is a clearer connection between theif
and theelse
so that it is easier to see why we entered theelse
clause.I've removed the reference related to
MAX_COUNT
as I consider it rather esoteric if you run this code on something with2^64
repetitions of a single character. Another change is that I need to finish of the writing of the last count in an additionalwrite_count()
at the end. Still I think this reads somewhat easier than your implementation.Optionally change brace style – This is totally a personal preference, and the main point is to keep bracing consistent, which you do! But I prefer having the opening brace on the same line in
C
, as I feel it make the code somewhat easier to read and slightly more compact.Do however note that I still keep braces around one-line blocks, as you do.
Refactored code
Here is the complete refactored code (using opening braces on previous line):
#include <stdio.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
const int BIT_SHIFT = 7;
const int LOWER_BITS = 0x7f;
const int HIGH_BIT = 0x80;
bool G_print_hex = false;
void write_char(int c)
{
if (printf(G_print_hex ? "0x%x " : "%c", c) < 0) {
perror("Error printing to stdout.");
exit(EXIT_FAILURE);
}
}
void write_count(uint64_t count)
{
int lower_bits = count & LOWER_BITS;
while (lower_bits != count || count > LOWER_BITS) {
write_char(lower_bits | HIGH_BIT);
count >>= BIT_SHIFT;
lower_bits = count & LOWER_BITS;
}
write_char(lower_bits);
}
void parse_args(int argc, char **argv)
{
if (argc == 2) {
G_print_hex = (strcasecmp(argv[1], "-x") == 0 || strcasecmp(argv[1], "--hex") == 0);
} else if (argc != 1) {
printf("Invalid number of arguments passed: %d\n", argc - 1);
exit(EXIT_FAILURE);
}
}
int main(int argc, char** argv)
{
int current_char = 0;
int previous_char = -1; // A non-legal char, so it triggers a change on first character begin chr(0)
uint64_t current_char_count = 0;
parse_args(argc, argv);
while ((current_char = getchar()) != EOF) {
if (current_char == previous_char) {
current_char_count += 1;
} else {
if (current_char_count > 0) {
write_count(current_char_count);
}
write_char(current_char);
current_char_count = 1;
}
previous_char = current_char;
}
if (current_char_count > 0) {
write_count(current_char_count);
}
}
I've also used two blank lines in between functions to help them stand out a little more, and I've been very lazy in writing comments.
PS! Using previous_char=-1
one avoids a bug when testing this with dd if=/dev/zero bs=1 count=65
, as getchar()
only returns unsigned char
's or EOF. But previous_char
is declared as an int, and as such using -1
as a start value, is safer than the original 0
.
-
\$\begingroup\$ Yeah, the command line thing I tacked on as a quick afterthought - I guess it shows. Good catch on the
stricmp
and the argument count checking. \$\endgroup\$ChrisWue– ChrisWue2016年01月04日 21:34:57 +00:00Commented Jan 4, 2016 at 21:34
A few minor notes (looks pretty good for the most part!):
Your
most_significant_bit()
function loop should be condensed into afor
loop. The main difference between thefor
's and thewhile
's is a matter of pragmatics: we usually usefor
when there is a known number of iterations (which might not seem like the case here, but it's true), and usewhile
constructs when the number of iterations in not known in advance.I'm not a fan of you using
-h
as the command line argument equivalent to--hex
. Whenever I think of-h
, I think of it as the shorthand of--help
. A better shorthand would be-x
in my opinion.Why do you do
current_char_count += 1
? Why notcurrent_char_count++
/++current_char_count
(both do the same thing in the same amount of time in this case)?
-
\$\begingroup\$ The
for
loop I disagree with - I find thewhile
loop in this case semantically more appropriate (if I explain the algorithm to myself in plain english it goes something like "while there are still bits set, shift value to right by one and increment the counter" - therefore I use awhile
loop). \$\endgroup\$ChrisWue– ChrisWue2016年01月04日 06:42:22 +00:00Commented Jan 4, 2016 at 6:42 -
\$\begingroup\$ Good point about the
-h
. Last one: depends what kind of code I've been reading in the most recent past. Currently I prefervar += 1
if it's a standalone statement. Technically the pre- instead of the post-fix increment should be used if any since the post-fix operator technically requires that a copy should be made (since the value before the operation has to be returned) - admittedly in this case the compiler will very likely optimize this away but still. \$\endgroup\$ChrisWue– ChrisWue2016年01月04日 06:42:26 +00:00Commented Jan 4, 2016 at 6:42 -
\$\begingroup\$ A post fix operator translates to a copy and increment while a prefix operator translates to an increment and copy. So they are both equally inexpensive (and in this case there is not even something to copy to). \$\endgroup\$Roy T.– Roy T.2016年01月04日 09:14:32 +00:00Commented Jan 4, 2016 at 9:14
-
\$\begingroup\$ @ChrisWue I guess it's all about how you understand the code then. For me,
for
loops are super easy to read, even in plain English. Plus the counter variable (which is usually justi
, but could vary) is more obvious to me and reduced in scope. \$\endgroup\$syb0rg– syb0rg2016年01月04日 15:01:17 +00:00Commented Jan 4, 2016 at 15:01 -
\$\begingroup\$ @syb0rg: sure, except in this case the loop counter would need to live beyond the score of the loop so it can be returned (unless I misunderstand the way you intended to use the four loop) \$\endgroup\$ChrisWue– ChrisWue2016年01月04日 17:43:44 +00:00Commented Jan 4, 2016 at 17:43