A third follow up to the simple compression implementation. Adopted most of @holroy's suggestions. Main changes:
- Compression now uses high-bit to indicate that additional length bytes are still following
- Added decompression support
- Should no longer depend on
CHAR_BITS == 8
which increases portability - Use of
unsigned long long
since exact type size is not really important - Added additional options for printing command line help and decompression
- Options are no longer global, instead an
options
object is being passed around
And here comes the code:
#include <stdio.h>
#include <stdbool.h>
#include <stdlib.h>
#include <strings.h>
#include <limits.h>
#define BIT_SHIFT (CHAR_BIT - 1)
const int HIGH_BIT = 1 << BIT_SHIFT;
const int LOWER_BITS = ~(~0 << BIT_SHIFT);
typedef struct
{
bool print_hex : 1;
bool decompress : 1;
bool print_help : 1;
} options;
void write_char(int c, options *opts)
{
if (printf(opts->print_hex ? "0x%x " : "%c", c) < 0)
{
perror("Error printing to stdout.");
exit(EXIT_FAILURE);
}
}
void write_count(unsigned long long count, options *opts)
{
int lower_bits = count & LOWER_BITS;
while (lower_bits != count || count > LOWER_BITS)
{
write_char(lower_bits | HIGH_BIT, opts);
count >>= BIT_SHIFT;
lower_bits = count & LOWER_BITS;
}
write_char(lower_bits, opts);
}
bool is_one_of(const char *value, const char *short_option, const char *long_option)
{
return (strcasecmp(value, short_option) == 0 || strcasecmp(value, long_option) == 0);
}
options parse_args(int argc, char **argv)
{
options opts = { 0 };
for (int i = 1; i < argc; ++i)
{
if (is_one_of(argv[i], "-x", "--hex"))
{
opts.print_hex = true;
}
else if (is_one_of(argv[i], "-h", "--help"))
{
opts.print_help = true;
}
else if (is_one_of(argv[i], "-e", "--decompress"))
{
opts.decompress = true;
}
else
{
fprintf(stderr, "Invalid option %s\n", argv[i]);
opts.print_help = true;
}
}
return opts;
}
void compress(options *opts)
{
int current_char = 0;
int previous_char = 0;
unsigned long long current_char_count = 0;
while (EOF != (current_char = getchar()))
{
if (current_char_count > 0 && current_char_count < ULLONG_MAX && current_char == previous_char)
{
current_char_count += 1;
}
else
{
if (current_char_count > 0)
{
write_count(current_char_count, opts);
}
write_char(current_char, opts);
current_char_count = 1;
}
previous_char = current_char;
}
// write remainder
if (current_char_count > 0)
{
write_count(current_char_count, opts);
}
}
unsigned long long read_length()
{
unsigned long long length = 0;
bool expect_more_length_bytes = true;
while (expect_more_length_bytes)
{
int c = getchar();
if (c == EOF)
{
fprintf(stderr, "unexpected end of stream, expected more length bytes\n");
exit(EXIT_FAILURE);
}
expect_more_length_bytes = (c & HIGH_BIT) != 0;
length <<= BIT_SHIFT;
length |= (c & LOWER_BITS);
}
return length;
}
void decompress(options *opts)
{
int current_char = 0;
while (EOF != (current_char = getchar()))
{
unsigned long long length = read_length();
for (unsigned long long i = 0; i < length; ++i)
{
write_char(current_char, opts);
}
}
}
void print_help(char *program_name)
{
printf("Usage: %s [options]\n", program_name);
puts("Options:");
puts(" -h | --help\t\tprint this help");
puts(" -x | --hex\t\tprint all characters in hexadecimal form separated by spaces");
puts(" -e | --decompress\tdecompress the input");
puts("by default the input from stdin will be compresssed and written to stdout");
}
int main(int argc, char** argv)
{
options opts = parse_args(argc, argv);
if (opts.print_help)
{
print_help(argv[0]);
}
else if (opts.decompress)
{
decompress(&opts);
}
else
{
compress(&opts);
}
}
As usual any suggestions welcome (portability, standard conformity, best practices, etc.).
-
\$\begingroup\$ When I run the source code through the compiled program, the result is almost twice the size of the original. That does not meet my definition of compression! \$\endgroup\$Edward– Edward2016年01月05日 21:10:32 +00:00Commented Jan 5, 2016 at 21:10
-
\$\begingroup\$ @Edward: Yeah, compression is probably not quite the right word - more like run length encoding. This program was originally inspired just by some other code review question. And believe it or not - I have an application where the input is suited to this and compresses quite nicely without the need of having to add a dependency on some external ZIP library. \$\endgroup\$ChrisWue– ChrisWue2016年01月05日 21:29:46 +00:00Commented Jan 5, 2016 at 21:29
2 Answers 2
I see a number of things that may help you improve your code. First, nice job on error handling! Too many programs posted for review here don't have any.
Consider portability
The code is currently using strcasecmp
which is a POSIX rather than a C standard library. That's not necessarily a problem unless you want to compile and run the code in a non-POSIX environment. If you'd like to use only standard stuff, you could use strcmp
in <string.h>
instead. If you are only intending to run this on POSIX-like environments, you might want to use getopt
instead of rolling your own option parser. Also consider whether you really want -e
and -E
to be the same thing. I'd recommend against that.
Avoid bitfields
Let the compiler handle the packing of bool
s within your struct. It's not size critical in this application and it may well be faster for the compiler to handle it some other way other than as a bitfield.
Consider using a better algorithm
As I mentioned in the comment, this program actually expands rather than compressing in many cases. We can do better! The current scheme stores this way:
[count][byte]
However, unless there at least three repetitions of the byte in the original file, this will actually blow up rather than compress. Consider the following alternative scheme:
[marker][count][byte]
OR
[byte]
Choose a specific MARKER
byte that occurs very infrequently in the input. For compression, count occurrences as you currently do. Then when either there are at least 4 repetitions (making it worthwhile to encode this way) or the input byte was the MARKER
character itself, encode using the first scheme. Otherwise, just write the original input byte. We can add just these two globals:
const char MARKER = '`';
const unsigned MIN_COMPRESS_BYTES = 3;
I've added this new routine to simplify the compression code:
void emit_token(options *opts, int current_char, unsigned long long current_char_count)
{
if (current_char_count > MIN_COMPRESS_BYTES || current_char == MARKER)
{
write_char(MARKER, opts);
write_count(current_char_count, opts);
current_char_count = 1;
}
for ( ; current_char_count; --current_char_count) {
write_char(current_char, opts);
}
}
There are only minor changes to the compress
routine:
void compress(options *opts)
{
int current_char = 0;
int previous_char = 0;
unsigned long long current_char_count = 0;
while (EOF != (current_char = getchar()))
{
if (current_char_count > 0 && current_char_count < ULLONG_MAX && current_char == previous_char)
{
current_char_count += 1;
}
else
{
emit_token(opts, previous_char, current_char_count);
current_char_count = 1;
}
previous_char = current_char;
}
// write remainder
if (current_char_count)
emit_token(opts, previous_char, current_char_count);
}
Even fewer changes were required for the decompress
routine:
void decompress(options *opts)
{
int current_char = 0;
while (EOF != (current_char = getchar()))
{
if (current_char != MARKER) {
write_char(current_char, opts);
} else {
unsigned long long length = read_length();
for (current_char = getchar(); length; --length) {
write_char(current_char, opts);
}
}
}
}
Results
Now instead of making the file almost twice the size, when I run the program on its own source code (4269 bytes on my machine), the resulting file is 3867 bytes long. For data on which the original code worked well, this modification should still work well. For data that the original code expanded, the worst case is considerably better.
#include <strings.h>
This header doesn't exist under non-POSIX implementations. It should be #include <string.h>
.
strcasecmp(value, short_option)
strcasecmp
is not in the C Standard Library. It is only defined in the POSIX implementations. For portability, you should implement your own wrapper function that calls the strcasecmp
function or your own equivalent function depending on if we're linking to POSIX.
int mystrcasecmp(const char *a, const char *b)
{
#ifdef POSIX
return strcasecmp(a, b);
#else
return mycustomstrcasecmp(a, b);
#endif
}
typedef struct
{
bool print_hex : 1;
bool decompress : 1;
bool print_help : 1;
} options;
I personally dislike bitfields for many of the reasons listed here. I would just use a nice and clear implementation like this:
typedef struct
{
bool print_hex;
bool decompress;
bool print_help;
} options;
If you really feel like twiddling with bits, use an unsigned char
and assign a bit to each option. Then mask for each of them.
typedef struct
{
// 1 is print_hex (mask of 0xFE)
// 2 is decompress (mask of 0xFD)
// 4 is print_help (mask of 0xFC)
unsigned char ops;
} options;
-
\$\begingroup\$ As stated in the help output - compressing the input is the default behaviour so that is not bug - it's a feature :). \$\endgroup\$ChrisWue– ChrisWue2016年01月05日 21:25:26 +00:00Commented Jan 5, 2016 at 21:25
-
\$\begingroup\$ Actually I should probably get rid of the case insensitive comparison - options are generally treated as case sensitive (at least on Unix like systems) \$\endgroup\$ChrisWue– ChrisWue2016年01月05日 21:35:10 +00:00Commented Jan 5, 2016 at 21:35