I wrote a library that implements LZW compression and decompression. A goal of this project was to help me acquaint myself with modern C++ development practices (I primarily come from a Java background and have a smattering of C experience).
I want to use this library to compress data and stream it over TCP sockets to be decompressed by the recipient, all without storing a compressed version of the full data on either the sender or the recipient's machine (for hobby/non-production purposes).
lzw.hpp
#pragma once
#include <cstdint>
#include <iostream>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>
namespace lzw {
class lzw_encoder {
public:
lzw_encoder(std::istream &is, std::ostream &os);
void encode();
private:
uint32_t current_code = 0;
std::string current;
std::unordered_map<std::string, uint32_t> codebook;
std::istream &is;
std::ostream &os;
};
class lzw_decoder {
public:
lzw_decoder(std::istream &is, std::ostream &os);
void decode();
private:
std::vector<std::string> codebook;
std::optional<uint32_t> prev;
std::istream &is;
std::ostream &os;
};
} // namespace lzw
lzw.cpp
#include "lzw.hpp"
namespace lzw {
static constexpr size_t ENCODER_BUFFER_SIZE = 256;
static constexpr size_t DECODER_BUFFER_SIZE = 64;
lzw_encoder::lzw_encoder(std::istream &is, std::ostream &os)
: is(is), os(os), current_code(0) {
for (current_code = 0; current_code < 256; ++current_code) {
codebook[std::string(1, static_cast<char>(current_code))] = current_code;
}
}
void lzw_encoder::encode() {
char buffer[ENCODER_BUFFER_SIZE];
while (true) {
is.read(buffer, ENCODER_BUFFER_SIZE);
auto read_length = is.gcount();
if (read_length == 0)
break;
for (size_t i = 0; i < read_length; ++i) {
current.push_back(buffer[i]);
auto iter = codebook.find(current);
if (iter == codebook.end()) {
codebook[current] = current_code++;
current.pop_back();
auto code_val = codebook[current];
os.write(reinterpret_cast<char *>(&code_val), sizeof(code_val));
current.clear();
current.push_back(buffer[i]);
}
}
}
if (current.size()) {
auto code_val = codebook[current];
os.write(reinterpret_cast<char *>(&code_val), sizeof(code_val));
}
}
lzw_decoder::lzw_decoder(std::istream &is, std::ostream &os)
: is(is), os(os), prev{} {
for (int i = 0; i < 256; ++i) {
codebook.emplace_back(1, static_cast<char>(i));
}
}
void lzw_decoder::decode() {
uint32_t buffer[DECODER_BUFFER_SIZE];
while (true) {
is.read(reinterpret_cast<char *>(buffer),
DECODER_BUFFER_SIZE * sizeof(uint32_t));
auto read_length = is.gcount() / sizeof(uint32_t);
if (read_length == 0)
break;
for (size_t i = 0; i < read_length; ++i) {
if (buffer[i] < codebook.size()) {
os << codebook[buffer[i]];
if (prev) {
codebook.push_back(codebook[*prev] + codebook[buffer[i]].front());
}
} else {
codebook.push_back(codebook[*prev] + codebook[*prev].front());
os << codebook.back();
}
prev = buffer[i];
}
}
}
} // namespace lzw
I plan on replacing the unordered_map in the lzw_encoder with a dictionary trie in a future edit.
Does my code exhibit a reasonable way to use io streams?
I feel that my usage of read and write doesn't have a modern C++ feel, and I'm wondering if I'm unaware of some standard library tools that could help with binary I/O. In particular, I don't like that I used while (true) instead of some condition related to the input streams. Also, I was wondering if there is a way to do binary I/O without using reinterpret_cast to cast numeric/binary data pointers to char *.
1 Answer
Here are some things I see that may help you improve your code.
Shouldn't a compressed file be smaller?
Imagine my surprise when I discovered that a 2037-byte file (the lzw.cpp source code itself) became 3524 bytes when "compressed"! The original LZW algorithm encoded 8-bit values into 12-bit codes; this version encodes 8-bit values as 32-bit codes, which is unlikely to offer much compression for short files like this one. I did, however, try it on the plain-text version of Bram Stoker's Dracula and, as expected, the resulting file was about 75% of the size of the original. Because the input is a stream and you don't have access to the length of the source, there may not be much you can do about it, but it's probably a good thing to warn potential users about.
Rethink the interface
In order to use the compression, one must first create an object and then use it, perhaps like this:
lzw::lzw_encoder lzw(in, out);
lzw.encode();
Wouldn't it be nicer to just be able to do this?
lzw::encode(in, out);
Write member initializers in declaration order
The lzw_encoder
class has this constructor
lzw_encoder::lzw_encoder(std::istream &is, std::ostream &os)
: is(is), os(os), current_code(0) {
for (current_code = 0; current_code < 256; ++current_code) {
codebook[std::string(1, static_cast<char>(current_code))] = current_code;
}
}
That looks fine, but in fact, current_code
will be initialized before is
and os
because members are always initialized in declaration order and current_code
is declared before is
in this class. To avoid misleading another programmer, you could simply omit current_code
since it is already initialized by the declaration:
uint32_t current_code = 0;
Use standard algorithms where appropriate
Initializing the codebook uses this:
for (current_code = 0; current_code < 256; ++current_code) {
codebook[std::string(1, static_cast<char>(current_code))] = current_code;
}
This can be improved in a number of ways. First, we already know how large the initial codebook will be, so we can reduce rehashing by reserving that capacity up front:
codebook.reserve(256);
Next, we can avoid the cast and gain a bit of efficiency by using emplace
:
for (current_code = 0; current_code < 256; ++current_code) {
codebook.emplace(std::string(1, current_code), current_code);
}
I'd also recommend replacing 256
here with a static constexpr initial_codebook_size
.
Beware of endian-ness differences
The code currently contains these lines:
auto code_val = codebook[current];
os.write(reinterpret_cast<char *>(&code_val), sizeof(code_val));
The problem is that the encoding will differ depending on whether this is a big-endian or little-endian machine. If the compressed stream is intended to be sent to a different machine, the byte order needs to be consistent. Consider using something like the POSIX htonl
function here.
Consider restructuring loops
The problem with while(true)
is that it hides the loop exit condition. Instead of this:
while (true) {
is.read(buffer, ENCODER_BUFFER_SIZE);
auto read_length = is.gcount();
if (read_length == 0)
break;
// etc
}
Consider something like this:
while (is.read(buffer, ENCODER_BUFFER_SIZE)) {
// handle full block
}
if (is.gcount()) {
// handle final partial block
}
Understand the use of streams
It's possible that the caller has set one or both streams to throw an exception on encountering a failure such as end of file on read. Either reset the streams' exception masks for the duration of the call or handle such exceptions appropriately.
Consider adding convenience functions
The handling of blocks for encode and for decode could both be made into functions within the namespace. This would make the restructuring of loops as mentioned above a bit easier and cleaner and would isolate the handling of the data structures from basic stream I/O. That may make things a bit easier when you convert to a trie. Here's my rewrite of the loop:
while (is.read(buffer, ENCODER_BUFFER_SIZE)) {
encode_buffer(buffer, ENCODER_BUFFER_SIZE);
}
encode_buffer(buffer, is.gcount());
- Thanks for all of the tips! I added the trie in a recent edit, and got rid of the while(true) reading loop in the encoder by using std::istreambuf_iterator. For the decoder, instead of maintaining a buffer, I directly use std::istream::read to read a single uint32_t - I'm not sure what the performance difference would be in having my program maintain the read buffers. – user129137, Aug 18, 2020 at 15:56
- I think removing the encode and decode methods makes sense. In your comment about removing the while(true) loops by handling the final partial block separately: I think because I am using std::istream instead of std::ifstream, there is no guarantee that an incompletely filled block would be the last block, e.g. when is is std::cin, the block would be filled only up to the next newline if I am not mistaken. – user129137, Aug 18, 2020 at 16:02
- For using std::istream::read for a single uint32_t at a time, because most implementations I've seen are buffered, I doubt there would be a performance hit (but measure to be sure). For reading from std::cin, you may be right. As you correctly guessed, I was thinking in terms of std::ifstreams. – Edward, Aug 18, 2020 at 16:35