I wrote a library that implements LZW compression and decompression. A goal of this project was to help me acquaint myself with modern C++ development practices (I primarily come from a Java background and have a smattering of C experience).
I want to use this library to compress data and stream it over TCP sockets to be decompressed by the recipient, all without storing a compressed version of the full data on either the sender or the recipient's machine (for hobby/non-production purposes).
lzw.hpp
#pragma once
#include <cstdint>
#include <iostream>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>
namespace lzw {
class lzw_encoder {
public:
lzw_encoder(std::istream &is, std::ostream &os);
void encode();
private:
uint32_t current_code = 0;
std::string current;
std::unordered_map<std::string, uint32_t> codebook;
std::istream &is;
std::ostream &os;
};
class lzw_decoder {
public:
lzw_decoder(std::istream &is, std::ostream &os);
void decode();
private:
std::vector<std::string> codebook;
std::optional<uint32_t> prev;
std::istream &is;
std::ostream &os;
};
} // namespace lzw
lzw.cpp
#include "lzw.hpp"
namespace lzw {
static constexpr size_t ENCODER_BUFFER_SIZE = 256;
static constexpr size_t DECODER_BUFFER_SIZE = 64;
lzw_encoder::lzw_encoder(std::istream &is, std::ostream &os)
: is(is), os(os), current_code(0) {
for (current_code = 0; current_code < 256; ++current_code) {
codebook[std::string(1, static_cast<char>(current_code))] = current_code;
}
}
void lzw_encoder::encode() {
char buffer[ENCODER_BUFFER_SIZE];
while (true) {
is.read(buffer, ENCODER_BUFFER_SIZE);
auto read_length = is.gcount();
if (read_length == 0)
break;
for (size_t i = 0; i < read_length; ++i) {
current.push_back(buffer[i]);
auto iter = codebook.find(current);
if (iter == codebook.end()) {
codebook[current] = current_code++;
current.pop_back();
auto code_val = codebook[current];
os.write(reinterpret_cast<char *>(&code_val), sizeof(code_val));
current.clear();
current.push_back(buffer[i]);
}
}
}
if (current.size()) {
auto code_val = codebook[current];
os.write(reinterpret_cast<char *>(&code_val), sizeof(code_val));
}
}
lzw_decoder::lzw_decoder(std::istream &is, std::ostream &os)
: is(is), os(os), prev{} {
for (int i = 0; i < 256; ++i) {
codebook.emplace_back(1, static_cast<char>(i));
}
}
void lzw_decoder::decode() {
uint32_t buffer[DECODER_BUFFER_SIZE];
while (true) {
is.read(reinterpret_cast<char *>(buffer),
DECODER_BUFFER_SIZE * sizeof(uint32_t));
auto read_length = is.gcount() / sizeof(uint32_t);
if (read_length == 0)
break;
for (size_t i = 0; i < read_length; ++i) {
if (buffer[i] < codebook.size()) {
os << codebook[buffer[i]];
if (prev) {
codebook.push_back(codebook[*prev] + codebook[buffer[i]].front());
}
} else {
codebook.push_back(codebook[*prev] + codebook[*prev].front());
os << codebook.back();
}
prev = buffer[i];
}
}
}
} // namespace lzw
I plan on replacing the unordered_map in the lzw_encoder with a dictionary trie in a future edit.
Does my code exhibit a reasonable way to use io streams?
I feel that my usage of read and write doesn't have a modern C++ feel, and I'm wondering if I'm unaware of some standard library tools that could help with binary I/O. In particular, I don't like that I used while (true) instead of some condition related to the input streams. Also, I was wondering if there is a way to do binary I/O without using reinterpret_cast to cast numeric/binary data pointers to char *.
1 Answer
Here are some things I see that may help you improve your code.
Shouldn't a compressed file be smaller?
Imagine my surprise when I discovered that a 2037-byte file (the lzw.cpp source code itself) became 3524 bytes when "compressed"! The original LZW algorithm encoded 8-bit values into 12-bit codes; this version encodes 8-bit values as 32-bit codes, which is unlikely to offer much compression for short files like this one. I did, however, try it on the plain-text version of Bram Stoker's Dracula and, as expected, the resulting file was about 75% of the size of the original. Because the input is a stream and you don't have access to the length of the source, there may not be much you can do about it, but it's probably a good thing to warn potential users about.
Rethink the interface
In order to use the compression, one must first create an object and then use it, perhaps like this:
lzw::lzw_encoder lzw(in, out);
lzw.encode();
Wouldn't it be nicer to just be able to do this?
lzw::encode(in, out);
Write member initializers in declaration order
The lzw_encoder
class has this constructor
lzw_encoder::lzw_encoder(std::istream &is, std::ostream &os)
: is(is), os(os), current_code(0) {
for (current_code = 0; current_code < 256; ++current_code) {
codebook[std::string(1, static_cast<char>(current_code))] = current_code;
}
}
That looks fine, but in fact, current_code
will be initialized before is
and os
because members are always initialized in declaration order and current_code
is declared before is
in this class. To avoid misleading another programmer, you could simply omit current_code
since it is already initialized by the declaration:
uint32_t current_code = 0;
Use standard algorithms where appropriate
Initializing the codebook uses this:
for (current_code = 0; current_code < 256; ++current_code) {
codebook[std::string(1, static_cast<char>(current_code))] = current_code;
}
This can be improved in a number of ways. First, we already know how large the initial codebook will be, so we can reduce rehashing by reserving that capacity up front:
codebook.reserve(256);
Next, we can avoid the cast and gain a bit of efficiency by using emplace
:
for (current_code = 0; current_code < 256; ++current_code) {
codebook.emplace(std::string(1, current_code), current_code);
}
I'd also recommend replacing 256
here with a static constexpr initial_codebook_size
.
Beware of endian-ness differences
The code currently contains these lines:
auto code_val = codebook[current];
os.write(reinterpret_cast<char *>(&code_val), sizeof(code_val));
The problem is that the encoding will differ depending on whether this is a big-endian or little-endian machine. If the compressed stream is intended to be sent to a different machine, the byte order needs to be consistent. Consider using something like the POSIX htonl
function here.
Consider restructuring loops
The problem with while(true)
is that it hides the loop exit condition. Instead of this:
while (true) {
is.read(buffer, ENCODER_BUFFER_SIZE);
auto read_length = is.gcount();
if (read_length == 0)
break;
// etc
}
Consider something like this:
while (is.read(buffer, ENCODER_BUFFER_SIZE)) {
// handle full block
}
if (is.gcount()) {
// handle final partial block
}
Understand the use of streams
It's possible that the caller has set one or both streams to throw an exception on encountering a failure such as end of file on read. Either reset the streams' exception masks for the duration of the call or handle such exceptions appropriately.
Consider adding convenience functions
The handling of blocks for encode and for decode could both be made into functions within the namespace. This would make the restructuring of loops as mentioned above a bit easier and cleaner and would isolate the handling of the data structures from basic stream I/O. That may make things a bit easier when you convert to a trie. Here's my rewrite of the loop:
while (is.read(buffer, ENCODER_BUFFER_SIZE)) {
encode_buffer(buffer, ENCODER_BUFFER_SIZE);
}
encode_buffer(buffer, is.gcount());
- Thanks for all of the tips! I added the trie in a recent edit, and got rid of the while(true) reading loop in the encoder by using std::istreambuf_iterator. For the decoder, instead of maintaining a buffer, I directly use std::istream::read to read a single uint32_t - I'm not sure what the performance difference would be in having my program maintain the read buffers. – user129137, Aug 18, 2020 at 15:56
- I think removing the encode and decode methods makes sense. In your comment about removing the while(true) loops by handling the final partial block separately: I think because I am using std::istream instead of std::ifstream, there is no guarantee that an incompletely filled block would be the last block, e.g. when is is std::cin, the block would be filled only up to the next newline if I am not mistaken. – user129137, Aug 18, 2020 at 16:02
- For using std::istream::read for a single uint32_t at a time, because most implementations I've seen are buffered, I doubt there would be a performance hit (but measure to be sure). For reading from std::cin, you may be right. As you correctly guessed, I was thinking in terms of std::ifstreams. – Edward, Aug 18, 2020 at 16:35