Huffman encoding bitcodes display

Question 1

So, the concept of the code is simple. Take any file, read the binary character counts and construct a Huffman tree, and then output the bit code of each character (displayed as its 8-bit value).

This is for a class where we're learning proper modern C++ practices. So no C-style casting, no manual allocations, and unsafe pointers, etc.

I had to work a lot with unique_ptr and shared_ptr in this small assignment, and it's also my first time doing a const_cast which I know should be rarely if at all used.

#include <iostream>
#include <fstream>
#include <string>
#include <queue>
#include <vector>
#include <memory>
struct Node {
 std::shared_ptr<Node> s_left;
 std::shared_ptr<Node> s_right;
 int16_t s_charcode;
 uint64_t s_count;
 Node(int16_t frequency, uint64_t c_count) {
 s_left = s_right = nullptr;
 s_charcode = frequency;
 s_count = c_count;
 }
 Node(std::shared_ptr<Node> l, std::shared_ptr<Node> r) {
 if (r->s_count > l->s_count) {
 s_right = r;
 s_left = l;
 }
 else {
 s_right = l;
 s_left = r;
 }
 s_count = l->s_count + r->s_count;
 s_charcode = -1;
 }
};
class CompareNode
{
public:
 bool operator()(std::shared_ptr<Node>& left, std::shared_ptr<Node>& right) {
 return left->s_count > right->s_count;
 }
};
class HuffmanEncoder {
public:
 HuffmanEncoder(const std::string& path) :
 char_counts(256)
 {
 m_input.open(path, std::ios::binary);
 if (!m_input.is_open()) {
 std::cout << "File not accessible or doesn't exist.\n";
 exit(EXIT_FAILURE);
 }
 if (!evaluate_input()) {
 std::cout << "Error while reading file.\n";
 exit(EXIT_FAILURE);
 }
 }
 void construct_tree();
 void output_codes();
private:
 std::basic_ifstream<uint8_t> m_input;
 std::vector<uint64_t> char_counts;
 std::priority_queue<std::shared_ptr<Node>, std::vector<std::shared_ptr<Node>>, CompareNode> min_heap;
 bool evaluate_input();
 bool bit_search_traversal(const uint8_t& character, const std::shared_ptr<Node>& node, std::string& bits);
 auto get_top_ptr();
};
bool HuffmanEncoder::evaluate_input()
{
 uint8_t c;
 while (m_input.get(c)) {
 char_counts[c]++;
 }
 if (m_input.bad() || !m_input.eof())
 return false;
 for (uint16_t char_index = 0; char_index < 256; char_index++) {
 if (char_counts[char_index] != 0)
 min_heap.push(std::move(std::unique_ptr<Node>(new Node{ static_cast<int16_t>(char_index), char_counts[char_index] })));
 }
 return true;
}
// Has to be encapsulated cause it's a dangerous but valid operation
// Top method of pq returns const reference, but we need to move ownership of shared_ptr from pq, so const_cast
auto HuffmanEncoder::get_top_ptr()
{
 std::shared_ptr<Node> temp = std::move(const_cast<std::shared_ptr<Node>&>(min_heap.top()));
 min_heap.pop();
 return std::move(temp);
}
void HuffmanEncoder::construct_tree() {
 while (min_heap.size() > 1) {
 std::shared_ptr<Node> node_one = get_top_ptr();
 std::shared_ptr<Node> node_two = get_top_ptr();
 min_heap.push(std::move(std::unique_ptr<Node>(new Node{ std::move(node_one), std::move(node_two) })));
 }
}
// In-Order search
bool HuffmanEncoder::bit_search_traversal(const uint8_t& character, const std::shared_ptr<Node>& node, std::string& bits)
{
 if (node == nullptr)
 return false;
 if (node->s_charcode == character)
 return true;
 bits += "0";
 if (bit_search_traversal(character, node->s_left, bits))
 return true;
 else
 bits.pop_back();
 bits += "1";
 if (bit_search_traversal(character, node->s_right, bits))
 return true;
 else
 bits.pop_back();
 return false;
}
void HuffmanEncoder::output_codes() {
 for (uint16_t charcode = 0; charcode < 256; charcode++) {
 if (char_counts[charcode] != 0) {
 std::string bits{ "" };
 bit_search_traversal(static_cast<uint8_t>(charcode), min_heap.top(), bits);
 std::cout << charcode << ": " << bits << std::endl;
 }
 }
}
int main(int argc, char* argv[])
{
 if (argc != 2)
 return EXIT_FAILURE;
 HuffmanEncoder encoder{ argv[1] };
 encoder.construct_tree();
 encoder.output_codes();
 return 0;
}

NOTE: The exit() statements in the constructor and methods are there due to how the automatic test evaluates the code, otherwise I'd throw and exception

The code is as simple as it can be I'd reckon. I'm just worried whether I'm using the unique_ptr and shared_ptr along with all the moves correctly. Wouldn't want leaks and security issues, would we?..

I'm also wondering if it'll be able to process files of about 10GB. As far as I know, basic_fstream with .get() is buffered so there shouldn't be an issue there, but I'm new to C++ (about two months) so I'm not entirely sure.

Code compiled with -W4 with no errors or warnings. Example input file worked as intended.

Here is a link to GDrive with an example input file and its output. The example_cleartext_input is the same as example_input

Question 2

Welcome to Code Review! Great first question, is it possible to add a small example test input with the generated output?

Question 3

Example added now :)

Question 4

struct Node {
 std::shared_ptr<Node> s_left;
 std::shared_ptr<Node> s_right;

I don't think we really need shared_ptr here, as we're not really sharing ownership of the nodes. std::unique_ptr should work fine.

int16_t s_charcode;
uint64_t s_count;

Being pedantic, but we should use the integer types in the std:: namespace for C++ (std::int16_t, std::uint64_t), instead of the C ones in the global namespace.

Node(int16_t frequency, uint64_t c_count) {
 s_left = s_right = nullptr;
 s_charcode = frequency;
 s_count = c_count;
}

We should use a constructor initializer list to initialize member variables:

Node(int16_t frequency, uint64_t c_count):
 s_charcode(frequency),
 s_count(c_count) { }

Node(std::shared_ptr<Node> l, std::shared_ptr<Node> r) {
 if (r->s_count > l->s_count) {
 s_right = r;
 s_left = l;
 }
 else {
 s_right = l;
 s_left = r;
 }
 s_count = l->s_count + r->s_count;
 s_charcode = -1;
}

Again, we can use the constructor initializer list. We can also avoid extra copies of the std::shared_ptr by using std::move, e.g.:

Node(std::shared_ptr<Node> l, std::shared_ptr<Node> r):
 s_right(std::move(r)),
 s_left(std::move(l)),
 s_count(s_right->s_count + s_left->s_count),
 s_charcode(-1)
{
 if (s_right->s_count <= s_left->s_count)
 s_right.swap(s_left);
}

 bool operator()(std::shared_ptr<Node>& left, std::shared_ptr<Node>& right) {
 return left->s_count > right->s_count;
 }

This should take the shared_ptr arguments by const &. The function itself can also be declared const, since it doesn't change anything in the CompareNode class.

std::priority_queue<std::shared_ptr<Node>, std::vector<std::shared_ptr<Node>>, CompareNode> min_heap;

min_heap is declared to use std::shared_ptr<Node>, but is given std::unique_ptrs:

for (uint16_t char_index = 0; char_index < 256; char_index++) {
 if (char_counts[char_index] != 0)
 min_heap.push(std::move(std::unique_ptr<Node>(new Node{ static_cast<int16_t>(char_index), char_counts[char_index] })));
}

Which is a little strange... We could create shared_ptrs directly, or change the data structure to store unique_ptrs.

Note that std::make_unique and std::make_shared exist, so we don't have to use new.

auto HuffmanEncoder::get_top_ptr()
{
 std::shared_ptr<Node> temp = std::move(const_cast<std::shared_ptr<Node>&>(min_heap.top()));
 min_heap.pop();
 return std::move(temp);
}

Note that return temp; suffices. We don't need to explicitly call std::move there.

 min_heap.push(std::move(std::unique_ptr<Node>(new Node{ std::move(node_one), std::move(node_two) })));

Again, we can std::make_shared or std::make_unique, and there's no need for the outermost std::move. The argument is already an r-value:

 min_heap.push(std::make_unique<Node>(std::move(node_one), std::move(node_two)));

bool HuffmanEncoder::bit_search_traversal(const uint8_t& character, const std::shared_ptr<Node>& node, std::string& bits)

Passing the shared_ptr by const& is good. But there's no need to pass a Plain Old Data type like uint8_t by const&. It's faster to just copy it.

bits += "0";
if (bit_search_traversal(character, node->s_left, bits))
 return true;
else
 bits.pop_back();

Using a std::string to collect the bits might be quite slow. Perhaps use a std::vector<bool> for this and convert to string when we need to.

Also, I think we can avoid the pushing and popping from the string:

if (bit_search_traversal(character, node->s_left, bits))
{
 bits += "0";
 return true;
}

Question 5

We don't actually know whether to use int16_t or std::int16_t, since neither <stdint.h> nor <cstdint> is included! Obviously we prefer the latter in new C++ code, but the missing include is something else to mention in review.

Question 6

Thanks for the awesome feedback! This will be very helpful to me in the future

Question 7

Just noticed. Where you suggest initializing the Node using the left and right node using an initializer list, it is not possible. Once one of the arguments was moved, it is not available anymore to be used in the next initialization step in the comparison

Question 8

Good point. I've edited the post with one possible way around this.

Question 9

To add to user673679's answer:

Avoid using the `auto` return type for member function declarations

You used the auto return type for get_top_ptr(), but this is the only feature you used that is not in C++11. So if you avoid this, your code can be compiled with a wider range of compilers. Also, it is much nicer to see up front in the function declaration inside class HuffmanEncoder what return type to expect when calling this function.

Avoid non-standard `ifstream`s

Unfortunately, the C++ standard does not guarantee that std::basic_ifstream<uint8_t> works as expected, see this answer. To be compatible with as many C++ environments as possible, just use std::ifstream, read chars, and cast them as necessary, like so:

class HuffmanEncoder {
 ...
 std::ifstream input;
};
bool HuffmanEncoder::evaluate_input()
{
 char c;
 while (m_input.get(c)) {
 char_counts[static_cast<uint8_t>(c)]++;
 }
 ...
}

user673679 user673679 12.2k2 gold badges34 silver badges65 bronze badges · Accepted Answer · 2020-11-13 19:34:17Z

struct Node {
 std::shared_ptr<Node> s_left;
 std::shared_ptr<Node> s_right;

I don't think we really need shared_ptr here, as we're not really sharing ownership of the nodes. std::unique_ptr should work fine.

int16_t s_charcode;
uint64_t s_count;

Being pedantic, but we should use the integer types in the std:: namespace for C++ (std::int16_t, std::uint64_t), instead of the C ones in the global namespace.

Node(int16_t frequency, uint64_t c_count) {
 s_left = s_right = nullptr;
 s_charcode = frequency;
 s_count = c_count;
}

We should use a constructor initializer list to initialize member variables:

Node(int16_t frequency, uint64_t c_count):
 s_charcode(frequency),
 s_count(c_count) { }

Node(std::shared_ptr<Node> l, std::shared_ptr<Node> r) {
 if (r->s_count > l->s_count) {
 s_right = r;
 s_left = l;
 }
 else {
 s_right = l;
 s_left = r;
 }
 s_count = l->s_count + r->s_count;
 s_charcode = -1;
}

Again, we can use the constructor initializer list. We can also avoid extra copies of the std::shared_ptr by using std::move, e.g.:

Node(std::shared_ptr<Node> l, std::shared_ptr<Node> r):
 s_right(std::move(r)),
 s_left(std::move(l)),
 s_count(s_right->s_count + s_left->s_count),
 s_charcode(-1)
{
 if (s_right->s_count <= s_left->s_count)
 s_right.swap(s_left);
}

 bool operator()(std::shared_ptr<Node>& left, std::shared_ptr<Node>& right) {
 return left->s_count > right->s_count;
 }

This should take the shared_ptr arguments by const &. The function itself can also be declared const, since it doesn't change anything in the CompareNode class.

std::priority_queue<std::shared_ptr<Node>, std::vector<std::shared_ptr<Node>>, CompareNode> min_heap;

min_heap is declared to use std::shared_ptr<Node>, but is given std::unique_ptrs:

for (uint16_t char_index = 0; char_index < 256; char_index++) {
 if (char_counts[char_index] != 0)
 min_heap.push(std::move(std::unique_ptr<Node>(new Node{ static_cast<int16_t>(char_index), char_counts[char_index] })));
}

Which is a little strange... We could create shared_ptrs directly, or change the data structure to store unique_ptrs.

Note that std::make_unique and std::make_shared exist, so we don't have to use new.

auto HuffmanEncoder::get_top_ptr()
{
 std::shared_ptr<Node> temp = std::move(const_cast<std::shared_ptr<Node>&>(min_heap.top()));
 min_heap.pop();
 return std::move(temp);
}

Note that return temp; suffices. We don't need to explicitly call std::move there.

 min_heap.push(std::move(std::unique_ptr<Node>(new Node{ std::move(node_one), std::move(node_two) })));

Again, we can std::make_shared or std::make_unique, and there's no need for the outermost std::move. The argument is already an r-value:

 min_heap.push(std::make_unique<Node>(std::move(node_one), std::move(node_two)));

bool HuffmanEncoder::bit_search_traversal(const uint8_t& character, const std::shared_ptr<Node>& node, std::string& bits)

Passing the shared_ptr by const& is good. But there's no need to pass a Plain Old Data type like uint8_t by const&. It's faster to just copy it.

bits += "0";
if (bit_search_traversal(character, node->s_left, bits))
 return true;
else
 bits.pop_back();

Using a std::string to collect the bits might be quite slow. Perhaps use a std::vector<bool> for this and convert to string when we need to.

Also, I think we can avoid the pushing and popping from the string:

if (bit_search_traversal(character, node->s_left, bits))
{
 bits += "0";
 return true;
}

We don't actually know whether to use int16_t or std::int16_t, since neither <stdint.h> nor <cstdint> is included! Obviously we prefer the latter in new C++ code, but the missing include is something else to mention in review.
Thanks for the awesome feedback! This will be very helpful to me in the future
Just noticed. Where you suggest initializing the Node using the left and right node using an initializer list, it is not possible. Once one of the arguments was moved, it is not available anymore to be used in the next initialization step in the comparison
Good point. I've edited the post with one possible way around this.

Stack Exchange Network

Huffman encoding bitcodes display

2 Answers 2

Avoid using the `auto` return type for member function declarations

Avoid non-standard `ifstream`s

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Huffman encoding bitcodes display

2 Answers 2

Avoid using the auto return type for member function declarations

Avoid non-standard ifstreams

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions

Avoid using the `auto` return type for member function declarations

Avoid non-standard `ifstream`s