So, the concept of the code is simple. Take any file, read the binary character counts and construct a Huffman tree, and then output the bit code of each character (displayed as its 8-bit value).
This is for a class where we're learning proper modern C++ practices. So no C-style casting, no manual allocations, and unsafe pointers, etc.
I had to work a lot with unique_ptr and shared_ptr in this small assignment, and it's also my first time doing a const_cast which I know should be rarely if at all used.
#include <iostream>
#include <fstream>
#include <string>
#include <queue>
#include <vector>
#include <memory>
struct Node {
std::shared_ptr<Node> s_left;
std::shared_ptr<Node> s_right;
int16_t s_charcode;
uint64_t s_count;
Node(int16_t frequency, uint64_t c_count) {
s_left = s_right = nullptr;
s_charcode = frequency;
s_count = c_count;
}
Node(std::shared_ptr<Node> l, std::shared_ptr<Node> r) {
if (r->s_count > l->s_count) {
s_right = r;
s_left = l;
}
else {
s_right = l;
s_left = r;
}
s_count = l->s_count + r->s_count;
s_charcode = -1;
}
};
class CompareNode
{
public:
bool operator()(std::shared_ptr<Node>& left, std::shared_ptr<Node>& right) {
return left->s_count > right->s_count;
}
};
class HuffmanEncoder {
public:
HuffmanEncoder(const std::string& path) :
char_counts(256)
{
m_input.open(path, std::ios::binary);
if (!m_input.is_open()) {
std::cout << "File not accessible or doesn't exist.\n";
exit(EXIT_FAILURE);
}
if (!evaluate_input()) {
std::cout << "Error while reading file.\n";
exit(EXIT_FAILURE);
}
}
void construct_tree();
void output_codes();
private:
std::basic_ifstream<uint8_t> m_input;
std::vector<uint64_t> char_counts;
std::priority_queue<std::shared_ptr<Node>, std::vector<std::shared_ptr<Node>>, CompareNode> min_heap;
bool evaluate_input();
bool bit_search_traversal(const uint8_t& character, const std::shared_ptr<Node>& node, std::string& bits);
auto get_top_ptr();
};
bool HuffmanEncoder::evaluate_input()
{
uint8_t c;
while (m_input.get(c)) {
char_counts[c]++;
}
if (m_input.bad() || !m_input.eof())
return false;
for (uint16_t char_index = 0; char_index < 256; char_index++) {
if (char_counts[char_index] != 0)
min_heap.push(std::move(std::unique_ptr<Node>(new Node{ static_cast<int16_t>(char_index), char_counts[char_index] })));
}
return true;
}
// Has to be encapsulated cause it's a dangerous but valid operation
// Top method of pq returns const reference, but we need to move ownership of shared_ptr from pq, so const_cast
auto HuffmanEncoder::get_top_ptr()
{
std::shared_ptr<Node> temp = std::move(const_cast<std::shared_ptr<Node>&>(min_heap.top()));
min_heap.pop();
return std::move(temp);
}
void HuffmanEncoder::construct_tree() {
while (min_heap.size() > 1) {
std::shared_ptr<Node> node_one = get_top_ptr();
std::shared_ptr<Node> node_two = get_top_ptr();
min_heap.push(std::move(std::unique_ptr<Node>(new Node{ std::move(node_one), std::move(node_two) })));
}
}
// In-Order search
bool HuffmanEncoder::bit_search_traversal(const uint8_t& character, const std::shared_ptr<Node>& node, std::string& bits)
{
if (node == nullptr)
return false;
if (node->s_charcode == character)
return true;
bits += "0";
if (bit_search_traversal(character, node->s_left, bits))
return true;
else
bits.pop_back();
bits += "1";
if (bit_search_traversal(character, node->s_right, bits))
return true;
else
bits.pop_back();
return false;
}
void HuffmanEncoder::output_codes() {
for (uint16_t charcode = 0; charcode < 256; charcode++) {
if (char_counts[charcode] != 0) {
std::string bits{ "" };
bit_search_traversal(static_cast<uint8_t>(charcode), min_heap.top(), bits);
std::cout << charcode << ": " << bits << std::endl;
}
}
}
int main(int argc, char* argv[])
{
if (argc != 2)
return EXIT_FAILURE;
HuffmanEncoder encoder{ argv[1] };
encoder.construct_tree();
encoder.output_codes();
return 0;
}
NOTE: The exit() statements in the constructor and methods are there due to how the automatic test evaluates the code, otherwise I'd throw and exception
The code is as simple as it can be I'd reckon. I'm just worried whether I'm using the unique_ptr and shared_ptr along with all the moves correctly. Wouldn't want leaks and security issues, would we?..
I'm also wondering if it'll be able to process files of about 10GB. As far as I know, basic_fstream with .get() is buffered so there shouldn't be an issue there, but I'm new to C++ (about two months) so I'm not entirely sure.
Code compiled with -W4 with no errors or warnings. Example input file worked as intended.
-
1\$\begingroup\$ Welcome to Code Review! Great first question, is it possible to add a small example test input with the generated output? \$\endgroup\$user228914– user2289142020年11月13日 17:15:05 +00:00Commented Nov 13, 2020 at 17:15
-
\$\begingroup\$ Example added now :) \$\endgroup\$Jack Avante– Jack Avante2020年11月13日 17:20:11 +00:00Commented Nov 13, 2020 at 17:20
2 Answers 2
struct Node {
std::shared_ptr<Node> s_left;
std::shared_ptr<Node> s_right;
I don't think we really need shared_ptr
here, as we're not really sharing ownership of the nodes. std::unique_ptr
should work fine.
int16_t s_charcode;
uint64_t s_count;
Being pedantic, but we should use the integer types in the std::
namespace for C++ (std::int16_t
, std::uint64_t
), instead of the C ones in the global namespace.
Node(int16_t frequency, uint64_t c_count) {
s_left = s_right = nullptr;
s_charcode = frequency;
s_count = c_count;
}
We should use a constructor initializer list to initialize member variables:
Node(int16_t frequency, uint64_t c_count):
s_charcode(frequency),
s_count(c_count) { }
Node(std::shared_ptr<Node> l, std::shared_ptr<Node> r) {
if (r->s_count > l->s_count) {
s_right = r;
s_left = l;
}
else {
s_right = l;
s_left = r;
}
s_count = l->s_count + r->s_count;
s_charcode = -1;
}
Again, we can use the constructor initializer list. We can also avoid extra copies of the std::shared_ptr
by using std::move
, e.g.:
Node(std::shared_ptr<Node> l, std::shared_ptr<Node> r):
s_right(std::move(r)),
s_left(std::move(l)),
s_count(s_right->s_count + s_left->s_count),
s_charcode(-1)
{
if (s_right->s_count <= s_left->s_count)
s_right.swap(s_left);
}
bool operator()(std::shared_ptr<Node>& left, std::shared_ptr<Node>& right) {
return left->s_count > right->s_count;
}
This should take the shared_ptr
arguments by const &
. The function itself can also be declared const
, since it doesn't change anything in the CompareNode
class.
std::priority_queue<std::shared_ptr<Node>, std::vector<std::shared_ptr<Node>>, CompareNode> min_heap;
min_heap
is declared to use std::shared_ptr<Node>
, but is given std::unique_ptr
s:
for (uint16_t char_index = 0; char_index < 256; char_index++) {
if (char_counts[char_index] != 0)
min_heap.push(std::move(std::unique_ptr<Node>(new Node{ static_cast<int16_t>(char_index), char_counts[char_index] })));
}
Which is a little strange... We could create shared_ptr
s directly, or change the data structure to store unique_ptr
s.
Note that std::make_unique
and std::make_shared
exist, so we don't have to use new
.
auto HuffmanEncoder::get_top_ptr()
{
std::shared_ptr<Node> temp = std::move(const_cast<std::shared_ptr<Node>&>(min_heap.top()));
min_heap.pop();
return std::move(temp);
}
Note that return temp;
suffices. We don't need to explicitly call std::move
there.
min_heap.push(std::move(std::unique_ptr<Node>(new Node{ std::move(node_one), std::move(node_two) })));
Again, we can std::make_shared
or std::make_unique
, and there's no need for the outermost std::move
. The argument is already an r-value:
min_heap.push(std::make_unique<Node>(std::move(node_one), std::move(node_two)));
bool HuffmanEncoder::bit_search_traversal(const uint8_t& character, const std::shared_ptr<Node>& node, std::string& bits)
Passing the shared_ptr
by const&
is good. But there's no need to pass a Plain Old Data type like uint8_t
by const&
. It's faster to just copy it.
bits += "0";
if (bit_search_traversal(character, node->s_left, bits))
return true;
else
bits.pop_back();
Using a std::string
to collect the bits might be quite slow. Perhaps use a std::vector<bool>
for this and convert to string when we need to.
Also, I think we can avoid the pushing and popping from the string:
if (bit_search_traversal(character, node->s_left, bits))
{
bits += "0";
return true;
}
-
1\$\begingroup\$ We don't actually know whether to use
int16_t
orstd::int16_t
, since neither<stdint.h>
nor<cstdint>
is included! Obviously we prefer the latter in new C++ code, but the missing include is something else to mention in review. \$\endgroup\$Toby Speight– Toby Speight2020年11月13日 20:36:05 +00:00Commented Nov 13, 2020 at 20:36 -
\$\begingroup\$ Thanks for the awesome feedback! This will be very helpful to me in the future \$\endgroup\$Jack Avante– Jack Avante2020年11月13日 21:19:49 +00:00Commented Nov 13, 2020 at 21:19
-
1\$\begingroup\$ Just noticed. Where you suggest initializing the Node using the left and right node using an initializer list, it is not possible. Once one of the arguments was moved, it is not available anymore to be used in the next initialization step in the comparison \$\endgroup\$Jack Avante– Jack Avante2020年11月13日 22:53:20 +00:00Commented Nov 13, 2020 at 22:53
-
1\$\begingroup\$ Good point. I've edited the post with one possible way around this. \$\endgroup\$user673679– user6736792020年11月14日 10:27:27 +00:00Commented Nov 14, 2020 at 10:27
To add to user673679's answer:
Avoid using the auto
return type for member function declarations
You used the auto
return type for get_top_ptr()
, but this is the only feature you used that is not in C++11. So if you avoid this, your code can be compiled with a wider range of compilers. Also, it is much nicer to see up front in the function declaration inside class HuffmanEncoder
what return type to expect when calling this function.
Avoid non-standard ifstream
s
Unfortunately, the C++ standard does not guarantee that std::basic_ifstream<uint8_t>
works as expected, see this answer. To be compatible with as many C++ environments as possible, just use std::ifstream
, read char
s, and cast them as necessary, like so:
class HuffmanEncoder {
...
std::ifstream input;
};
bool HuffmanEncoder::evaluate_input()
{
char c;
while (m_input.get(c)) {
char_counts[static_cast<uint8_t>(c)]++;
}
...
}
Explore related questions
See similar questions with these tags.