I would be very grateful to get some thoughts on this toy-implementation of a trie that I wrote to practice my C++.
Some questions I have:
- Is my
create
member function idiomatic? What's the standard way avoid duplicated copy in the copy-constructor and assignment operator? - Is there a way that I could avoid using the virtual function
get_word
? It seems safer to use a virtual function (to avoid accessing undefined memory), but it I don't see why I shouldn't be able to lookup the address wherethis->word
could reside.
General tips are appreciated, not just answers to these questions.
Thanks!
#include <unordered_map>
#include <string>
#include <iostream>
#include <stdexcept>
class TrieNode {
public:
std::unordered_map<char, TrieNode*> children;
TrieNode() {}
TrieNode(TrieNode const& trie_node) { create(trie_node); };
TrieNode& operator=(TrieNode const& rhs) {
if (&rhs != this) {
children.clear();
create(rhs);
}
return *this;
}
virtual ~TrieNode() {
for (auto const& it: children) {
delete it.second;
}
}
void create(TrieNode const&);
virtual std::string const& get_word() const { throw std::domain_error("get_word() called on non-terminal node."); };
};
class TerminalTrieNode: public TrieNode {
// terminal nodes represent full words
// they are pointed to by edges with key '0円'
public:
std::string word;
std::string const& get_word() const {return word;}
TerminalTrieNode(std::string word): word(word) {}
TerminalTrieNode(TrieNode const* terminal_trie_node) {
create(*terminal_trie_node);
word = terminal_trie_node->get_word();
}
};
void TrieNode::create(TrieNode const& trie_node) {
for (auto const& it: trie_node.children) {
if (it.first != '0円')
children[it.first] = new TrieNode(*it.second);
else
children[it.first] = new TerminalTrieNode(it.second);
}
}
class Trie {
private:
TrieNode root;
void apply_all(TrieNode const* node, void (function(std::string const&))) const {
// helper function for prefix_apply
// applies the given function to every word in the trie below the given word
auto it = node->children.find('0円');
if (it != node->children.end()) {
function(it->second->get_word());
}
for (auto const& c_it: node->children)
apply_all(c_it.second, function);
}
public:
void insert(std::string const& word) {
// add a new word to the trie
TrieNode* node = &root;
for (char const c: word) {
if (!node->children[c])
node->children[c] = new TrieNode();
node = node->children[c];
}
//mark as a terminal node
node->children['0円'] = new TerminalTrieNode(word);
}
bool contains(std::string const& word) const{
// set-membership test
TrieNode const* node = &root;
for (char const c: word) {
auto it = node->children.find(c);
if (it == node->children.end()) return false;
node = it->second;
}
return node->children.find('0円') != node->children.end();
}
void prefix_apply(std::string const& prefix, void (function(std::string const&))) const {
// Apply the given function to every word in the trie that has the given prefix
TrieNode const* node = &root;
for (char const c: prefix) {
auto it = node->children.find(c);
if (it == node->children.end()) return;
node = it->second;
}
apply_all(node, function);
}
};
int main(int argc, char** argv) {
Trie trie;
trie.insert("apple");
trie.insert("app");
trie.insert("apl");
trie.insert("back");
trie.insert("appl");
Trie new_trie = trie;
new_trie.prefix_apply("app", [](std::string const& s) {std::cout << s << '\n';});
}
3 Answers 3
Design
I don't think that TerminalTrieNode
is that good of an idea, for multiple reasons:
You effectively cannot insert byte-strings that contain the
char
value'0円'
. While this isn't likely to come up in purely ASCII based applications, it might e.g. for variants of unicode or handling of raw byte strings.Having to use pointers in
TrieNode::children
(due to inheritance) means(削除) doubling (削除ここまで)increasing the amount of pointer dereferences: One for the actual pointer, and at least one hidden in thestd::unordered_map
. This makes all operations slower due to cache misses.The whole
get_word
"hack" smells (as you noticed).
Also, it would be nice to retrieve information about which path a node is on.
Ideally, I'd like something along these lines:
class TrieNode {
// Note: store TrieNode by value!
std::unordered_map<char, TrieNode> children;
// simple indicator if this is a leaf node (end of a word)
bool is_leaf_node;
// A reference to the path so far
// Note: if memory usage gets too big, this could be replaced by a std::string_view
// and storing the leaf strings in a dedicated pool. Non-leaf nodes would only refer to
// a part of the stored string.
std::string path;
public:
TrieNode(std::string);
std::string_view word() const;
void make_leaf();
bool is_leaf() const;
};
Sadly, this won't compile, as
std::unordered_map
doesn't support incomplete types (TrieNode
is incomplete at the declaration ofTrieNode::children
). One easy solution would be using another hash map type that supports incomplete types, e.g.boost::unordered_map
.But: This solution sidesteps a lot (if not all) of the problems mentioned in @JVApen's answer.
Other issues
Trie::prefix_apply
only takes non-member functions, i.e. no function objects (like capturing lambdas or instance of classes with overloaded operator(std::string const&)
) or possibly member functions (including the corresponding object). This can be fixed in two ways:
Take a
std::function<void(std::string const&)>
as parameter. This supports all of the above, but might incur a heap allocation upon construction.Template it on a
Callable&&
which gets deduced for the passed function. This doesn't incur a heap allocation, but it also doesn't support the (admittedly rare case) usage of pointer to member functions. It also might cause some binary bloat, as a new function is generated for each instantiation (though they might provide more optimization opportunities, e.g. inlining).
A different design choice would be providing an iterator interface, though implementation of that is a bit more work. The advantage of that would be better compatibility with standard algorithms.
Looking at this code, I have several remarks, let me start with your questions:
Is my create member function idiomatic?
No, I barely see it. Mostly because it doesn't add value. Either classes are small enough so it doesn't matter, or copy/assign ain't supported, or it can be defaulted.
In this case, it looks better to have a virtual method clone
on the nose or make the structure templated (preferred by me).
Is there a way that I could avoid using the virtual function get_word?
In general I would make the node type would be a template argument, yes. Otherwise, you could cast (similar to CRTP), however that requires knowledge of the available types. For this specific case, I realize this ain't possible, however, it does make sense having only one node.
Ownership
You have some memory leaks in your code. I suggest to use a simple rule of thumb: always use unique_ptr instead of new/delete. Or when possible, just store by value in the map.
Relevance of member functions
prefix_apply
is a very good example of a function that should be a free function. You make a copy and adapt the whole structure. Why not build up a new map?
At the same time, it is very specific, so you could replace it by the visitor pattern.
Rule of five
Applying the rule of five allows you to move instead of copying. Most of the time, I even write move operations with the copy variant deleted.
Public members
Just don't, this is very hard to maintain. Especially when you encounter a specific implementation is buggy.
Purpose and performance
It looks like you are trying to make an ever growing string pool in which you want to share strings. How about using std::unordered_set
or just a vector on which to unique/erase.
Both variants will be more memory efficient and most like more performant.
-
\$\begingroup\$ A
trie
is a specific data structure, used to hold an associative array. There's a specific tag for it on Stack Overflow. It's common to spell out "five" in "Rule of Five" and not use the digit. \$\endgroup\$1201ProgramAlarm– 1201ProgramAlarm2018年07月17日 17:44:04 +00:00Commented Jul 17, 2018 at 17:44 -
\$\begingroup\$ I stand corrected, let me remove \$\endgroup\$JVApen– JVApen2018年07月17日 17:52:52 +00:00Commented Jul 17, 2018 at 17:52
-
\$\begingroup\$ Thanks for your response! Are there really memory leaks? Doesn't
~TrieNode()
recursively delete all of the nodes children before deleting itself? \$\endgroup\$Reid Hayes– Reid Hayes2018年07月17日 18:12:10 +00:00Commented Jul 17, 2018 at 18:12 -
1\$\begingroup\$ Operator= doesn't delete the nodes \$\endgroup\$JVApen– JVApen2018年07月17日 18:13:40 +00:00Commented Jul 17, 2018 at 18:13
-
\$\begingroup\$ Ah! Thanks! I was thinking that clear would call the destructors, but obviously it doesn't since its values are pointers. \$\endgroup\$Reid Hayes– Reid Hayes2018年07月17日 18:16:59 +00:00Commented Jul 17, 2018 at 18:16
Here's an updated version that I made with the help of the answers JVApen and hoffmale posted.
Instead of storing the full string in terminal Trie nodes (this can actually increase the space complexity e.g. for the case of a unary tree), I modified prefix_apply
to generate this strings as needed. This also lets us do away with the inheritance.
unique_ptr
can take care of all the memory management.
I added an erase member function.
Definitely a big improvement! Thanks for the help!
#include <utility>
#include <map>
#include <string>
#include <iostream> // for example usage
struct TrieNode {
// unique_ptr + trees are a great match since every vertex has a unique parent
std::map<char, std::unique_ptr<TrieNode>> children;
};
class Trie {
private:
TrieNode root;
TrieNode const* find(std::string const& w) const {
// Return a pointer to the TrieNode for the last character in w, or nullptr if it doesn't exist.
TrieNode const* node = &root;
for (char const c: w) {
auto const it = node->children.find(c);
if (it != node->children.end()) {
node = it->second.get();
} else return nullptr;
}
return node;
}
template<typename StringFunction>
void prefix_apply_helper(TrieNode const* node, std::string& w, StringFunction& f) const{
// Recursively apply f to all strings at or beneath node in the Trie.
// w should be the (possibly incomplete) word represented by node.
if (node->children.find('0円') != node->children.end())
f(w);
for (auto const& cw: node->children) {
if (cw.first != '0円') {
w.push_back(cw.first);
prefix_apply_helper(cw.second.get(), w, f);
w.pop_back();
}
}
}
public:
void insert(std::string const& w) {
// Insert a word into the Trie.
TrieNode* node = & root;
for (char const c: w) {
auto const it = node->children.find(c);
if (it != node->children.end()) {
node = it->second.get();
} else {
auto* new_node = new TrieNode();
node->children[c] = std::unique_ptr<TrieNode>(new_node);
node = new_node;
}
}
node->children['0円'];
}
bool contains(std::string const& w) const {
// Check if the Trie contains the word
auto const* node = find(w);
if (node)
return node->children.find('0円') != node->children.end();
else return false;
}
void erase(std::string const& w) {
// Remove the word from the Trie.
TrieNode* node = &root;
bool one_child_chain = false;
TrieNode* one_child_chain_start = nullptr;
char one_child_chain_first = '0円';
for (char const c: w) {
auto const it = node->children.find(c);
if (it != node->children.end()) {
if (!one_child_chain && it->second->children.size() == 1) {
one_child_chain = true;
one_child_chain_start = node;
one_child_chain_first = c;
} else if (it->second->children.size() != 1) {
one_child_chain = false;
}
node = it->second.get();
} else return;
}
node->children.erase('0円');
if (one_child_chain)
one_child_chain_start->children.erase(one_child_chain_first);
}
template<typename StringFunction>
void prefix_apply(std::string prefix, StringFunction& f) const {
// Apply f to every word in the Trie that has the given prefix
auto const* node = find(prefix);
if (node)
prefix_apply_helper(node, prefix, f);
}
};
int main(int argc, const char * argv[]) {
Trie t;
t.insert("apple");
t.insert("app");
t.insert("appalachians");
t.insert("banana");
t.erase("appalachians");
t.insert("appalled");
t.prefix_apply("app", [](std::string const& w) { std::cout << w << '\n'; });
}
static_cast<TerminalTrieNode*>(_)->word
works. \$\endgroup\$