I've written a tree data structure that takes in input from the user, and stores the each letter based on the previous letter in the text. I'm not sure what this is called, but the closest I can find is an auto complete type of tree, where someone would traverse this tree to find the next letter with the highest frequency, then build off that to form a word to suggest.
I'm mainly looking for performace/algorithm critiques and suggestions, specifically in the add_node
function. This is where I had the most problems while writing this, and the excess amount of comments are me "thinking aloud" to myself while working through the problem.
"""
Letter Frequency Tree. This populates a tree based on the users input, keeping track of words entered and the
frequency of each letter the user inputs.
@author Ben Antonellis
"""
import string
class Node:
def __init__(self, letter: str) -> None:
self.letter = letter
self.count = 0
self.nodes = []
def add_node(self, node: object) -> None:
# Check if the string is currently empty:
if node.letter == "":
# Now we return since we're all out of letters
return
# Determine if letter is already a child node
next_node = self.__has_letter_node(node.letter[0])
# If there is no child node for that letter
if next_node == None:
# We need to create a new node with the new letter
new_node = Node(node.letter[0])
# Because there is no child, we increment the count by 1
new_node.count += 1
# Now we add the node to the nodes list
self.nodes.append(new_node)
# Now we need to work with the rest of the letters on the new node, skipping the front letter
new_node.add_node(Node(node.letter[1:]))
# If there is a child node present
else:
# We need to add the count to that letter
next_node.count += 1
# Now we add the next letters to the current node
next_node.add_node(Node(node.letter[1:]))
def __has_letter_node(self, letter: str) -> object | None:
for node in self.nodes:
if node.letter == letter:
return node
return None
def has_nodes(self) -> bool:
return self.nodes != []
def __str__(self, level=0) -> str:
# Martijn Pieters: https://stackoverflow.com/a/20242504/8968906
ret = "\t" * level + repr(self.letter) + str(self.count) + "\n"
for child in self.nodes:
ret += child.__str__(level + 1)
return ret
def __repr__(self) -> str:
return self.__str__()
class AutoCompleteTree:
def __init__(self) -> None:
self.roots = [ Node(c) for c in string.ascii_lowercase ]
def get_root_letter_node(self, letter: str) -> Node:
for node in self.roots:
if node.letter == letter:
return node
def print_tree(self) -> None:
for root in self.roots:
if root.has_nodes():
print(root)
def store_input(self, text: str) -> None:
node = self.get_root_letter_node(text[0])
node.count += 1
node.add_node(Node(text[1:]))
def main():
tree = AutoCompleteTree()
while True:
text = input(">>> ")
tree.store_input(text)
tree.print_tree()
if __name__ == '__main__':
main()
1 Answer 1
All in all I think your implementation is well done. Clearly, if I didn't have at least a few comments, I wouldn't be taking the trouble to post an answer. Admittedly some (hopefully not all) of my comments could be considered "nitpicking."
Improved Data Structures
The Node.nodes
attribute is a list
when, as I believe you now know, making it a dict
that maps a letter to a Node
instance would permit faster searching. Likewise for the AutoCompleteTree.roots
attribute which should also be a dictionary that maps a letter to a Node
instance.
Type Hints
If you are going to do use type hints, there might be some room for improvement. Function __has_letter_node
is defined as follows:
def __has_letter_node(self, letter: str) -> object | None:
I believe ideally you would want to use -> Node | None
, but Node
has not been defined yet (also using operator |
is not supported on my Python 3.8.5). So the changes I had to make for this to run was to predefine an empty class Node
and then to redefine it. There may be a better, alternate method that I am not aware of:
from typing import Union # Needed for Python < 3.10
# So we can reference Node in class Node:
class Node:
pass
# Redefinition of Node:
class Node:
...
def has_letter_node(self, letter: str) -> Union[Node, None]:
But in fact, if we use the correct data structure for Node.nodes
, then we don't even really need this function, a call to which can be replaced with the expression self.nodes.get(letter)
, which will return either the Node
instance or None
.
Don't Do More Work Than Necessary
AutoCompleteTree.store_input
calls node.add_node(Node(text[1:]))
and in method Node.add_node
we have the recursive calls new_node = Node(node.letter[0])
and next_node.add_node(Node(node.letter[1:]))
. In both instances you are first constructing a Node
instance to be passed to add_node
but add_node
is only referencing the string held in node.letter
. There is not much point in constructing a Node
when we could simply be passing instead the string that would otherwise be stored in node.letter
. Also, I would suggest that a better name for the method is Node.add_nodes
because the method is called in general to add multiple nodes (this is one of the nitpicking recommendations!).
Remove Tail Recursion
Method Node.add_node
uses recursion of a special form, i.e. tail recursion. This is where the recursive calls are the final action performed in the function/method before it returns. With such functions iteration can easily replace recursion making for a more efficient implementation.
Some More Nitpicking Recommendations
Most of the attributes and methods used in these two classes are not to be accessed by client code and as such should have names starting with an underscore ('_') to suggest that these entities are "private."
You have taken a standard instance method,
__str__
and added an additional, nonstandard argument, level. My preference would be to instead define a new private method_to_string
:
def _to_string(self, level=0):
# Martijn Pieters: https://stackoverflow.com/a/20242504/8968906
the_string = "\t" * level + repr(self.letter) + str(self.count) + "\n"
for child in self.nodes.values():
the_string += child._to_string(level + 1)
return the_string
def __repr__(self) -> str:
return self._to_string()
Method __repr__
just delegates to the new method. We don't need to define a __str__
method since Node.__repr__
will be called in its absence. Note that I have also replaced variable name ret
with the_string
.
- Method
Node.has_nodes
could be coded as:
def has_nodes(self) -> bool:
return True if self.nodes else False
This works whether self.nodes
is a list or dictionary.
- Your
main
function loops indefinitely. You might want to test and terminate for empty input.
Suggestions Put Into Code
If you were to adopt all of the above suggestions, the code would look like the following (it may also assist you if some of my comments above were not as clear as I would have liked):
"""
Letter Frequency Tree. This populates a tree based on the users input, keeping track of words entered and the
frequency of each letter the user inputs.
@author Ben Antonellis
"""
import string
class _Node:
def __init__(self, letter: str) -> None:
self._letter = letter
self._count = 0
self._nodes = {}
def add_nodes(self, text: str) -> None:
current_node = self
# Check if the string is currently empty:
for letter in text:
# Determine if letter is already a child node
next_node = current_node._nodes.get(letter)
# If there is no child node for that letter
if next_node is None:
# We need to create a new node with the new letter
new_node = _Node(letter)
# Because there is no child, we increment the count by 1
new_node._count += 1
# Now we add the node to the nodes list
current_node._nodes[letter] = new_node
# Now we need to work with the rest of the letters on the new node, skipping the front letter
current_node = new_node
# If there is a child node present
else:
# We need to add the count to that letter
next_node._count += 1
# Now we add the next letters to the current node
current_node = next_node
def has_nodes(self) -> bool:
return True if self._nodes else False
def _to_string(self, level=0):
# Martijn Pieters: https://stackoverflow.com/a/20242504/8968906
the_string = "\t" * level + repr(self._letter) + str(self._count) + "\n"
for child in self._nodes.values():
the_string += child._to_string(level + 1)
return the_string
def __repr__(self) -> str:
return self._to_string()
class AutoCompleteTree:
def __init__(self) -> None:
self._roots = {c: _Node(c) for c in string.ascii_lowercase}
def _get_root_letter_node(self, letter: str) -> _Node:
return self._roots[letter]
def print_tree(self) -> None:
for root in self._roots.values():
if root.has_nodes():
print(root)
def store_input(self, text: str) -> None:
node = self._get_root_letter_node(text[0])
node._count += 1
node.add_nodes(text[1:])
def main():
tree = AutoCompleteTree()
while True:
text = input(">>> ")
if not text:
break
tree.store_input(text)
tree.print_tree()
if __name__ == '__main__':
main()
-
\$\begingroup\$
Node | None
is syntactic sugar forUnion[Node, None]
in type hints. And much more readable. See here: docs.python.org/3/library/typing.html#typing.Union \$\endgroup\$Cris Luengo– Cris Luengo2023年07月03日 02:59:17 +00:00Commented Jul 3, 2023 at 2:59 -
\$\begingroup\$ @CrisLuengo Thanks for the info. It first generated an error for me due to class
Node
not having been defined yet. That is why I first defined an empty classNode
and then redefined the class. Then I got: TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'. UsingUnion
resolved that for me (Python 3.8.5). \$\endgroup\$Booboo– Booboo2023年07月03日 09:30:30 +00:00Commented Jul 3, 2023 at 9:30
is None
and== None
why have you chosen the latter? Why have you chosenself.nodes
to be alist
not adict
? Why have you got a ton of comments inadd_node
but none in the other methods? \$\endgroup\$add_node
are essentially me thinking aloud, as I had the most trouble with that part (which could definitely be mentioned in the question, now that I think of it).self.nodes
is a list because it's the first collection I thought of to store child nodes. I forgot about usingis None
so that's definitely something I can fix on my end. \$\endgroup\$is None
and similar details. \$\endgroup\$