I implemented a trie in Python2.7 as a dictionary of dictionaries by extending the UserDict class to allow native dictionary access syntax and iteration.
For example, trie['stack']
produces a subtrie of all words that begin with 'stack', 'stack' in trie
checks for containment in the trie, and for word in trie.get_words('stack'):
iterates over all words that start with 'stack', among other features (github.com/sufyanAbbasi/dictrie):
from UserDict import UserDict
from collections import deque
class Dictrie(UserDict, object):
def __init__(self, *wordslists, **kwargs):
init_trie = kwargs.get('dict', {})
super(Dictrie, self).__init__(init_trie)
for words in wordslists:
self.build_trie(words)
# returns if word is a valid word in the dictionary
def is_word(self, word):
return word in self and ' ' in self[word]
# returns a generator to produce all words in the trie beginning with
# the root string from shortest to longest
def get_words(self, root):
queue = deque([root])
while queue:
curr_str = queue.popleft()
if not self[curr_str]:
yield curr_str.strip()
else:
queue.extend(curr_str + key for key in sorted(self[curr_str].iterkeys()))
# builds the trie given an iterator of words
def build_trie(self, words):
words = list(words)
for word in words:
self[' '] = word
#iterates over all the words in the dictionary
def __iter__(self):
queue = deque(sorted(self.iterkeys()))
while queue:
curr_str = queue.popleft()
if not self[curr_str]:
yield curr_str.strip()
else:
queue.extend(curr_str + key for key in sorted(self[curr_str].iterkeys()))
def __contains__(self, key):
sub_trie = self.data
for char in key:
if char in sub_trie:
sub_trie = sub_trie[char]
else:
return False
return True
def __getitem__(self, key):
sub_trie = self.data
for char in key:
if char in sub_trie:
sub_trie = sub_trie[char]
else:
raise KeyError(key)
return sub_trie
def __setitem__(self, key, item):
sub_trie = self.data
for char in item.strip() + ' ':
sub_trie = sub_trie.setdefault(char, {})
There are a few gripes that I have with my code that I'd like suggestions on, in order of priority:
Performance
The
get_word
and__iter__
generators use a BFS algorithm that repeats a lot of computation, however it has the advantage of being simple to read. As the trie is parsed in breadth-first style, the entire word is appended to the end of the queue instead of caching the current dictionary and starting there, so it repeats the computation. Is the efficient strategy to append a tuple containing the parsed word and a cached dictionary?lru_cache
decorator? Recursion, maybe? Here are the methods in question:def get_words(self, root): queue = deque([root]) while queue: curr_str = queue.popleft() if not self[curr_str]: yield curr_str.strip() else: queue.extend(curr_str + key for key in sorted(self[curr_str].iterkeys())) def __iter__(self): queue = deque(sorted(self.iterkeys())) while queue: curr_str = queue.popleft() if not self[curr_str]: yield curr_str.strip() else: queue.extend(curr_str + key for key in sorted(self[curr_str].iterkeys()))
Design
The
get_word
and__iter__
methods, are exactly the same BFS algorithm, except for the first line in which the initial queue inget_word
is the root string where__iter__
starts the queue with the first keys in the trie (so all words are iterated over). There's an opportunity for code reduction by setting one equal to the other, but I can't seem to come up with a clever way to do so, given that one takes in a parameter and the other does not.Accessing a subtrie such as
trie['stack']
returns a nested dictionary instead of a Dictrie object, is that a design flaw? As it is, it allows the user to employ their own trie traversal algorithms and create a Dictrie object if they need to, however it creates unexpected behavior when performing iteration on it such asfor words in trie['stack']
. Should indexing into the Dictrie class return a Dictrie or a dictionary of dictionaries? The weird thing is, they are almost exactly the same!What other useful features should a trie implementation have?
1 Answer 1
For the redundancy between get_words
and __iter__
, you could add another option to get_words
that allows directly passing a queue
(and requiring that exactly one of the two options is passed):
def get_words(self, root=None, queue=None):
# cannot pass both `root` and `queue`, but have to pass one of them
assert (root is not None) ^ (queue is not None)
if queue is None and root is not None:
queue = deque([root])
while queue:
curr_str = queue.popleft()
if not self[curr_str]:
yield curr_str.strip()
else:
queue.extend(curr_str + key
for key in sorted(self[curr_str].iterkeys()))
def __iter__(self):
for word in self.get_words(queue=deque(sorted(self.iterkeys()))):
yield word
-
\$\begingroup\$ I didn't think about XOR, sweet! This is interesting, I might go with this solution as it resolves the quandary. One edit I might add to this solution is to iterate on all words in the dictionary when no input is given to
get_words()
, which would mimic the__iter__
function. Thanks so much! I will mark this answered tomorrow if there are no more bites. \$\endgroup\$sufjan– sufjan2018年01月10日 00:51:08 +00:00Commented Jan 10, 2018 at 0:51 -
\$\begingroup\$ @sufjan Than you would not even need the second argument and XOR anymore and could even declare
__iter__ = get_words
. \$\endgroup\$Graipher– Graipher2018年01月10日 06:31:29 +00:00Commented Jan 10, 2018 at 6:31 -
\$\begingroup\$ In my effort to avoid using the keyword argument, I overlooked the simplicity that it would introduce. Thanks! \$\endgroup\$sufjan– sufjan2018年01月11日 00:25:08 +00:00Commented Jan 11, 2018 at 0:25
Explore related questions
See similar questions with these tags.