I've been working on implementing a Trie in Python for educational purposes. I tried implementing one using dictionaries and I was successful.
The structure of a trie with the words in
, inn
, inner
, innerr
would be as follows:
[['i', [['n', [['n', [['e', [['r', [['r', [], 'end']], 'end']]]], 'end']], 'end']]]]
where end
indicates the end of a word.
class TrieException(Exception):
pass
def add_word(word, trie):
END = "end"
if word == '':
raise TrieException("word empty")
prev = None
branch = trie
for i, c in enumerate(word):
found = False
for each in branch:
if each[0] == c:
if i == len(word)-1:
if len(each) > 2:
raise TrieException("Word already present")
else:
each.append(END)
prev = branch
branch = each[1]
found = True
break
if not found:
nb = []
if i == len(word)-1:
branch.append([c, nb, END])
else:
branch.append([c, nb])
branch = nb
def search_word(word, trie):
if word == '':
raise TrieException("empty word")
branch = trie
for i, c in enumerate(word):
found = False
for each in branch:
if each[0] == c:
found = True
branch = each[1]
if i == len(word)-1:
if len(each) <= 2:
raise TrieException("Word not found")
break
if not found:
raise TrieException("Word not found")
Do you have any suggestions on how I can do this in a cleaner fashion?
2 Answers 2
Don't use generic-yet-novel exceptions like TrieException
- it tells you nothing you didn't already know. Rather, reuse standard exceptions to give the user some way of distinguishing errors based off of meaning:
raise ValueError("word empty")
raise KeyError("Word already present")
raise ValueError("empty word")
raise KeyError("Word not found")
Note that your formatting is not standardized; some start with an uppercase and other identical errors have the word-order switched. This should be fixed:
raise ValueError("inserting empty word")
raise KeyError("word already present")
raise KeyError("word not found")
Note that a trie is basically a set
, so you should probably try and copy the set
interface. As such, there's little reason to not support having empty words and one would expect adding a key that already exists to be a silent operation (no error thrown).
Your version doesn't seem to offer support for empty strings, so let's re-examine. There are two common ways of supporting empty strings. The first is a format where the nodes have a boolean tagging whether they are terminal. This would look like:
[False, · ]
|
(a,·)
|
[False, · ]
|
(n,·)
|
[False, · , · ]
/ \
(a,·) (n,·)
/ \
[True] [True, · , · ]
/ \
(a,·) (e,·)
/ \
[True] [True]
to store [ana, ann, anna, anne]
.
An alternative is doing away with those altogether and storing strings of the form [ana,ドル ann,ドル anna,ドル anne$]
where the $
is a not a character but an end-of-string marker. This would give:
[ · ]
|
(a,·)
|
[ · ]
|
(n,·)
|
[ · , · ]
/ \
(a,·) (n,·)
/ \
[ · ] [ · , · , · ]
/ / | \
(,ドル·) (,ドル·) (a,.) (e,·)
/ / | \
[ ] [ ] [ · ] [ · ]
| \
(,ドル·) (,ドル·)
| |
[ ] [ ]
This is perhaps the one you were trying to do with your 'end'
string, but yours looks like
[ · ]
|
[a,·]
|
[ · ]
|
[n,·]
|
[ · , · ]
/ \
(a,·,$) (n,·,$)
/ \
[ ] [ . , · ]
/ \
(a,·,$) (e,·,$)
/ \
[ ] [ ]
Note that you put the end marker in the edge with the previous character, not on its own. This adds more complexity to the design over the other options, and doesn't allow containing an empty string.
So let's reconsider the code
def add_word(word, trie):
END = "end"
Note that UPPER_CASE
is generally for global constants, which this should really be.
if word == '':
raise ValueError("inserting empty word")
We can discard this now, but note that not word
would be more idiomatic.
prev = None
branch = trie
Note that current_branch
or subtrie
would be more descriptive.
for i, c in enumerate(word):
You don't need to check if i == len(word)-1
if you're just adding a $
on the end. However, you don't need that anyway due to the else
block for loops:
for i in j:
if p(i):
break
else:
print("Didn't break")
Overall it looks like
def add_word(word, trie):
subtrie = trie
for c in word:
for each in subtrie:
if each[0] == c:
subtrie = each[1]
break
else:
nb = []
subtrie.append([c, nb])
subtrie = nb
for each in subtrie:
if each[0] == END:
break
else:
subtrie.append([END, []])
You can use unpacking to get
...
for edge, node in subtrie:
if edge == c:
subtrie = node
break
...
for edge, _ in subtrie:
if edge == END:
break
...
You can use any
for the later:
if not any(edge == END for edge, _ in subtrie):
subtrie.append([END, []])
and you can even use next
for the former:
try:
subtrie = next(node for edge, node in subtrie if edge == c)
except StopIteration:
nb = []
subtrie.append([c, nb])
subtrie = nb
It'd be a neat idea to actually use dictionaries here, but if not one can at least use sorting and bisect
. This would be better if END
sorted nicely, so I'll change it to the empty string. This gives
from bisect import bisect_left
END = ''
def add_word(word, trie):
subtrie = trie
for char in word:
index = bisect_left(subtrie, [char])
if index < len(subtrie) and subtrie[index][0] == char:
subtrie = subtrie[index][1]
else:
new_branch = []
subtrie.insert(index, [char, new_branch])
subtrie = new_branch
if not (subtrie and subtrie[0][0] == END):
subtrie.append([END, []])
search_word
should really be called assert_word
; it doesn't actually have an interface optimized for searching for words. I suppose you actually wanted something akin to __contains__
, so contains_word
would be appropriate if it actually returned its value.
The first lot of simplifications leads to
def contains_word(word, trie):
subtrie = trie
for char in word:
for edge, node in subtrie:
if edge == char:
subtrie = node
break
else:
return False
return subtrie and subtrie[0][0] == END
and using bisect
gets
def contains_word(word, trie):
subtrie = trie
for char in word:
index = bisect_left(subtrie, [char])
if not (index < len(subtrie) and subtrie[index][0] == char):
return False
subtrie = subtrie[index][1]
return subtrie and subtrie[0][0] == END
You should consider making trie
the first argument - it's a self
-like argument so normally should be at the start.
Finally, consider replacing the edges with tuples, since their length is always 2
.
Exceptions are for exceptional (read 'rare','unexpected') edge cases
You are using exceptions as the default behaviour, this is not nice, instead functions should return values, remove TrieException("word not find")
from search_word(word, trie)
and instead return True if you find the word and False if you do not find it.
Confusing alias
branch = trie
You just confuse the reader, each thing should have one name only.
-
\$\begingroup\$ I disagree with both explanations; Python has a well known EAFTP preference and aliases are fine. I agree with these in context, though; I just don't think the explanations make it clear why. \$\endgroup\$Veedrac– Veedrac2015年03月29日 15:21:01 +00:00Commented Mar 29, 2015 at 15:21
-
\$\begingroup\$ @Veedrac I am interested in your comment? How should I change the explanations in your opinion? \$\endgroup\$Caridorc– Caridorc2015年03月30日 20:14:09 +00:00Commented Mar 30, 2015 at 20:14
-
1\$\begingroup\$ For the first point, I would mention what you should do (have a well-defined interface that matches the name - use errors on failure to do that task). For the second, after doing a proper read of the code I think aliasing is appropriate, but the name it's aliased to is bad. I make similar points in my answer ("Note that
current_branch
orsubtrie
would be more descriptive." and "search_word
should really be calledassert_word
"). \$\endgroup\$Veedrac– Veedrac2015年03月31日 11:05:02 +00:00Commented Mar 31, 2015 at 11:05