Implementing a Trie in Python using lists alone

Question 1

I've been working on implementing a Trie in Python for educational purposes. I tried implementing one using dictionaries and I was successful.

The structure of a trie with the words in, inn, inner, innerr would be as follows:

[['i', [['n', [['n', [['e', [['r', [['r', [], 'end']], 'end']]]], 'end']], 'end']]]]

where end indicates the end of a word.

class TrieException(Exception):
 pass
def add_word(word, trie):
 END = "end"
 if word == '':
 raise TrieException("word empty")
 prev = None
 branch = trie
 for i, c in enumerate(word):
 found = False
 for each in branch:
 if each[0] == c:
 if i == len(word)-1:
 if len(each) > 2:
 raise TrieException("Word already present")
 else:
 each.append(END)
 prev = branch
 branch = each[1]
 found = True
 break
 if not found:
 nb = []
 if i == len(word)-1:
 branch.append([c, nb, END])
 else:
 branch.append([c, nb])
 branch = nb
def search_word(word, trie):
 if word == '':
 raise TrieException("empty word")
 branch = trie
 for i, c in enumerate(word):
 found = False
 for each in branch:
 if each[0] == c:
 found = True
 branch = each[1]
 if i == len(word)-1:
 if len(each) <= 2:
 raise TrieException("Word not found")
 break
 if not found:
 raise TrieException("Word not found")

Do you have any suggestions on how I can do this in a cleaner fashion?

Question 2

Don't use generic-yet-novel exceptions like TrieException - it tells you nothing you didn't already know. Rather, reuse standard exceptions to give the user some way of distinguishing errors based off of meaning:

raise ValueError("word empty")
raise KeyError("Word already present")
raise ValueError("empty word")
raise KeyError("Word not found")

Note that your formatting is not standardized; some start with an uppercase and other identical errors have the word-order switched. This should be fixed:

raise ValueError("inserting empty word")
raise KeyError("word already present")
raise KeyError("word not found")

Note that a trie is basically a set, so you should probably try and copy the set interface. As such, there's little reason to not support having empty words and one would expect adding a key that already exists to be a silent operation (no error thrown).

Your version doesn't seem to offer support for empty strings, so let's re-examine. There are two common ways of supporting empty strings. The first is a format where the nodes have a boolean tagging whether they are terminal. This would look like:

 [False, · ]
 |
 (a,·)
 |
 [False, · ]
 |
 (n,·)
 |
 [False, · , · ]
 / \
 (a,·) (n,·)
 / \
 [True] [True, · , · ]
 / \
 (a,·) (e,·)
 / \
 [True] [True]

to store [ana, ann, anna, anne].

An alternative is doing away with those altogether and storing strings of the form [ana,ドル ann,ドル anna,ドル anne$] where the $ is a not a character but an end-of-string marker. This would give:

 [ · ]
 |
 (a,·)
 |
 [ · ]
 |
 (n,·)
 |
 [ · , · ]
 / \
 (a,·) (n,·)
 / \
 [ · ] [ · , · , · ]
 / / | \
 (,ドル·) (,ドル·) (a,.) (e,·)
 / / | \
 [ ] [ ] [ · ] [ · ]
 | \ 
 (,ドル·) (,ドル·)
 | |
 [ ] [ ]

This is perhaps the one you were trying to do with your 'end' string, but yours looks like

 [ · ]
 |
 [a,·]
 |
 [ · ]
 |
 [n,·]
 |
 [ · , · ]
 / \
 (a,·,$) (n,·,$)
 / \
 [ ] [ . , · ]
 / \
 (a,·,$) (e,·,$)
 / \
 [ ] [ ]

Note that you put the end marker in the edge with the previous character, not on its own. This adds more complexity to the design over the other options, and doesn't allow containing an empty string.

So let's reconsider the code

def add_word(word, trie):
 END = "end"

Note that UPPER_CASE is generally for global constants, which this should really be.

 if word == '':
 raise ValueError("inserting empty word")

We can discard this now, but note that not word would be more idiomatic.

 prev = None
 branch = trie

Note that current_branch or subtrie would be more descriptive.

 for i, c in enumerate(word):

You don't need to check if i == len(word)-1 if you're just adding a $ on the end. However, you don't need that anyway due to the else block for loops:

for i in j:
 if p(i):
 break
else:
 print("Didn't break")

Overall it looks like

def add_word(word, trie):
 subtrie = trie
 for c in word:
 for each in subtrie:
 if each[0] == c:
 subtrie = each[1]
 break
 else:
 nb = []
 subtrie.append([c, nb])
 subtrie = nb
 for each in subtrie:
 if each[0] == END:
 break
 else:
 subtrie.append([END, []])

You can use unpacking to get

...
for edge, node in subtrie:
 if edge == c:
 subtrie = node
 break
...
for edge, _ in subtrie:
 if edge == END:
 break
...

You can use any for the later:

if not any(edge == END for edge, _ in subtrie):
 subtrie.append([END, []])

and you can even use next for the former:

try:
 subtrie = next(node for edge, node in subtrie if edge == c)
except StopIteration:
 nb = []
 subtrie.append([c, nb])
 subtrie = nb

It'd be a neat idea to actually use dictionaries here, but if not one can at least use sorting and bisect. This would be better if END sorted nicely, so I'll change it to the empty string. This gives

from bisect import bisect_left
END = ''
def add_word(word, trie):
 subtrie = trie
 for char in word:
 index = bisect_left(subtrie, [char])
 if index < len(subtrie) and subtrie[index][0] == char:
 subtrie = subtrie[index][1]
 else:
 new_branch = []
 subtrie.insert(index, [char, new_branch])
 subtrie = new_branch
 if not (subtrie and subtrie[0][0] == END):
 subtrie.append([END, []])

search_word should really be called assert_word; it doesn't actually have an interface optimized for searching for words. I suppose you actually wanted something akin to __contains__, so contains_word would be appropriate if it actually returned its value.

The first lot of simplifications leads to

def contains_word(word, trie):
 subtrie = trie
 for char in word:
 for edge, node in subtrie:
 if edge == char:
 subtrie = node
 break
 else:
 return False
 return subtrie and subtrie[0][0] == END

and using bisect gets

def contains_word(word, trie):
 subtrie = trie
 for char in word:
 index = bisect_left(subtrie, [char])
 if not (index < len(subtrie) and subtrie[index][0] == char):
 return False
 subtrie = subtrie[index][1]
 return subtrie and subtrie[0][0] == END

You should consider making trie the first argument - it's a self-like argument so normally should be at the start.

Finally, consider replacing the edges with tuples, since their length is always 2.

Question 3

Exceptions are for exceptional (read 'rare','unexpected') edge cases

You are using exceptions as the default behaviour, this is not nice, instead functions should return values, remove TrieException("word not find") from search_word(word, trie) and instead return True if you find the word and False if you do not find it.

Confusing alias

branch = trie

You just confuse the reader, each thing should have one name only.

Question 4

I disagree with both explanations; Python has a well known EAFTP preference and aliases are fine. I agree with these in context, though; I just don't think the explanations make it clear why.

Question 5

@Veedrac I am interested in your comment? How should I change the explanations in your opinion?

Question 6

For the first point, I would mention what you should do (have a well-defined interface that matches the name - use errors on failure to do that task). For the second, after doing a proper read of the code I think aliasing is appropriate, but the name it's aliased to is bad. I make similar points in my answer ("Note that current_branch or subtrie would be more descriptive." and "search_word should really be called assert_word").

Veedrac Veedrac 9,77323 silver badges38 bronze badges · Accepted Answer · 2015-03-30 12:43:07Z

Don't use generic-yet-novel exceptions like TrieException - it tells you nothing you didn't already know. Rather, reuse standard exceptions to give the user some way of distinguishing errors based off of meaning:

raise ValueError("word empty")
raise KeyError("Word already present")
raise ValueError("empty word")
raise KeyError("Word not found")

Note that your formatting is not standardized; some start with an uppercase and other identical errors have the word-order switched. This should be fixed:

raise ValueError("inserting empty word")
raise KeyError("word already present")
raise KeyError("word not found")

Note that a trie is basically a set, so you should probably try and copy the set interface. As such, there's little reason to not support having empty words and one would expect adding a key that already exists to be a silent operation (no error thrown).

Your version doesn't seem to offer support for empty strings, so let's re-examine. There are two common ways of supporting empty strings. The first is a format where the nodes have a boolean tagging whether they are terminal. This would look like:

 [False, · ]
 |
 (a,·)
 |
 [False, · ]
 |
 (n,·)
 |
 [False, · , · ]
 / \
 (a,·) (n,·)
 / \
 [True] [True, · , · ]
 / \
 (a,·) (e,·)
 / \
 [True] [True]

to store [ana, ann, anna, anne].

An alternative is doing away with those altogether and storing strings of the form [ana,ドル ann,ドル anna,ドル anne$] where the $ is a not a character but an end-of-string marker. This would give:

 [ · ]
 |
 (a,·)
 |
 [ · ]
 |
 (n,·)
 |
 [ · , · ]
 / \
 (a,·) (n,·)
 / \
 [ · ] [ · , · , · ]
 / / | \
 (,ドル·) (,ドル·) (a,.) (e,·)
 / / | \
 [ ] [ ] [ · ] [ · ]
 | \ 
 (,ドル·) (,ドル·)
 | |
 [ ] [ ]

This is perhaps the one you were trying to do with your 'end' string, but yours looks like

 [ · ]
 |
 [a,·]
 |
 [ · ]
 |
 [n,·]
 |
 [ · , · ]
 / \
 (a,·,$) (n,·,$)
 / \
 [ ] [ . , · ]
 / \
 (a,·,$) (e,·,$)
 / \
 [ ] [ ]

Note that you put the end marker in the edge with the previous character, not on its own. This adds more complexity to the design over the other options, and doesn't allow containing an empty string.

So let's reconsider the code

def add_word(word, trie):
 END = "end"

Note that UPPER_CASE is generally for global constants, which this should really be.

 if word == '':
 raise ValueError("inserting empty word")

We can discard this now, but note that not word would be more idiomatic.

 prev = None
 branch = trie

Note that current_branch or subtrie would be more descriptive.

 for i, c in enumerate(word):

You don't need to check if i == len(word)-1 if you're just adding a $ on the end. However, you don't need that anyway due to the else block for loops:

for i in j:
 if p(i):
 break
else:
 print("Didn't break")

Overall it looks like

def add_word(word, trie):
 subtrie = trie
 for c in word:
 for each in subtrie:
 if each[0] == c:
 subtrie = each[1]
 break
 else:
 nb = []
 subtrie.append([c, nb])
 subtrie = nb
 for each in subtrie:
 if each[0] == END:
 break
 else:
 subtrie.append([END, []])

You can use unpacking to get

...
for edge, node in subtrie:
 if edge == c:
 subtrie = node
 break
...
for edge, _ in subtrie:
 if edge == END:
 break
...

You can use any for the later:

if not any(edge == END for edge, _ in subtrie):
 subtrie.append([END, []])

and you can even use next for the former:

try:
 subtrie = next(node for edge, node in subtrie if edge == c)
except StopIteration:
 nb = []
 subtrie.append([c, nb])
 subtrie = nb

It'd be a neat idea to actually use dictionaries here, but if not one can at least use sorting and bisect. This would be better if END sorted nicely, so I'll change it to the empty string. This gives

from bisect import bisect_left
END = ''
def add_word(word, trie):
 subtrie = trie
 for char in word:
 index = bisect_left(subtrie, [char])
 if index < len(subtrie) and subtrie[index][0] == char:
 subtrie = subtrie[index][1]
 else:
 new_branch = []
 subtrie.insert(index, [char, new_branch])
 subtrie = new_branch
 if not (subtrie and subtrie[0][0] == END):
 subtrie.append([END, []])

search_word should really be called assert_word; it doesn't actually have an interface optimized for searching for words. I suppose you actually wanted something akin to __contains__, so contains_word would be appropriate if it actually returned its value.

The first lot of simplifications leads to

def contains_word(word, trie):
 subtrie = trie
 for char in word:
 for edge, node in subtrie:
 if edge == char:
 subtrie = node
 break
 else:
 return False
 return subtrie and subtrie[0][0] == END

and using bisect gets

def contains_word(word, trie):
 subtrie = trie
 for char in word:
 index = bisect_left(subtrie, [char])
 if not (index < len(subtrie) and subtrie[index][0] == char):
 return False
 subtrie = subtrie[index][1]
 return subtrie and subtrie[0][0] == END

You should consider making trie the first argument - it's a self-like argument so normally should be at the start.

Finally, consider replacing the edges with tuples, since their length is always 2.

Stack Exchange Network

Implementing a Trie in Python using lists alone

2 Answers 2

Exceptions are for exceptional (read 'rare','unexpected') edge cases

Confusing alias

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Implementing a Trie in Python using lists alone

2 Answers 2

Exceptions are for exceptional (read 'rare','unexpected') edge cases

Confusing alias

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions