I'm currently playing with the Typeahead problem on Talentbuddy. There are already two questions on this subject (Typeahead Talent Buddy and Typeahead autocomplete functionality challenge), but neither of them uses a trie, which is why I created a new one.
Problem Statement
Given a big list of user names and a list of strings representing queries, your task is to:
- Write a function that prints to the standard output (stdout) for each query the user name that matches the query
- If there are multiple user names matching the query please select the one that is the smallest lexicographically
- All string matches must be case insensitive
- If no match is found for a given query please print "-1"
Note that your function will receive the following arguments:
- usernames: an array of strings representing the user names
- queries: an array of strings representing the queries described above
Data constraints
- The length of the arrays above will not exceed 100,000 entries
- Each name or query string will not exceed 30 characters
Efficiency constraints
- Your function is expected to print the requested result and return in less than 2 seconds
Example
names: ["james", "jBlank"] queries: ["j", "jm", "jbl", "JB"] Output: james -1 jBlank jBlank"
I quickly whipped up a brute-force solution which was far too slow (about 100 seconds on my laptop for one of the medium test data sets). Then I started researching and came to the conclusion that a trie is probably the best data structure for this. I first went with a generic trie, which improved the speed by about 100x. This got me a bit further in the tests, but the later tests use a few bigger data sets, and on the last one the algorithm is still too slow.
I've optimized the trie specifically for the problem, then profiled and fine-tuned the implementation, which got me another 100x speed improvement. However, my implementation still barely makes the 2-second time limit on my machine (1.1 seconds on average with CPython) and doesn't make it on the Talentbuddy server.
Current Implementation:
My algorithm works by first creating a prefix tree (trie) from the usernames. The trie is built from dicts, where a dict entry for a letter references its corresponding trie node (which is a dict itself). Each node also stores, under a special key ("_word"), the lexicographically smallest word matching the prefix used to reach the node.
Creating the trie works by traversing it letter by letter for each word and overwriting the "_word" mapping in each encountered node with the current word. Because the word list is sorted in reverse order, this ensures each node ends up storing the lexicographically smallest word with the prefix used to arrive there.
The trie is queried by traversing it using the letters of the query. The node we arrive at after the last letter stores the lexicographically smallest username matching the prefix.
class Trie(object):
    """Simple trie implementation for autocompletion.

    Build this trie once by instantiating it with a list of words.
    You can then get the smallest lexicographic match by calling the by_prefix
    method.

    Note: While storing text unaltered, the trie ignores case.
    """

    _word = "_word"

    def __init__(self, words):
        self.root = {}
        # presorting ensures every time we traverse a node the current word
        # will be lexicographically smaller, thus we can always replace it.
        words.sort(key=str.lower, reverse=True)
        for word in words:
            node = self.root
            for char in word.lower():
                node = node.setdefault(char, {})
                node[self._word] = word

    def by_prefix(self, prefix):
        """Return lexicographically smallest word that starts with a given
        prefix.
        """
        curr = self.root
        for char in prefix.lower():
            if char in curr:
                curr = curr[char]
            else:
                return "-1"
        else:
            return curr[self._word]


def typeahead(usernames, queries):
    """Given a list of users and queries,
    this function prints all valid users for these queries.

    Args:
        usernames: list of strings representing users.
        queries: list of strings representing (incomplete) input
    """
    users = Trie(usernames)
    print "\n".join(users.by_prefix(q) for q in queries)
Complexity
The worst-case complexity for a lookup in the trie is O(n), where n is the length of the query (which cannot exceed 30 characters).
Creating the trie is more expensive, at O(n*m), where n is the size of the username list and m the average length of a username.
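As a rough sanity check of the n*m term, the numbers from the profiler run shown below line up (a back-of-envelope computation, not a new measurement):

# The inner build loop's body is hit 565,946 times for 50,000 words
# (see the profile of __init__ below), so for that data set:
n = 50000
total_chars = 565946
print(total_chars / float(n))  # ~11.3 average characters per username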
Optimization
I have used the line profiler to find out where most time is spent. I have modified the typeahead function to be able to pinpoint the bottlenecks better:
Total time: 1.01042 s
File: typeahead.py
Function: __init__ at line 13

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    13                                           @profile
    14                                           def __init__(self, words):
    15         1            1      1.0      0.0      self.root = {}
    16                                               # presorting ensures every time we traverse a node the current word will
    17                                               # be lexicographically smaller, thus we can always replace it.
    18         1        38859  38859.0      3.8      words.sort(key=str.lower, reverse=True)
    19     50001        22553      0.5      2.2      for word in words:
    20     50000        20530      0.4      2.0          node = self.root
    21    615946       249450      0.4     24.7          for char in word.lower():
    22    565946       419829      0.7     41.6              node = node.setdefault(char, {})
    23    565946       259193      0.5     25.7          node[self._word] = word

Total time: 0.402196 s
File: typeahead.py
Function: by_prefix at line 25

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    25                                           @profile
    26                                           def by_prefix(self, prefix):
    27                                               """Return lexicographically smallest word that starts with a given
    28                                               prefix.
    29                                               """
    30     50000        19609      0.4      4.9      curr = self.root
    31    332685       130631      0.4     32.5      for char in prefix.lower():
    32    288221       114342      0.4     28.4          if char in curr:
    33    282685       112530      0.4     28.0              curr = curr[char]
    34                                                   else:
    35      5536         1950      0.4      0.5              return "-1"
    36                                               else:
    37     44464        23134      0.5      5.8          return curr[self._word]

Total time: 3.46204 s
File: typeahead.py
Function: typeahead at line 39

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    39                                           @profile
    40                                           def typeahead(usernames, queries):
    41                                               """Given a list of users and queries,
    42                                               this function prints all valid users for these queries.
    43
    44                                               Args:
    45                                                   usernames: list of strings representing users.
    46                                                   queries: list of strings representing (incomplete) input
    47                                               """
    48         1      1745292 1745292.0     50.4      users = Trie(usernames)
    49         1            3      3.0      0.0      results = (users.by_prefix(q) for q in queries)
    50         1       844213 844213.0     24.4      output = "\n".join(results)
    51         1       872537 872537.0     25.2      print output
Possible Bottlenecks
- Dictionary creation takes almost 40% of the cost of building the trie. Is there any other data structure in Python suited for fast lookups? In C I would use an array of pointers, but there is no such thing in Python, and from looking at the alternatives I gather that classes and named tuples also use a dict to perform member dereferencing. Would using a list and indexing it by (letter - ord('a')) be faster? (See the sketch after this list.)
- When performing lookups in the trie, most of the cost (almost 60%) comes from dictionary lookups. Solving bottleneck 1 would most likely also solve this one.
- 25% of the overall time is spent printing the output to the terminal. Unfortunately there is not much we can do about that. I already "optimized" this by making only one call to print, because it blocks on every newline... This is somewhat of a problem with the format Talentbuddy uses for output validation. However, I hope they at least don't use a terminal on the validation machine but redirect stdout.
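For what it's worth, here is a minimal sketch of the list-based idea from bottleneck 1 (which would also address bottleneck 2). The names WORD_SLOT, make_node and by_prefix_list are mine, not part of the implementation above, and whether this actually beats dicts in CPython would need measuring:

WORD_SLOT = 26  # the 27th slot plays the role of the "_word" dict key

def make_node():
    # 26 child slots for 'a'..'z' plus one slot for the stored word
    return [None] * 27

def insert(root, word):
    # Assumes words are fed in reverse-sorted order, as in Trie.__init__,
    # so overwriting always leaves the lexicographically smallest word behind.
    node = root
    for char in word.lower():
        idx = ord(char) - ord('a')
        if node[idx] is None:
            node[idx] = make_node()
        node = node[idx]
        node[WORD_SLOT] = word

def by_prefix_list(root, prefix):
    node = root
    for char in prefix.lower():
        node = node[ord(char) - ord('a')]
        if node is None:
            return "-1"
    return node[WORD_SLOT]

List indexing avoids hashing, but every node now eagerly allocates 27 slots, so the build phase may not get any cheaper.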
Do you have any idea how I can further speed this up? I expect it needs to run at least 2x faster to pass on the Talentbuddy machine.
3 Answers
Do it the simple way: make the usernames into a list of tuples (username.lower(), username). Then sort this list; the usernames are now ordered lexicographically. In your example you'd get:

index = [("james", "james"),
         ("jblank", "jBlank")]
Now use the bisect module: bisect_left finds the leftmost entry in the list whose lowercase username is greater than or equal to the lowercase prefix (the find_ge variant of the search recipes on the documentation page):

i = bisect_left(index, (prefix, ''))
if 0 <= i < len(index):
    found = index[i]
    if found[0].startswith(prefix):
        return found[1]
return '-1'

The value found is a tuple ('username', 'UserName'); now just check found[0].startswith(prefix): if it matches, the answer is found[1]; if it does not, give -1.
Adapting this into a class one gets:
from bisect import bisect_left

names = ["james", "jBlank"]
queries = ["j", "jm", "jbl", "JB"]

class Index(object):
    def __init__(self, words):
        self.index = [(w.lower(), w) for w in words]
        self.index.sort()

    def by_prefix(self, prefix):
        """Return lexicographically smallest word that starts with a given
        prefix.
        """
        prefix = prefix.lower()
        i = bisect_left(self.index, (prefix, ''))
        if 0 <= i < len(self.index):
            found = self.index[i]
            if found[0].startswith(prefix):
                return found[1]
        return '-1'

def typeahead(usernames, queries):
    """Given a list of users and queries,
    this function prints all valid users for these queries.

    Args:
        usernames: list of strings representing users.
        queries: list of strings representing (incomplete) input
    """
    users = Index(usernames)
    print("\n".join(users.by_prefix(q) for q in queries))

typeahead(names, queries)
This is guaranteed to be faster than your Trie method: the initial overhead comes from creating a tuple for each entry, which is still less burdensome than your solution having to create O(n) dictionaries, and the lookup is faster too.
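One way to verify that claim is to time the lookups alone, building the index once outside the timed code; a sketch, assuming the usernames and queries lists are already loaded:

import timeit

users = Index(usernames)  # build once; don't recreate it per measurement
t = timeit.timeit(lambda: [users.by_prefix(q) for q in queries], number=10)
print(t / 10)  # average seconds for one pass over all queries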
If this is not fast enough, then instead of a trie you can store all prefixes in a single dictionary!
class Index(object):
    def __init__(self, words):
        index = {}
        for w in sorted(words, key=str.lower, reverse=True):
            lw = w.lower()
            for i in range(1, len(lw) + 1):
                index[lw[:i]] = w
        self.index = index

    def by_prefix(self, prefix):
        """Return lexicographically smallest word that starts with a given
        prefix.
        """
        return self.index.get(prefix.lower(), '-1')
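As a quick usage sketch (my example, using the question's data), this version behaves the same as the bisect one:

idx = Index(["james", "jBlank"])
print(idx.by_prefix("j"))   # james
print(idx.by_prefix("JB"))  # jBlank
print(idx.by_prefix("jm"))  # -1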
- "This is a very cool solution to the problem; I like the simplicity. However, it is still not fast enough. When removing the print statement and running it 100 times, it takes an average of 168 ms on my machine. My trie takes 238 ms under the same conditions. So while it is about 30% faster, it still doesn't pass on the Talentbuddy machine. I wonder if it's actually possible to solve this problem much faster in Python. I think the trie could still be much faster if it wasn't using dicts as nodes. However, I can't think of a better alternative." (driest, Sep 17, 2014 at 13:12)
- "You did run all the queries on one already generated data structure, yes? And no, the trie cannot be much faster; bisect has a constant-ish branching factor of 17 for 100,000 usernames, while your trie depends on the number of characters..." (Antti Haapala, Sep 18, 2014 at 4:37)
- "I cannot think of anything that would be faster for lookups than my dictionary approach, yet it would be easier to generate than your trie. Even when I tried to micro-optimize the creation loop it did not make a significant difference." (Antti Haapala, Sep 18, 2014 at 5:06)
- "I called the typeahead function 100 times from IPython with the %timeit function. Since typeahead generates the data structure for each call, I think the benchmark is accurate. I'm not saying a trie is easier than your approach; actually I really like the simplicity of it. However, I think an efficient trie implementation could be faster. As you stated before, lookups in a trie depend on the size of the query, which has an average length of 6 in the test data set. Bisecting the list takes 17 operations for each query, so it should be slower." (driest, Sep 18, 2014 at 16:06)
- "Hem, just noticed that you said 'typeahead generates the datastructure for each call I think the benchmark is accurate'; of course you should not recreate the data structure for each lookup, that is not what lookup means..." (Antti Haapala, Oct 6, 2014 at 4:13)
Suppose you revert to a C-style array of pointers, except by pointers I mean 32-bit integer indices, and by array I mean the built-in array.array type, initialized to all zeroes, which will be your null pointer.
Use a separate dict to store correctly spelled (i.e., capitalized) words. You can use the node index values to key the dictionary.
The result of this should be a single data structure, accessed using integer indices, that you can treat in a trie-like manner. It should be fairly quick to generate, and to traverse. It will NOT be any kind of memory efficient, however. If you're doing this for a gaming site, I suspect that some of the test cases will be enormously long, so this might be an issue.
(NB: I haven't run or tested this code.)
import array
import string

NODE_SIZE = 26

# One flat array holds all nodes; each node is NODE_SIZE consecutive slots.
# Size this to the data: the trie needs roughly one node per distinct prefix.
Data = array.array('I', [0]) * 4_000_000
Last_alloc = 0
Spelling = dict()  # node index -> correctly spelled word


def alloc_node():
    """Return first index of a newly allocated trie node."""
    global Last_alloc
    alloc = Last_alloc
    Last_alloc += NODE_SIZE
    return alloc


Root = alloc_node()


def lowercase_offsets(s, lc=string.ascii_lowercase):
    """Convert a string to offsets from 'a' of its lowercase version:
    'Cat' -> [2, 0, 19]
    """
    return [lc.index(ch) for ch in s.lower()]


def insert(word):
    """Insert a word into the trie."""
    node = Root
    for offset in lowercase_offsets(word):
        # Here, node points to the start of the 26 pointers for children
        # a..z of this node in the trie.
        node += offset
        if Data[node] == 0:
            Data[node] = alloc_node()
        node = Data[node]
    # Here node points to the start of the node of the last letter in word.
    Spelling[node] = word


def query(q):
    """Return the lexicographically smallest word starting with the query
    string, or None.
    """
    node = Root
    for offset in lowercase_offsets(q):
        node += offset
        if Data[node] == 0:
            return None
        node = Data[node]
    # Here we have reached the end of the query. Descend along the first
    # non-empty child pointer until we hit a node that spells a valid word
    # (a shorter word is always lexicographically smaller than its extensions).
    while node not in Spelling:
        # TODO: Check if a slice on Data[node:node + 26] is faster
        node = Data[next(i for i in range(node, node + 26) if Data[i] != 0)]
    return Spelling[node]
There might be two problems with code like this. First, it ignores case during the query word search, so "jB" will come after "ja" rather than before; this isn't "asciibetical". Second, it will merge words that differ only in their capitalization: "jAmes" and "jaMes" point to the same place in the trie, so whichever one gets inserted last will win.
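The second problem could be mitigated with a small change (my tweak, not part of the answer above): only overwrite a node's spelling when the new case variant is lexicographically smaller, making this a drop-in replacement for insert:

def insert(word):
    """Insert a word, keeping the smallest spelling among case variants."""
    node = Root
    for offset in lowercase_offsets(word):
        node += offset
        if Data[node] == 0:
            Data[node] = alloc_node()
        node = Data[node]
    # Changed part: don't let the last insertion win unconditionally.
    if node not in Spelling or word < Spelling[node]:
        Spelling[node] = word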
You can place usernames and queries in the same array, and sort it:
L = sorted([(u.lower(), u) for u in usernames] + [(q.lower(), None) for q in queries])
Then just iterate backwards; for each query, the match candidate is the last username seen.
res = {}
candidate_lower, candidate = '', ''
for x, y in reversed(L):
    if y is None:  # this is a query
        if candidate_lower.startswith(x):
            res[x] = candidate
        else:
            res[x] = '-1'
    else:  # this is a username
        candidate_lower, candidate = x, y
With an extra trick to avoid dictionaries completely, you get:
def typeahead(usernames, queries):
    L = sorted(
        [(u.lower(), i) for i, u in enumerate(usernames)] +
        [(q.lower(), -i) for i, q in enumerate(reversed(queries), 1)],
        reverse=True)
    candidate_lower, candidate_idx = '', 0
    for x, i in L:
        if i < 0:  # this is a query
            queries[i] = usernames[candidate_idx] if candidate_lower.startswith(x) else '-1'
        else:  # this is a username
            candidate_lower, candidate_idx = x, i
    print('\n'.join(queries))
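For instance, running this on the question's example data (note that it overwrites queries in place before printing):

typeahead(["james", "jBlank"], ["j", "jm", "jbl", "JB"])
# prints:
# james
# -1
# jBlank
# jBlank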
The complexity is asymptotically \$O\big((U+Q)\log(U+Q)\big)\$, where U is the number of usernames and Q the number of queries, but using a simple data structure keeps the hidden constants small.