I have been trying to solve this HackerRank problem:
We're going to make our own Contacts application! The application must perform two types of operations:
add name, where name is a string denoting a contact name. This must store name as a new contact in the application.
find partial, where partial is a string denoting a partial name to search the application for. It must count the number of contacts starting with partial and print the count on a new line.
Given sequential add and find operations, perform each operation in order.
What I came up with is this:
class TrieNode():
    def __init__(self):
        self.children = {}
        self.word_count = 0

class Trie():
    def __init__(self):
        self.root = self.get_node()

    def get_node(self):
        return TrieNode()

    def add(self, key):
        crawl = self.root
        for char in key:
            if char not in crawl.children:
                crawl.children[char] = self.get_node()
            crawl = crawl.children[char]
            crawl.word_count += 1

    def get_prefix(self, key):
        crawl = self.root
        for char in key:
            if char not in crawl.children:
                return None
            crawl = crawl.children[char]
        return crawl

    def find(self, key):
        node = self.get_prefix(key)
        return node.word_count if node else 0
The code above is a modification of this from GeeksForGeeks. The modifications I have made are keeping a word_count for every trie node, which denotes how many words start with the prefix ending in that node's letter, and using a hashmap to keep the children rather than lists.
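To show the intended behaviour, here is a minimal usage sketch (the class definition is repeated from above so the snippet runs on its own; the sample names are made up):

```python
class TrieNode():
    def __init__(self):
        self.children = {}
        self.word_count = 0

class Trie():
    def __init__(self):
        self.root = self.get_node()

    def get_node(self):
        return TrieNode()

    def add(self, key):
        crawl = self.root
        for char in key:
            if char not in crawl.children:
                crawl.children[char] = self.get_node()
            crawl = crawl.children[char]
            # every word passing through this node shares this prefix
            crawl.word_count += 1

    def get_prefix(self, key):
        crawl = self.root
        for char in key:
            if char not in crawl.children:
                return None
            crawl = crawl.children[char]
        return crawl

    def find(self, key):
        node = self.get_prefix(key)
        return node.word_count if node else 0

trie = Trie()
trie.add("hack")
trie.add("hackerrank")
print(trie.find("hac"))  # 2: both names start with "hac"
print(trie.find("hak"))  # 0: no contact starts with "hak"
```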
On HackerRank, 9/11 test cases pass, and the other 2 time out. I can't figure out where I can optimize this further.
I know of the radix tree, which is an optimized version of a normal trie, but that saves on space rather than time, so I haven't explored that option.
Can anyone help me figure out where I can optimize for time in my code?
Thanks!
- \$\begingroup\$ You should always include the programming language tag on the question. \$\endgroup\$ – dfhwze, Aug 25, 2019 at 8:59
1 Answer
I think the important thing to notice here is that you don't actually need a Trie. The partial queries are not just substrings of the names, they are prefixes. It should therefore be faster to just build a dictionary which directly counts how often each prefix occurs.
Something as simple as this is sufficient (just tested it, it passes all testcases):
from collections import defaultdict

def contacts(queries):
    prefixes = defaultdict(int)
    for op, name in queries:
        if op == "add":
            # maybe not the best/nicest way to iterate over slices,
            # but probably the easiest
            for i in range(1, len(name) + 1):
                prefixes[name[:i]] += 1
        else:
            # gives 0 if name is not in prefixes, no need for `get`
            yield prefixes[name]
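To make the behaviour concrete, here is a small self-contained driver (the function is repeated from above so the snippet runs on its own; the query list is made up):

```python
from collections import defaultdict

def contacts(queries):
    prefixes = defaultdict(int)
    for op, name in queries:
        if op == "add":
            # count every prefix of the added name
            for i in range(1, len(name) + 1):
                prefixes[name[:i]] += 1
        else:
            # defaultdict gives 0 for unseen prefixes
            yield prefixes[name]

# "hac" is a prefix of both names, "hak" of neither
queries = [("add", "hack"), ("add", "hackerrank"),
           ("find", "hac"), ("find", "hak")]
print(list(contacts(queries)))  # [2, 0]
```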
This is \$\mathcal{O}(len(name))\$ for each add and \$\mathcal{O}(1)\$ for each find. And since the length of all names is less than 22, the for loop will be very fast. It does not even seem to matter that slicing the string might create additional copies.
In contrast, your code is also \$\mathcal{O}(len(name))\$ for each add, but it is also \$\mathcal{O}(len(name))\$ for each find, as far as I can tell. You need to actually traverse the Trie to find the count. Apparently that, maybe together with the overhead of looking up class methods, is enough to reach the time limit for one of the larger testcases.
Stylistically your code looks very good. The only thing to improve there is adding some docstrings to your methods which explain how to use them.
If you were worried about Python 2 backward compatibility, you should inherit from object, but nowadays that is probably less of a concern (especially with one-off code for a coding challenge).
- \$\begingroup\$ Why defaultdict(int) instead of Counter? \$\endgroup\$ – Peter Taylor, Aug 28, 2019 at 15:54
-
\$\begingroup\$ Thank you! I understand now. I have one follow-up question though. Aren't tries' biggest use case finding prefixes? I'm asking because you said not just substrings of the names, they are prefixes, which makes me believe a hashmap is a better implementation for prefix finding rather than a trie. Or is it only true in this case because of the max limit on the length of the input strings? \$\endgroup\$ – Sidharth Samant, Sep 1, 2019 at 5:19
\$\begingroup\$ @SidharthSamant I think you might be right. In that case it is probably because the input size is fixed. With your trie you iterate through the nodes to get to the final node, and this iteration happens in Python. With the dictionary you look up a value in a hashmap, which happens in C. And since only the counts are needed, there is no extra memory needed (if each value contained all words seen so far with that prefix, it would take a lot more space than a trie). \$\endgroup\$ – Graipher, Sep 1, 2019 at 5:38
-
\$\begingroup\$ @PeterTaylor Because I did not need any of the other features of Counter and therefore did not think of it. However, now that you mention it, counter.update(name[:i+1] for i in range(len(name))) could be slightly more elegant than my for loop. \$\endgroup\$ – Graipher, Sep 1, 2019 at 5:41
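The Counter variant mentioned in that last comment would look roughly like this (a sketch only; it has not been run against the HackerRank test cases):

```python
from collections import Counter

# Counter can absorb all prefixes of a name in one update() call,
# and, like defaultdict(int), returns 0 for unseen keys
prefixes = Counter()
name = "hack"
prefixes.update(name[:i + 1] for i in range(len(name)))
print(prefixes["h"], prefixes["hac"], prefixes["xyz"])  # 1 1 0
```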