Check if multiple lists contain an element from a large external list

Question 1

I have a list of lists (many tokenised sentences). For anyone who doesn't know what tokenised sentences are, my list is like so:

list1 = [['hello', 'my', 'name'], ['this', 'is', 'stack', 'exchange'], ... ]

I have a list of key words as well, key_words.

For every sentence in list, I want to check if it is in key_words. Moreover, I want a single method to be applied to each sentence. Below is my working (but inefficient) code:

list1 = [['hello', 'my', 'name'], ['this', 'is', 'stack', 'exchange']]
key_words = ['hello', 'name', 'stack'] 
def get_features(sentence, key_words):
 return [word for word in sentence if word in key_words]
f = []
for sent in list1:
 f.append(get_features(sent, key_words))

This is fine, but my dimensions are like so:

len(list1) = 45,000
len(key_words) = 35,000

This is of course inefficient, and I would like to find a faster way of doing this. Could dictionaries be utilised in some way? I was thinking of changing key_words from a list to a dictionary of key:value = word:1. Then I could do something like

return [word for word in sentence if key_words[word] does not give error]

but I'm unsure how if does not give error would be implemented. Doing this would allow O(1) access to words in key_words if they are actually in there, rather than having to search the whole list until it is found, with O(n).

Question 2

Try key_words = set(['hello', 'name', 'stack']) and see if it's fast enough. Also you shouldn't name variables list, as it is a type name.

Question 3

You're right about your O(1) vs O(n) reasonning, but wrong about the data structure. Have you considered using sets?

Question 4

Rather then the one-liner, break that down into multiple lines, and look for where you need to catch the KeyError Exception

Question 5

The set data structure works excellently! Thanks, I've used your ideas to implement efficient code and I have posted an answer below.

Question 6

In the question, and as suggested by Mathias Ettinger, the reasoning behind locating an \$O(1)\$ search time-complexity as opposed to the current \$O(n)\$ complexity is correct.

However, the best approach is to use the set data structure instead of the list structure. Sets have \$O(1)\$ search time complexity since they are implemented using hash tables (https://wiki.python.org/moin/TimeComplexity) and they are similar to a list conceptually, and thus make much more sense than complicating things by using a dictionary.

The code (with the large dimensions mentioned in the question) runs in under 10 seconds like so:

list1 = [['hello', 'my', 'name'], ['this', 'is', 'stack', 'exchange']]
key_words = ['hello', 'name', 'stack'] 
def get_features(sentence, key_words):
 return [word for word in sentence if word in key_words]
f = []
key_words = set(key_words)
for sent in list1:
 f.append(get_features(sent, key_words))

Question 7

I don't know where your key_words come from, but if you define it as key_words = {'hello', 'name', 'stack'} you won't need key_words = set(key_words) further on

quanty quanty 2853 silver badges9 bronze badges · Answer 1 · 2018-03-15 15:42:11Z

In the question, and as suggested by Mathias Ettinger, the reasoning behind locating an \$O(1)\$ search time-complexity as opposed to the current \$O(n)\$ complexity is correct.

However, the best approach is to use the set data structure instead of the list structure. Sets have \$O(1)\$ search time complexity since they are implemented using hash tables (https://wiki.python.org/moin/TimeComplexity) and they are similar to a list conceptually, and thus make much more sense than complicating things by using a dictionary.

The code (with the large dimensions mentioned in the question) runs in under 10 seconds like so:

list1 = [['hello', 'my', 'name'], ['this', 'is', 'stack', 'exchange']]
key_words = ['hello', 'name', 'stack'] 
def get_features(sentence, key_words):
 return [word for word in sentence if word in key_words]
f = []
key_words = set(key_words)
for sent in list1:
 f.append(get_features(sent, key_words))

I don't know where your key_words come from, but if you define it as key_words = {'hello', 'name', 'stack'} you won't need key_words = set(key_words) further on

Stack Exchange Network

Check if multiple lists contain an element from a large external list

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Check if multiple lists contain an element from a large external list

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions