I have a list of lists (many tokenised sentences). For anyone who doesn't know what tokenised sentences are, my list is like so:
list1 = [['hello', 'my', 'name'], ['this', 'is', 'stack', 'exchange'], ... ]
I have a list of key words as well, key_words
.
For every sentence in list
, I want to check if it is in key_words
. Moreover, I want a single method to be applied to each sentence. Below is my working (but inefficient) code:
list1 = [['hello', 'my', 'name'], ['this', 'is', 'stack', 'exchange']]
key_words = ['hello', 'name', 'stack']
def get_features(sentence, key_words):
return [word for word in sentence if word in key_words]
f = []
for sent in list1:
f.append(get_features(sent, key_words))
This is fine, but my dimensions are like so:
len(list1) = 45,000
len(key_words) = 35,000
This is of course inefficient, and I would like to find a faster way of doing this. Could dictionaries be utilised in some way? I was thinking of changing key_words
from a list to a dictionary of key:value = word:1. Then I could do something like
return [word for word in sentence if key_words[word] does not give error]
but I'm unsure how if does not give error
would be implemented. Doing this would allow O(1) access to words in key_words if they are actually in there, rather than having to search the whole list until it is found, with O(n).
1 Answer 1
In the question, and as suggested by Mathias Ettinger, the reasoning behind locating an \$O(1)\$ search time-complexity as opposed to the current \$O(n)\$ complexity is correct.
However, the best approach is to use the set data structure instead of the list structure. Sets have \$O(1)\$ search time complexity since they are implemented using hash tables (https://wiki.python.org/moin/TimeComplexity) and they are similar to a list conceptually, and thus make much more sense than complicating things by using a dictionary.
The code (with the large dimensions mentioned in the question) runs in under 10 seconds like so:
list1 = [['hello', 'my', 'name'], ['this', 'is', 'stack', 'exchange']]
key_words = ['hello', 'name', 'stack']
def get_features(sentence, key_words):
return [word for word in sentence if word in key_words]
f = []
key_words = set(key_words)
for sent in list1:
f.append(get_features(sent, key_words))
-
4\$\begingroup\$ I don't know where your
key_words
come from, but if you define it askey_words = {'hello', 'name', 'stack'}
you won't needkey_words = set(key_words)
further on \$\endgroup\$301_Moved_Permanently– 301_Moved_Permanently2018年03月15日 15:53:16 +00:00Commented Mar 15, 2018 at 15:53
Explore related questions
See similar questions with these tags.
key_words = set(['hello', 'name', 'stack'])
and see if it's fast enough. Also you shouldn't name variableslist
, as it is a type name. \$\endgroup\$set
s? \$\endgroup\$