
I have a list of lists (many tokenised sentences). For anyone who doesn't know what tokenised sentences are, my list is like so:

list1 = [['hello', 'my', 'name'], ['this', 'is', 'stack', 'exchange'], ... ]

I have a list of key words as well, key_words.

For every sentence in list1, I want to keep only the words that appear in key_words, applying a single function to each sentence. Below is my working (but inefficient) code:

list1 = [['hello', 'my', 'name'], ['this', 'is', 'stack', 'exchange']]
key_words = ['hello', 'name', 'stack']

def get_features(sentence, key_words):
    return [word for word in sentence if word in key_words]

f = []
for sent in list1:
    f.append(get_features(sent, key_words))

This is fine, but my dimensions are like so:

len(list1) = 45,000
len(key_words) = 35,000

This is of course inefficient, and I would like to find a faster way of doing it. Could dictionaries be utilised in some way? I was thinking of changing key_words from a list to a dictionary mapping each word to 1. Then I could do something like

return [word for word in sentence if key_words[word] does not give error]

but I'm unsure how the "does not give an error" check would be implemented. Doing this would allow O(1) lookups for words that are in key_words, rather than searching the whole list until a match is found, which is O(n).
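For what it's worth, the dictionary idea can be expressed without any error handling at all by testing membership in the dictionary directly. The sketch below is only an illustration of that approach; key_words_dict is just an illustrative name built from the key_words list above:

key_words_dict = {word: 1 for word in key_words}   # dict built from the key word list

def get_features(sentence, key_words_dict):
    # 'word in key_words_dict' tests the dict's keys, an average-case O(1) hash lookup,
    # so the KeyError that key_words_dict[word] would raise never comes up
    return [word for word in sentence if word in key_words_dict]

f = [get_features(sent, key_words_dict) for sent in list1]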

asked Mar 15, 2018 at 15:10
  • Try key_words = set(['hello', 'name', 'stack']) and see if it's fast enough. Also, you shouldn't name variables list, as it is a type name. Commented Mar 15, 2018 at 15:30
  • You're right about your O(1) vs O(n) reasoning, but wrong about the data structure. Have you considered using sets? Commented Mar 15, 2018 at 15:30
  • Rather than the one-liner, break that down into multiple lines, and look for where you need to catch the KeyError exception. Commented Mar 15, 2018 at 15:31
  • The set data structure works excellently! Thanks, I've used your ideas to implement efficient code and I have posted an answer below. Commented Mar 15, 2018 at 15:42

1 Answer


The reasoning in the question, and in Mathias Ettinger's comment, about aiming for \$O(1)\$ lookups instead of the current \$O(n)\$ search is correct.

However, the better data structure here is a set rather than a dictionary. Sets have \$O(1)\$ average-case membership tests because they are implemented with hash tables (https://wiki.python.org/moin/TimeComplexity), and conceptually a set is much closer to a list than a dictionary is, so it keeps the code simpler.

With key_words converted to a set, the code runs in under 10 seconds at the dimensions given in the question:

list1 = [['hello', 'my', 'name'], ['this', 'is', 'stack', 'exchange']]
key_words = ['hello', 'name', 'stack']

def get_features(sentence, key_words):
    return [word for word in sentence if word in key_words]

f = []
key_words = set(key_words)
for sent in list1:
    f.append(get_features(sent, key_words))
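If you want to sanity-check the timing at the sizes mentioned in the question, a rough benchmark sketch (using randomly generated synthetic tokens, purely for illustration) could look something like this:

import random
import string
import time

# build synthetic data at roughly the sizes from the question; purely illustrative
vocab = [''.join(random.choices(string.ascii_lowercase, k=5)) for _ in range(50_000)]
list1 = [random.choices(vocab, k=10) for _ in range(45_000)]
key_words = set(random.sample(vocab, 35_000))

def get_features(sentence, key_words):
    # membership tests against a set are average-case O(1)
    return [word for word in sentence if word in key_words]

start = time.perf_counter()
f = [get_features(sent, key_words) for sent in list1]
print(f"{time.perf_counter() - start:.2f} s")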
answered Mar 15, 2018 at 15:42
  • I don't know where your key_words come from, but if you define it as key_words = {'hello', 'name', 'stack'} you won't need key_words = set(key_words) further on. Commented Mar 15, 2018 at 15:53
