\$\begingroup\$

Here is the code I have written to create an inverted index dictionary for a set of documents:

inv_indx = {i:[] for i in corpus_dict}
for word in corpus_dict:
    for i in range(len(docs)):
        if word in docs[i]:
            inv_indx[word].append(i)

docs is a list of sets of the words in various documents:

[{'once','upon','a','time',...},{'lorum','ipsum','time'...},...]

corpus_dict is a set of all the words that appear in any of the documents:

{'once','upon','a','time','lorum','ipsum',...}

inv_indx becomes a dictionary that maps each word in corpus_dict to a list of the ids of the documents that contain that word:

{'once':[0],'time':[0,1],...}

The problem is this becomes very slow if the number of documents gets too big. How can I make this code more efficient?
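To make the cost concrete, here is a small self-contained sketch of the loop above. The two sample documents and the `checks` counter are made up for illustration and are not part of my real data; the counter only shows that the number of membership tests grows with the number of words times the number of documents.

```python
# Hypothetical sample data, made up for illustration only.
docs = [
    {"once", "upon", "a", "time"},
    {"lorum", "ipsum", "time"},
]
# Set of every word that appears in any document.
corpus_dict = set().union(*docs)

inv_indx = {word: [] for word in corpus_dict}
checks = 0  # illustrative counter of membership tests performed

for word in corpus_dict:
    for i in range(len(docs)):
        checks += 1
        if word in docs[i]:
            inv_indx[word].append(i)

print(checks)            # equals len(corpus_dict) * len(docs)
print(inv_indx["time"])  # [0, 1]
```

With 6 distinct words and 2 documents this performs 12 membership tests, although the documents contain only 7 word occurrences in total.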

Jamal
asked Mar 6, 2018 at 0:17
\$\endgroup\$

1 Answer

\$\begingroup\$

Suggestions

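  • You check every word in the corpus against every document. Instead, iterate over docs only once and append each document's id to the entries for the words it contains; this avoids the unnecessary membership checks.
  • Instead of creating the empty lists up front with inv_indx = {i:[] for i in corpus_dict}, you can use a defaultdict.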

Code

from collections import defaultdict

inv_indx = defaultdict(list)
for idx, text in enumerate(docs):
    for word in text:
        inv_indx[word].append(idx)
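As a rough check, the sketch below compares the original nested loop with the single pass on a tiny made-up corpus. The function names and sample documents are hypothetical, chosen only for this illustration. It also includes a plain-dict variant using `dict.setdefault`, in case you would rather not depend on `defaultdict`.

```python
from collections import defaultdict

# Hypothetical sample data, made up for illustration only.
docs = [
    {"once", "upon", "a", "time"},
    {"lorum", "ipsum", "time"},
]


def build_nested(docs):
    """Original approach: test every corpus word against every document."""
    corpus_dict = set().union(*docs)
    inv_indx = {word: [] for word in corpus_dict}
    for word in corpus_dict:
        for i in range(len(docs)):
            if word in docs[i]:
                inv_indx[word].append(i)
    return inv_indx


def build_single_pass(docs):
    """Single pass: visit each word of each document exactly once."""
    inv_indx = defaultdict(list)
    for idx, text in enumerate(docs):
        for word in text:
            inv_indx[word].append(idx)
    return dict(inv_indx)  # convert back to a plain dict for comparison


def build_plain_dict(docs):
    """Same single pass, using dict.setdefault instead of defaultdict."""
    inv_indx = {}
    for idx, text in enumerate(docs):
        for word in text:
            inv_indx.setdefault(word, []).append(idx)
    return inv_indx


# All three variants should produce the same index.
print(build_nested(docs) == build_single_pass(docs) == build_plain_dict(docs))
```

The single pass does work proportional to the total number of words across all documents, rather than to the number of distinct words multiplied by the number of documents. Document ids stay in ascending order within each list because `enumerate` visits the documents in order.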
answered Mar 6, 2018 at 1:11
\$\endgroup\$
  • \$\begingroup\$ Wow, that was so much faster. The defaultdict doesn't seem to have much impact on the performance though, so I'll leave that out. \$\endgroup\$ Commented Mar 6, 2018 at 14:57
