Here is the code I have written to create an inverted index dictionary for a set of documents:
    inv_indx = {i:[] for i in corpus_dict}
    for word in corpus_dict:
        for i in range(len(docs)):
            if word in docs[i]:
                inv_indx[word].append(i)
`docs` is a list of sets of the words in various documents:
    [{'once','upon','a','time',...},{'lorum','ipsum','time'...},...]
`corpus_dict` is a set of all the words that appear in any of the documents:
    {'once','upon','a','time','lorum','ipsum',...}
`inv_indx` becomes a dictionary that maps each word in `corpus_dict` to a list of the ids of the documents that contain that word:
    {'once':[0],'time':[0,1],...}
The problem is this becomes very slow if the number of documents gets too big. How can I make this code more efficient?
1 Answer
Suggestions
- You check every word against every document. Try iterating only over `docs`, which avoids the unnecessary membership checks.
- Instead of creating the empty `inv_indx = {i:[] for i in corpus_dict}` up front, you can use `defaultdict`.
Code
    from collections import defaultdict

    inv_indx = defaultdict(list)
    for idx, text in enumerate(docs):
        for word in text:
            inv_indx[word].append(idx)
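To see why this helps, and what it looks like without `defaultdict`, here is a minimal self-contained sketch. The sample `docs` and the function names `build_original` and `build_single_pass` are made up for illustration and are not from the question. The original loop does one membership test per (word, document) pair, while the single pass only touches the words each document actually contains. The plain-dict variant uses `dict.setdefault` in place of `defaultdict`.

```python
# Hypothetical sample data in the shape described in the question.
docs = [
    {"once", "upon", "a", "time"},
    {"lorum", "ipsum", "time"},
]
# Set of every word that appears in any document.
corpus_dict = set().union(*docs)


def build_original(docs, corpus_dict):
    """The question's approach: test every word against every document."""
    inv_indx = {word: [] for word in corpus_dict}
    for word in corpus_dict:
        for i in range(len(docs)):
            if word in docs[i]:
                inv_indx[word].append(i)
    return inv_indx


def build_single_pass(docs):
    """Single pass over the documents, using a plain dict."""
    inv_indx = {}
    for idx, text in enumerate(docs):
        for word in text:
            # setdefault inserts an empty list the first time a word is seen.
            inv_indx.setdefault(word, []).append(idx)
    return inv_indx


# Both approaches should produce the same index for the same input.
assert build_original(docs, corpus_dict) == build_single_pass(docs)
```

Because the documents are visited in increasing order of `idx`, each list of document ids comes out sorted, the same as in the original loop.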
- Wow, that was so much faster. The defaultdict doesn't seem to have much impact on the performance though, so I'll leave that out. – Joe, Mar 6, 2018 at 14:57