
I am working on an information retrieval project where I have to process ~1.5 GB of text data and create a dictionary (word, document frequency) and a posting list (document ID, term frequency). According to the professor, it should take around 10-15 minutes, but my code has been running for more than 8 hours now! I tried a smaller dataset (~35 MB) and it took 5 hours to process.

I am a newbie in Python, and I think it is taking so long because I have created many Python dictionaries and lists in my code. I tried to use a generator, but I am not sure how to work with it.

import re
import string
import json

import numpy as np
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

# read the whole collection into memory
file = open(filename, 'rt')
text = file.read()
file.close()

# split the collection into <P ID=...>...</P> passages
p = r'<P ID=\d+>.*?</P>'
tag = RegexpTokenizer(p)
passage = tag.tokenize(text)
doc_re = re.compile(r"<P ID=(\d+)>")

def process_data(docu):
    tokens = RegexpTokenizer(r'\w+')
    lower_tokens = [word.lower() for word in tokens.tokenize(docu)]  # convert to lower case
    table = str.maketrans('', '', string.punctuation)
    stripped = [w.translate(table) for w in lower_tokens]  # remove punctuation
    alpha = [word for word in stripped if word.isalpha()]  # remove tokens that are not alphabetic
    stopwordlist = stopwords.words('english')
    stopped = [w for w in alpha if w not in stopwordlist]  # remove stopwords
    return stopped

data = {}  # dictionary: key = doc ID, value: list of unique words/terms
for doc in passage:
    group_docID = doc_re.match(doc)
    docID = group_docID.group(1)
    tokens = process_data(doc)
    data[docID] = list(set(tokens))

vocab = [item for i in data.values() for item in i]  # all words in the dataset
total_vocab = list(set(vocab))  # unique words/vocabulary for the whole dataset
total_vocab.sort()

print('Document Size =', len(data))           # no. of documents
print('Collection Size =', len(vocab))        # no. of words
print('Vocabulary Size =', len(total_vocab))  # no. of unique words

inv_index = {}  # dictionary: key = word/term, value: list of (doc ID, term frequency)
for x in total_vocab:
    for y, z in data.items():
        if x in z:
            wordfreq = z.count(x)
            inv_index.setdefault(x, []).append((int(y), wordfreq))

flattened = [item for tag in inv_index.values() for item in tag]  # [(docid, tf)]
posting = [item for tag in flattened for item in tag]             # [docid, tf]

# document frequency for each vocabulary word
doc_freq = []
for k, v in inv_index.items():
    freq1 = len([item for item in v if item])
    doc_freq.append(freq1)

# offset value of each vocabulary word
offset = []
offset1 = 0
for i in range(len(doc_freq)):
    if i > 0:
        offset1 = offset1 + (doc_freq[i - 1] * 2)
    offset.append(offset1)

# create dictionary of words, document frequency and offset
dictionary = {}
for i in range(len(total_vocab)):
    dictionary[total_vocab[i]] = (doc_freq[i], offset[i])

# dictionary of word, inverse document frequency
idf = {}
for i in range(len(dictionary)):
    a = np.log2(len(data) / doc_freq[i])
    idf[total_vocab[i]] = a

with open('dictionary.json', 'w') as f:
    json.dump(dictionary, f)
with open('idf.json', 'w') as f:
    json.dump(idf, f)

# write the flat posting list as 4-byte big-endian integers
binary_file = open('binary_file.txt', 'wb')
for i in range(0, len(posting)):
    binary_int = posting[i].to_bytes(4, byteorder='big')
    binary_file.write(binary_int)
binary_file.close()

Could someone please help me rewrite this code so that it is more computationally and time efficient?

There are around 57,982 documents like the one below. Input file:

<P ID=2630932>
Background
Adrenal cortex oncocytic carcinoma (AOC) represents an exceptional 
pathological entity, since only 22 cases have been documented in the 
literature so far.
Case presentation
Our case concerns a 54-year-old man with past medical history of right 
adrenal excision with partial hepatectomy, due to an adrenocortical 
carcinoma. The patient was admitted in our hospital to undergo surgical 
resection of a left lung mass newly detected on chest Computed Tomography 
scan. The histological and immunohistochemical study revealed a metastatic 
AOC. Although the patient was given mitotane orally in adjuvant basis, he 
experienced relapse with multiple metastases in the thorax twice.....
</P>

I am trying to tokenize each document by word and store the document frequency of each word in a dictionary, which I then save to a JSON file. Dictionary:

word      document_frequency  offset
medical   2500                3414
research  320                 4200

I am also generating an index where each word has a posting list of (document ID, term frequency) pairs:

medical (2630932, 20), (2795320, 2), (26350421, 31).... 
research (2783546, 243), (28517364, 310)....

and then save these postings in a binary file:

2630932 20 2795320 2 2635041 31....

with an offset value for each word, so that when I load the posting list from disk, I can use the seek function to get the postings for each corresponding word.
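For illustration, here is a minimal sketch of how that lookup could work later, assuming the file layout the code above produces (offsets count 4-byte integers, doc_freq pairs per word); the read_postings helper and its name are mine, not part of the original code:

import json

def read_postings(word, dictionary, postings_path='binary_file.txt'):
    """Seek to a word's offset and decode its (doc_id, tf) pairs."""
    doc_freq, offset = dictionary[word]
    with open(postings_path, 'rb') as f:
        f.seek(offset * 4)              # each integer is 4 bytes
        raw = f.read(doc_freq * 2 * 4)  # doc_freq pairs of (doc_id, tf)
    ints = [int.from_bytes(raw[i:i + 4], byteorder='big')
            for i in range(0, len(raw), 4)]
    return list(zip(ints[0::2], ints[1::2]))  # [(doc_id, tf), ...]

with open('dictionary.json') as f:
    dictionary = json.load(f)
print(read_postings('medical', dictionary))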

asked Oct 12, 2019 at 23:08
  • Please provide a sample of the data, fill in the missing import statements, and provide examples of input and output to help reviewers understand what you want to achieve with this code. – Commented Oct 12, 2019 at 23:45
  • I have edited my question with the proper input and output format. – Commented Oct 13, 2019 at 0:27
  • Your imports are still missing. What is RegexpTokenizer? – Commented Oct 13, 2019 at 10:27
  • There are profiling utilities available with Python as part of the standard distribution. However, they report their results by the function the lines are in. As a first step towards profiling, I suggest you break your code into a series of functions. One per paragraph would do -- just take the comment at the top of the paragraph as the function name, and call them in order. – Commented Oct 13, 2019 at 14:23
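To make that last comment concrete, a minimal profiling sketch using the standard cProfile and pstats modules might look like this; build_index is a hypothetical name for whatever top-level function the script gets wrapped in:

import cProfile
import pstats

# build_index is hypothetical: wrap the script's paragraphs in functions first,
# then call them in order from one top-level function.
cProfile.run('build_index()', 'stats.prof')

# report the ten most expensive calls by cumulative time
pstats.Stats('stats.prof').sort_stats('cumulative').print_stats(10)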

1 Answer

One way to probably speed this up a bit is to use generator expressions. Currently your process_data function chains many list comprehensions, each of which builds a full list in memory, even though you only care about the end result. In addition, you call set on the result directly afterwards, so include that in the function itself. I also extracted some constants from the function, since there is no need to redefine them on every call, and made the stopwords a set so that in is \$\mathcal{O}(1)\$ instead of \$\mathcal{O}(n)\$.

STOPWORDS = set(stopwords.words('english'))
TOKENIZER = RegexpTokenizer(r'\w+')
TABLE = str.maketrans('', '', string.punctuation)

def process_data(docu):
    # convert to lower case
    lower_tokens = (word.lower() for word in TOKENIZER.tokenize(docu))
    # remove punctuation
    stripped = (w.translate(TABLE) for w in lower_tokens)
    # remove tokens that are not alphabetic
    alpha = (word for word in stripped if word.isalpha())
    # remove stopwords
    stopped = (w for w in alpha if w not in STOPWORDS)
    return set(stopped)

I would also keep data[docID] a set, because you later check for membership (in) there as well.
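In other words, the document loop could simply store the set that the revised process_data already returns (a sketch using the question's variable names):

data = {}  # docID -> set of unique terms in that document
for doc in passage:
    docID = doc_re.match(doc).group(1)
    data[docID] = process_data(doc)  # already a set; no list(set(...)) round-trip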

Similarly, you can directly build a set using a set comprehension, instead of creating a list and then putting it into a set. This way you can turn

vocab = [item for i in data.values() for item in i]  # all words in the dataset
total_vocab = list(set(vocab))  # unique words/vocabulary for the whole dataset
total_vocab.sort()

into

total_vocab = sorted({item for i in data.values() for item in i})

Incidentally, wordfreq = z.count(x) should always give you 1, because you made sure before that z only has unique words.

Instead of inv_index being a normal dictionary and having to use inv_index.setdefault(x, []).append((int(y), wordfreq)), just make it a collections.defaultdict(list), so you can simply do inv_index[x].append((int(y), wordfreq)).
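Put together, the index-building loop might then read like this (keeping the original nested-loop structure; only the dictionary type changes):

from collections import defaultdict

inv_index = defaultdict(list)  # word -> list of (doc_id, term_frequency)
for x in total_vocab:
    for y, z in data.items():
        if x in z:
            # missing keys get a fresh empty list automatically
            inv_index[x].append((int(y), z.count(x)))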

answered Oct 13, 2019 at 10:26
  • Assuming that RegexpTokenizer does what it claims to do, a lot of that function could be discarded just by using a different regexp (see the sketch after these comments). – Commented Oct 13, 2019 at 14:21
  • Any example for using a different regexp? I am a newbie in Python! – Commented Oct 13, 2019 at 16:32
  • Thank you! Converting lists to sets really saved some time! – Commented Oct 13, 2019 at 18:06
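For what it's worth, one possible reading of the first comment, purely as an illustrative sketch (the exact pattern is my assumption, not from the thread): lowercase the text first, then tokenize with a pattern that only matches alphabetic runs, which makes the punctuation-stripping and isalpha steps redundant.

TOKENIZER = RegexpTokenizer(r'[a-z]+')  # matches only runs of lowercase letters

def process_data(docu):
    # lowercasing first means the pattern also catches capitalised words
    return set(TOKENIZER.tokenize(docu.lower())) - STOPWORDS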
