
I am working on an information retrieval project where I have to process ~1.5 GB of text data and create a dictionary (word, document frequency) and a posting list (document ID, term frequency). According to the professor, it should take around 10-15 minutes, but my code has been running for more than 8 hours now! I tried a smaller dataset (~35 MB) and it took 5 hours to process.

I am a newbie in Python, and I think it is taking so long because I have created many Python dictionaries and lists in my code. I tried to use a generator, but I am not sure how to work with it.

import re
import string
import json

import numpy as np
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

# read the whole collection into memory
file = open(filename, 'rt')
text = file.read()
file.close()

# split the collection into <P ID=...>...</P> passages
p = r'<P ID=\d+>.*?</P>'
tag = RegexpTokenizer(p)
passage = tag.tokenize(text)
doc_re = re.compile(r"<P ID=(\d+)>")

def process_data(docu):
    tokens = RegexpTokenizer(r'\w+')
    lower_tokens = [word.lower() for word in tokens.tokenize(docu)]  # convert to lower case
    table = str.maketrans('', '', string.punctuation)
    stripped = [w.translate(table) for w in lower_tokens]  # remove punctuation
    alpha = [word for word in stripped if word.isalpha()]  # remove tokens that are not alphabetic
    stopwordlist = stopwords.words('english')
    stopped = [w for w in alpha if w not in stopwordlist]  # remove stopwords
    return stopped

data = {}  # dictionary: key = doc ID, value: list of unique words/terms
for doc in passage:
    group_docID = doc_re.match(doc)
    docID = group_docID.group(1)
    tokens = process_data(doc)
    data[docID] = list(set(tokens))

vocab = [item for i in data.values() for item in i]  # all words in the dataset
total_vocab = list(set(vocab))  # unique words/vocabulary for the whole dataset
total_vocab.sort()

print('Document Size =', len(data))           # no. of documents
print('Collection Size =', len(vocab))        # no. of words
print('Vocabulary Size =', len(total_vocab))  # no. of unique words

inv_index = {}  # dictionary: key = word/term, value: list of (doc ID, term frequency)
for x in total_vocab:
    for y, z in data.items():
        if x in z:
            wordfreq = z.count(x)
            inv_index.setdefault(x, []).append((int(y), wordfreq))

flattened = [item for tag in inv_index.values() for item in tag]  # [(docid, tf)]
posting = [item for tag in flattened for item in tag]             # [docid, tf]

# document frequency for each vocabulary word
doc_freq = []
for k, v in inv_index.items():
    freq1 = len([item for item in v if item])
    doc_freq.append(freq1)

# offset value of each vocabulary word
offset = []
offset1 = 0
for i in range(len(doc_freq)):
    if i > 0:
        offset1 = offset1 + (doc_freq[i - 1] * 2)
    offset.append(offset1)

# create dictionary of words, document frequency and offset
dictionary = {}
for i in range(len(total_vocab)):
    dictionary[total_vocab[i]] = (doc_freq[i], offset[i])

# dictionary of word, inverse document frequency
idf = {}
for i in range(len(dictionary)):
    a = np.log2(len(data) / doc_freq[i])
    idf[total_vocab[i]] = a

with open('dictionary.json', 'w') as f:
    json.dump(dictionary, f)
with open('idf.json', 'w') as f:
    json.dump(idf, f)

# write the flat posting list as 4-byte big-endian integers
binary_file = open('binary_file.txt', 'wb')
for i in range(0, len(posting)):
    binary_int = posting[i].to_bytes(4, byteorder='big')
    binary_file.write(binary_int)
binary_file.close()

Could someone please help me rewrite this code so that it is more computationally and time efficient?

There are around 57,982 documents like the one below. Input file:

<P ID=2630932>
Background
Adrenal cortex oncocytic carcinoma (AOC) represents an exceptional 
pathological entity, since only 22 cases have been documented in the 
literature so far.
Case presentation
Our case concerns a 54-year-old man with past medical history of right 
adrenal excision with partial hepatectomy, due to an adrenocortical 
carcinoma. The patient was admitted in our hospital to undergo surgical 
resection of a left lung mass newly detected on chest Computed Tomography 
scan. The histological and immunohistochemical study revealed a metastatic 
AOC. Although the patient was given mitotane orally in adjuvant basis, he 
experienced relapse with multiple metastases in the thorax twice.....
</P>

I am trying to tokenize each document by word and store the document frequency of each word in a dictionary, which I then save to a JSON file. Dictionary:

word      document_frequency  offset
medical   2500                3414
research  320                 4200

I am also generating an index where each word has a posting list of (document ID, term frequency) pairs:

medical (2630932, 20), (2795320, 2), (26350421, 31).... 
research (2783546, 243), (28517364, 310)....

and then save these postings in a binary file:

2630932 20 2795320 2 2635041 31....

with an offset value for each word, so that when I load the posting list from disk, I can use the seek function to get the postings for each corresponding word.
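For illustration, here is a minimal sketch of how that lookup could work later, assuming the file layout the code above produces (offsets count 4-byte integers, doc_freq pairs per word); the read_postings helper and its name are mine, not part of the original code:

import json

def read_postings(word, dictionary, postings_path='binary_file.txt'):
    """Seek to a word's offset and decode its (doc_id, tf) pairs."""
    doc_freq, offset = dictionary[word]
    with open(postings_path, 'rb') as f:
        f.seek(offset * 4)              # each integer is 4 bytes
        raw = f.read(doc_freq * 2 * 4)  # doc_freq pairs of (doc_id, tf)
    ints = [int.from_bytes(raw[i:i + 4], byteorder='big')
            for i in range(0, len(raw), 4)]
    return list(zip(ints[0::2], ints[1::2]))  # [(doc_id, tf), ...]

with open('dictionary.json') as f:
    dictionary = json.load(f)
print(read_postings('medical', dictionary))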

asked Oct 12, 2019 at 23:08
  • Please provide a sample of the data, fill in the missing import statements, and provide examples of input and output to help reviewers understand what you want to achieve with this code. – Commented Oct 12, 2019 at 23:45
  • I have edited my question with the proper input and output format. – Commented Oct 13, 2019 at 0:27
  • Your imports are still missing. What is RegexpTokenizer? – Commented Oct 13, 2019 at 10:27
  • There are profiling utilities available with Python as part of the standard distribution. However, they report their results by the function the lines are in. As a first step towards profiling, I suggest you break your code into a series of functions. One per paragraph would do -- just take the comment at the top of the paragraph as the function name, and call them in order. – Commented Oct 13, 2019 at 14:23
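To make that last comment concrete, a minimal profiling sketch using the standard cProfile and pstats modules might look like this; build_index is a hypothetical name for whatever top-level function the script gets wrapped in:

import cProfile
import pstats

# build_index is hypothetical: wrap the script's paragraphs in functions first,
# then call them in order from one top-level function.
cProfile.run('build_index()', 'stats.prof')

# report the ten most expensive calls by cumulative time
pstats.Stats('stats.prof').sort_stats('cumulative').print_stats(10)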

1 Answer

One way to probably speed this up a bit is to use generator expressions. Currently your process_data function chains many list comprehensions, each of which builds a full list in memory, even though you only care about the end result. In addition, you call set on the result directly afterwards, so include that in the function itself. I also extracted some constants from the function, since there is no need to redefine them on every call, and made the stopwords a set so that in is \$\mathcal{O}(1)\$ instead of \$\mathcal{O}(n)\$.

STOPWORDS = set(stopwords.words('english'))
TOKENIZER = RegexpTokenizer(r'\w+')
TABLE = str.maketrans('', '', string.punctuation)

def process_data(docu):
    # convert to lower case
    lower_tokens = (word.lower() for word in TOKENIZER.tokenize(docu))
    # remove punctuation
    stripped = (w.translate(TABLE) for w in lower_tokens)
    # remove tokens that are not alphabetic
    alpha = (word for word in stripped if word.isalpha())
    # remove stopwords
    stopped = (w for w in alpha if w not in STOPWORDS)
    return set(stopped)

I would also keep data[docID] a set, because you later check for membership (in) there as well.
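In other words, the document loop could simply store the set that the revised process_data already returns (a sketch using the question's variable names):

data = {}  # docID -> set of unique terms in that document
for doc in passage:
    docID = doc_re.match(doc).group(1)
    data[docID] = process_data(doc)  # already a set; no list(set(...)) round-trip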

Similarly, you can directly build a set using a set comprehension, instead of creating a list and then putting it into a set. This way you can turn

vocab = [item for i in data.values() for item in i]  # all words in the dataset
total_vocab = list(set(vocab))  # unique words/vocabulary for the whole dataset
total_vocab.sort()

into

total_vocab = sorted({item for i in data.values() for item in i})

Incidentally, wordfreq = z.count(x) should always give you 1, because you made sure before that z only has unique words.

Instead of inv_index being a normal dictionary and having to use inv_index.setdefault(x, []).append((int(y), wordfreq)), just make it a collections.defaultdict(list), so you can simply do inv_index[x].append((int(y), wordfreq)).
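Put together, the index-building loop might then read like this (keeping the original nested-loop structure; only the dictionary type changes):

from collections import defaultdict

inv_index = defaultdict(list)  # word -> list of (doc_id, term_frequency)
for x in total_vocab:
    for y, z in data.items():
        if x in z:
            # missing keys get a fresh empty list automatically
            inv_index[x].append((int(y), z.count(x)))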

answered Oct 13, 2019 at 10:26
  • Assuming that RegexpTokenizer does what it claims to do, a lot of that function could be discarded just by using a different regexp (see the sketch after these comments). – Commented Oct 13, 2019 at 14:21
  • Any example for using a different regexp? I am a newbie in Python! – Commented Oct 13, 2019 at 16:32
  • Thank you! Converting lists to sets really saved some time! – Commented Oct 13, 2019 at 18:06
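For what it's worth, one possible reading of the first comment, purely as an illustrative sketch (the exact pattern is my assumption, not from the thread): lowercase the text first, then tokenize with a pattern that only matches alphabetic runs, which makes the punctuation-stripping and isalpha steps redundant.

TOKENIZER = RegexpTokenizer(r'[a-z]+')  # matches only runs of lowercase letters

def process_data(docu):
    # lowercasing first means the pattern also catches capitalised words
    return set(TOKENIZER.tokenize(docu.lower())) - STOPWORDS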
