I am working on the following Python 2.7 script, which "works" on small files.
I ran it on a sample input file of 188 kB and it took approximately 1.15 minutes to complete. However, I need to process a 5 GB file, and by my math the script as written would take 11.48 years to finish.
Sample of input1:
aba_transit_number com
abaca plt|sub|sub|sub
abacus art|art
abalone anm
abamp qud
Sample of input2:
zoonosis-n of+n-j+n-the-development-n
zoonosis-n of+n-j+n-the-j-collection-n 1
zoonosis-n of+n-j+n-the-j-success-n 1
Can someone give me insight into how to optimize this script for computation speed?
#!/usr/bin/python
# -*- coding: utf-8 -*-
from __future__ import division
from collections import defaultdict, Counter
import codecs
import random

mapping = dict()

#### takes as input1 a file with the format shown above
with codecs.open("input1", "rb", "utf-8") as oSenseFile:
    for line in oSenseFile:
        concept, conceptClass = line.split()
        mapping[concept + '-n'] = conceptClass

lemmas = set()

#### takes as input2 a file with the format shown above
with codecs.open('input2', "rb", "utf-8") as oIndexFile:
    for line in oIndexFile:
        lemma = line.split()[0]
        if lemma in mapping.keys():
            lemmas.add(lemma)
### randomly splits input2 into 2 parts -- 80% and 20%
# -- and prints the 20% directly into out1; for the other 80%
# --- it matches each 1st column in input2 with the first column in input1
# -- if it is a match, it replaces it with the corresponding value in Col2 of input1
# --- if there is more than one value in Col2 of input1,
# -- it prints all of the possible combinations and divides the freq (Col4 in input2)
# by the number of values present
training_lemmas = random.sample(lemmas, int(len(lemmas) * 0.8))

classFreqs = defaultdict(lambda: Counter())

with codecs.open('out1', 'wb', 'utf-8') as testOutfile:
    with codecs.open('input2', "rb", "utf-8") as oIndexFile:
        for line in oIndexFile:
            lemmaTAR, slot, filler, freq = line.split()
            if lemmaTAR in training_lemmas:
                senses = mapping[lemmaTAR].split(u'|')
                for sense in senses:
                    classFreqs[sense][tuple([slot, filler])] += int(freq) / len(senses)
            elif lemmaTAR in lemmas:
                testOutfile.write(line)

with codecs.open('out2', 'wb', 'utf-8') as oOutFile:
    for sense in sorted(classFreqs.keys()):
        for slotfill in classFreqs[sense].keys():
            string_slotfill = '\t'.join(list(slotfill))
            outstring = '\t'.join([sense, string_slotfill, str(classFreqs[sense][slotfill])])
            oOutFile.write(outstring + '\n')
1 Answer
Remove all usages of the .keys() method. (Note: this was already mentioned in the comments to your question, but it seems to have mostly done the trick for your problem.) In Python 2, mapping.keys() builds a fresh list of all the keys and the in test then scans that list linearly, once per input line; writing if lemma in mapping: performs the same check as a single hash lookup. Likewise, make training_lemmas a set: random.sample returns a list, so lemmaTAR in training_lemmas currently scans the whole list for every line of input2.
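For illustration, here is a minimal sketch of how those two changes might look, keeping your file and variable names; only the affected parts of the script are shown, the rest stays as you wrote it:

#!/usr/bin/python
# -*- coding: utf-8 -*-
# Sketch of the two suggested changes only; the rest of the script is unchanged.
import codecs
import random

mapping = dict()
with codecs.open("input1", "rb", "utf-8") as oSenseFile:
    for line in oSenseFile:
        concept, conceptClass = line.split()
        mapping[concept + '-n'] = conceptClass

lemmas = set()
with codecs.open("input2", "rb", "utf-8") as oIndexFile:
    for line in oIndexFile:
        lemma = line.split()[0]
        # 'in' on a dict checks the keys with one hash lookup;
        # 'in mapping.keys()' builds a list and scans it for every line
        if lemma in mapping:
            lemmas.add(lemma)

# random.sample returns a list; membership tests on a list are linear scans,
# so wrap the result in a set to make 'lemmaTAR in training_lemmas' a hash lookup
training_lemmas = set(random.sample(lemmas, int(len(lemmas) * 0.8)))

Neither change affects the output: both only change how the membership tests are carried out, from linear scans to constant-time lookups.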