
I am working on the following Python 2.7 script, which works on small files.

I ran a sample on an input file of 188 kB and it took approximately 1.15 minutes to complete. However, I need to process a 5 GB file with this script, and by my math it would take about 11.48 years to finish the way it is now.

Sample input1:

aba_transit_number com
abaca plt|sub|sub|sub
abacus art|art
abalone anm
abamp qud

Sample input2:

zoonosis-n of+n-j+n-the-development-n 
zoonosis-n of+n-j+n-the-j-collection-n 1
zoonosis-n of+n-j+n-the-j-success-n 1

Can someone give me insight into how to optimize my script for speed?

    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    from __future__ import division
    from collections import defaultdict, Counter
    import codecs
    import random

    mapping = dict()
    #### takes as input a file with the following format (input1):
    with codecs.open("input1", "rb", "utf-8") as oSenseFile:
        for line in oSenseFile:
            concept, conceptClass = line.split()
            mapping[concept + '-n'] = conceptClass

    lemmas = set()
    #### takes as input a file with the following format (input2):
    with codecs.open('input2', "rb", "utf-8") as oIndexFile:
        for line in oIndexFile:
            lemma = line.split()[0]
            if lemma in mapping.keys():
                lemmas.add(lemma)

    ### randomly splits input2 into 2 files -- 80% and 20%
    # -- and prints the 20% directly into out1; for the other 80%
    # -- it matches each 1st column in input2 with the 1st column in input1
    # -- if it is a match, it replaces it with the corresponding value in Col2 of input1
    # -- if there is more than one value in Col2 of input1,
    # -- it prints all of the possible combinations and divides the freq (Col4 in input2)
    # -- by the number of values present
    training_lemmas = random.sample(lemmas, int(len(lemmas) * 0.8))
    classFreqs = defaultdict(lambda: Counter())
    with codecs.open('out1', 'wb', 'utf-8') as testOutfile:
        with codecs.open('input2', "rb", "utf-8") as oIndexFile:
            for line in oIndexFile:
                lemmaTAR, slot, filler, freq = line.split()
                if lemmaTAR in training_lemmas:
                    senses = mapping[lemmaTAR].split(u'|')
                    for sense in senses:
                        classFreqs[sense][tuple([slot, filler])] += int(freq) / len(senses)
                elif lemmaTAR in lemmas:
                    testOutfile.write(line)
    with codecs.open('out2', 'wb', 'utf-8') as oOutFile:
        for sense in sorted(classFreqs.keys()):
            for slotfill in classFreqs[sense].keys():
                string_slotfill = '\t'.join(list(slotfill))
                outstring = '\t'.join([sense, string_slotfill, str(classFreqs[sense][slotfill])])
                oOutFile.write(outstring + '\n')
asked Nov 11, 2013 at 11:45
Comments:

  • Don't write .keys()! -- Commented Nov 11, 2013 at 11:50
  • See updated question -- okay, simply removing .keys() will improve speed? -- Commented Nov 11, 2013 at 11:58
  • Yes, pretty much. See §3 of this answer for an explanation. -- Commented Nov 11, 2013 at 12:18
  • Make training_lemmas a set. -- Commented Nov 11, 2013 at 12:52
  • You mention a 5 GB file but you have two inputs. Which is large, or both? -- Commented Nov 11, 2013 at 13:04

1 Answer


Remove all uses of the keys method. This was already mentioned in the comments on your question, and it seems to have mostly done the trick for your problem.
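To spell out why (a minimal sketch with made-up data, reusing the question's variable names): in Python 2, `dict.keys()` returns a fresh list, so `x in d.keys()` is a linear scan repeated for every line of the 5 GB file, while `x in d` is a constant-time hash lookup. The same reasoning applies to `training_lemmas`: `random.sample` returns a list, so wrap it in a `set` before doing millions of membership tests.

```python
import random

# Toy mapping in the shape of the question's input1 (hypothetical data).
mapping = {"abacus-n": "art|art", "abalone-n": "anm", "abamp-n": "qud"}

# Slow in Python 2: .keys() materializes a list, so `in` scans it linearly
# for every input line.
slow = "abacus-n" in mapping.keys()

# Fast: `in` on the dict itself is a hash lookup.
fast = "abacus-n" in mapping
assert slow == fast

# random.sample returns a list; converting it to a set makes
# `lemmaTAR in training_lemmas` a hash lookup per line as well.
lemmas = set(mapping)
training_lemmas = set(random.sample(sorted(lemmas), int(len(lemmas) * 0.8)))
assert training_lemmas <= lemmas
```

Both changes keep the script's behavior identical; they only change the cost of each membership test from O(n) to O(1).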

answered Nov 19, 2013 at 16:00