
My code compares two vectors saved as dictionaries (two pickle files) and saves the result into a pickle file as well. It works, but very slowly: a single comparison takes about 7:20 min, and since I have 2033 videos the program would run for about 10 days. That is too long. How can I speed up my code under Python 2.7?

import math
import csv
import pickle
from itertools import izip

global_ddc_file = 'E:/global_ddc.p'
io = 'E:/AV-Datensatz'
v_source = ''

def dot_product(v1, v2):
    return sum(map(lambda x: x[0] * x[1], izip(v1, v2)))  # izip('ABCD', 'xy') --> Ax By

def cosine_measure(v1, v2):
    prod = dot_product(v1, v2)
    len1 = math.sqrt(dot_product(v1, v1))
    len2 = math.sqrt(dot_product(v2, v2))
    if (len1 * len2) <> 0:
        out = prod / (len1 * len2)
    else:
        out = 0
    return out

def findSource(v):
    v_id = "/" + v[0].lstrip("<http://av.tib.eu/resource/video").rstrip(">")
    v_source = io + v_id
    v_file = v_source + '/vector.p'
    source = [v_id, v_source, v_file]
    return source

def getVector(v, vectorCol):
    with open(v, 'rb') as f:
        try:
            vector_v = pickle.load(f)
        except:
            print 'file couldnt be loaded'
    tf_idf = []
    tf_idf = [vec[1][vectorCol] for vec in vector_v]
    return tf_idf

def compareVectors(v1, v2, vectorCol):
    v1_source = findSource(v1)
    v2_source = findSource(v2)
    V1 = getVector(v1_source[2], vectorCol)
    V2 = getVector(v2_source[2], vectorCol)
    sim = [v1_source[0], v2_source[0], cosine_measure(V1, V2)]
    return sim

#with open('videos_av_portal_cc_3.0_nur2bspStanford.csv', 'rb') as dataIn:
with open('videos_av_portal_cc_3.0_vollstaendig.csv', 'rb') as dataIn:
#with open('videos_av_portal_cc_3.0.csv', 'rb') as dataIn:
    try:
        reader = csv.reader(dataIn)
        v_source = []
        for row in reader:
            v_source.append(findSource(row))
        #print v_source
        for one in v_source:
            print one[1]
            compVec = []
            for another in v_source:
                if one <> another:
                    compVec.append(compareVectors(one, another, 3))
            compVec_sort = sorted(compVec, key=lambda cosim: cosim[2], reverse=True)
            # save vector file for each video
            with open(one[1] + '/compare.p', 'wb') as f:
                pickle.dump(compVec_sort, f)
    finally:
        dataIn.close()
200_success
asked Sep 5, 2017 at 15:46
    \$\begingroup\$ Could you also provide an example of vectors you are comparing to make the problem reproducible? \$\endgroup\$ Commented Sep 5, 2017 at 15:54
  • \$\begingroup\$ You might want to consider testing it with numpy for the dot product. \$\endgroup\$ Commented Sep 5, 2017 at 17:42

2 Answers


One suggestion is to use numpy to vectorize your dot product.

>>> timeit.timeit('dot_product(range(10000),range(10000))', setup='from itertools import izip\ndef dot_product(v1, v2):\n return sum(map(lambda x: x[0] * x[1], izip(v1, v2)))', number=1000)
2.666857957839966
>>> timeit.timeit('np.dot(range(10000),range(10000))', setup='import numpy as np', number=1000)
0.9193508625030518
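
Following the same idea, the whole cosine_measure can be collapsed into a few numpy calls. A sketch (written in Python 3 syntax; the function name is hypothetical):

```python
import numpy as np

def cosine_measure_np(v1, v2):
    # Convert once to float arrays so the dot products run in C, not in a Python loop.
    a = np.asarray(v1, dtype=float)
    b = np.asarray(v2, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return np.dot(a, b) / denom if denom else 0.0
```

Here np.linalg.norm(a) plays the role of math.sqrt(dot_product(v1, v1)) in the question's code.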

Another suggestion is to run several comparisons at the same time. Since the work is CPU-bound, CPython's GIL means the threading module will not help much here; the multiprocessing module, which uses separate processes, is the better fit.
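A minimal sketch of the multiprocessing route (Python 3 syntax; on 2.7 you would call pool.close()/pool.join() instead of the context manager, and pair_similarity is a hypothetical stand-in for the question's compareVectors):

```python
import math
from multiprocessing import Pool

def cosine_measure(v1, v2):
    # Same cosine as in the question, written without izip for Python 3.
    prod = sum(x * y for x, y in zip(v1, v2))
    denom = math.sqrt(sum(x * x for x in v1)) * math.sqrt(sum(y * y for y in v2))
    return prod / denom if denom else 0.0

def pair_similarity(pair):
    # Runs in a worker process; each pair is scored independently.
    return cosine_measure(pair[0], pair[1])

if __name__ == '__main__':
    vectors = [[1.0, 2.0], [2.0, 4.0], [1.0, 0.0]]
    # Build each unordered pair once instead of twice.
    pairs = [(vectors[i], vectors[j])
             for i in range(len(vectors)) for j in range(i + 1, len(vectors))]
    with Pool(2) as pool:
        sims = pool.map(pair_similarity, pairs)
    print(sims)
```

The pairs are independent of each other, which is exactly the shape of work Pool.map distributes well.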

answered Sep 5, 2017 at 17:46

Some things I noticed:

  • You could omit math.sqrt() and work with the dot product directly, as you use the result for sorting only.
  • You compare v1 with v2 but also v2 with v1, which gives you the same result twice.
  • When comparing v1 with all the others, you could keep the dot product v1*v1 instead of recomputing it.
  • Or, even better, prepare all the "length" dot products vx*vx beforehand.
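The last three points combined can be sketched like this (Python 3 syntax; the function name is hypothetical, and math.sqrt is kept for clarity even though sorting alone would not need it):

```python
import math

def all_pair_similarities(vectors):
    # Prepare every "length" sqrt(vx*vx) once, up front.
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    sims = {}
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):  # each unordered pair exactly once
            denom = norms[i] * norms[j]
            prod = sum(x * y for x, y in zip(vectors[i], vectors[j]))
            sims[(i, j)] = prod / denom if denom else 0.0
    return sims
```

For n vectors this does n*(n-1)/2 comparisons instead of n*(n-1), and computes n norms instead of two per comparison.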
answered Sep 7, 2017 at 10:55
