
My code compares two vectors saved as dictionaries (two pickle files) and saves the result into a pickle file as well. It works, but very slowly: a single comparison takes about 7:20 min, and since I have 2033 videos the program would run for about 10 days. That is too long. How can I speed up my code under Python 2.7?

import math
import csv
import pickle
from itertools import izip

global_ddc_file = 'E:/global_ddc.p'
io = 'E:/AV-Datensatz'
v_source = ''

def dot_product(v1, v2):
    return sum(map(lambda x: x[0] * x[1], izip(v1, v2)))  # izip('ABCD', 'xy') --> Ax By

def cosine_measure(v1, v2):
    prod = dot_product(v1, v2)
    len1 = math.sqrt(dot_product(v1, v1))
    len2 = math.sqrt(dot_product(v2, v2))
    if (len1 * len2) <> 0:
        out = prod / (len1 * len2)
    else:
        out = 0
    return out

def findSource(v):
    v_id = "/" + v[0].lstrip("<http://av.tib.eu/resource/video").rstrip(">")
    v_source = io + v_id
    v_file = v_source + '/vector.p'
    source = [v_id, v_source, v_file]
    return source

def getVector(v, vectorCol):
    with open(v, 'rb') as f:
        try:
            vector_v = pickle.load(f)
        except:
            print 'file couldnt be loaded'
    tf_idf = []
    tf_idf = [vec[1][vectorCol] for vec in vector_v]
    return tf_idf

def compareVectors(v1, v2, vectorCol):
    v1_source = findSource(v1)
    v2_source = findSource(v2)
    V1 = getVector(v1_source[2], vectorCol)
    V2 = getVector(v2_source[2], vectorCol)
    sim = [v1_source[0], v2_source[0], cosine_measure(V1, V2)]
    return sim

#with open('videos_av_portal_cc_3.0_nur2bspStanford.csv', 'rb') as dataIn:
with open('videos_av_portal_cc_3.0_vollstaendig.csv', 'rb') as dataIn:
#with open('videos_av_portal_cc_3.0.csv', 'rb') as dataIn:
    try:
        reader = csv.reader(dataIn)
        v_source = []
        for row in reader:
            v_source.append(findSource(row))
        #print v_source
        for one in v_source:
            print one[1]
            compVec = []
            for another in v_source:
                if one <> another:
                    compVec.append(compareVectors(one, another, 3))
            compVec_sort = sorted(compVec, key=lambda cosim: cosim[2], reverse=True)
            # save vector file for each video
            with open(one[1] + '/compare.p', 'wb') as f:
                pickle.dump(compVec_sort, f)
    finally:
        dataIn.close()
200_success
asked Sep 5, 2017 at 15:46
    \$\begingroup\$ Could you also provide an example of vectors you are comparing to make the problem reproducible? \$\endgroup\$ Commented Sep 5, 2017 at 15:54
  • \$\begingroup\$ You might want to consider testing it with numpy for the dot product. \$\endgroup\$ Commented Sep 5, 2017 at 17:42

2 Answers


One suggestion is to use numpy to vectorize your dot product.

>>> timeit.timeit('dot_product(range(10000),range(10000))', setup='from itertools import izip\ndef dot_product(v1, v2):\n return sum(map(lambda x: x[0] * x[1], izip(v1, v2)))', number=1000)
2.666857957839966
>>> timeit.timeit('np.dot(range(10000),range(10000))', setup='import numpy as np', number=1000)
0.9193508625030518
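
Following the same idea, the whole cosine_measure can be collapsed into a few numpy calls. A sketch (written in Python 3 syntax; the function name is hypothetical):

```python
import numpy as np

def cosine_measure_np(v1, v2):
    # Convert once to float arrays so the dot products run in C, not in a Python loop.
    a = np.asarray(v1, dtype=float)
    b = np.asarray(v2, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return np.dot(a, b) / denom if denom else 0.0
```

Here np.linalg.norm(a) plays the role of math.sqrt(dot_product(v1, v1)) in the question's code.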

Another suggestion is to run several comparisons at the same time. Since the work is CPU-bound, CPython's GIL means the threading module will not help much here; the multiprocessing module, which uses separate processes, is the better fit.
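A minimal sketch of the multiprocessing route (Python 3 syntax; on 2.7 you would call pool.close()/pool.join() instead of the context manager, and pair_similarity is a hypothetical stand-in for the question's compareVectors):

```python
import math
from multiprocessing import Pool

def cosine_measure(v1, v2):
    # Same cosine as in the question, written without izip for Python 3.
    prod = sum(x * y for x, y in zip(v1, v2))
    denom = math.sqrt(sum(x * x for x in v1)) * math.sqrt(sum(y * y for y in v2))
    return prod / denom if denom else 0.0

def pair_similarity(pair):
    # Runs in a worker process; each pair is scored independently.
    return cosine_measure(pair[0], pair[1])

if __name__ == '__main__':
    vectors = [[1.0, 2.0], [2.0, 4.0], [1.0, 0.0]]
    # Build each unordered pair once instead of twice.
    pairs = [(vectors[i], vectors[j])
             for i in range(len(vectors)) for j in range(i + 1, len(vectors))]
    with Pool(2) as pool:
        sims = pool.map(pair_similarity, pairs)
    print(sims)
```

The pairs are independent of each other, which is exactly the shape of work Pool.map distributes well.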

answered Sep 5, 2017 at 17:46

Some things I noticed:

  • You could omit math.sqrt() and work with the dot product directly, as you use the result for sorting only.
  • You compare v1 with v2 but also v2 with v1, which gives you the same result twice.
  • When comparing v1 with all the others, you could keep the dot product v1*v1 instead of recomputing it.
  • Or, even better, prepare all the "length" dot products vx*vx beforehand.
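The last three points combined can be sketched like this (Python 3 syntax; the function name is hypothetical, and math.sqrt is kept for clarity even though sorting alone would not need it):

```python
import math

def all_pair_similarities(vectors):
    # Prepare every "length" sqrt(vx*vx) once, up front.
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    sims = {}
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):  # each unordered pair exactly once
            denom = norms[i] * norms[j]
            prod = sum(x * y for x, y in zip(vectors[i], vectors[j]))
            sims[(i, j)] = prod / denom if denom else 0.0
    return sims
```

For n vectors this does n*(n-1)/2 comparisons instead of n*(n-1), and computes n norms instead of two per comparison.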
answered Sep 7, 2017 at 10:55
