I coded this Support Vector Regression (SVR) myself following some equations in a journal (see here, or here (not in English)). The loss function used by the journal and the code below is mean absolute percentage error (MAPE).
I need to make it run faster because I will call this function 1600 times during evaluation. With the current code, a single run would take a couple of days or even a week.
How can I make it run faster? I'm a beginner in Python (this is my first time coding in Python).
This is the example stock market data I use: TLKM.CSV
You can see the code here: SVRpython.py or below:
import csv
import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
import random as Rand
from pandas import DataFrame
from sklearn.model_selection import train_test_split
import pdb
import time
nstart=time.process_time()
# pdb.set_trace()
# import IPython as IP
data = pd.read_csv("TLKM.csv")
def Distancetrain(d3, d2, d1):
    d=len(d3.index)
    harray=[]
    for i in range(d):
        harray.clear()
        for j in range(d):
            harray.append(((d3.iloc[i]-d3.iloc[j])**2) + ((d2.iloc[i]-d2.iloc[j])**2) + ((d1.iloc[i]-d1.iloc[j])**2))
        if i < 1:
            distancedata=pd.DataFrame(harray)
        else:
            distancedata[i]=harray
    print("distance train")
    print(time.process_time()-nstart)
    return distancedata
def Distancetest(d3train, d2train, d1train, d3test, d2test, d1test):
    dtrain=len(d3train.index)
    dtest=len(d3test.index)
    harray=[]
    for i in range(dtrain):
        harray.clear()
        for j in range(dtest):
            harray.append(((d3test.iloc[j]-d3train.iloc[i])**2) + ((d2test.iloc[j]-d2train.iloc[i])**2) + ((d1test.iloc[j]-d1train.iloc[i])**2))
        if i < 1:
            distancedata=pd.DataFrame(harray)
        else:
            distancedata[i]=harray
    print("distance test")
    print(time.process_time()-nstart)
    return distancedata
def Hessian(dfdistance, sigma, lamda):
    d=len(dfdistance.index)
    col=len(dfdistance.columns)
    hes = np.array([], dtype=np.float64).reshape(0,col)
    tampung = [[0] * col]
    sig2= 2*(sigma**2)
    lam2=lamda**2
    for i in range(d):
        for j in range(col):
            tampung[0][j]=np.exp(-1*((dfdistance.iloc[i][j])/(sig2))) + (lam2)
        hes=np.vstack([hes, tampung])
    dfhessian=pd.DataFrame(hes)
    print("hessian")
    print(time.process_time()-nstart)
    return dfhessian
def Seqlearn(y, dfhessian, gamma, eps, c, itermaxsvr):
    d=len(dfhessian.index)
    a = [[0] * d]
    a_s = [[0] * d]
    la = [[0] * d]
    la_s = [[0] * d]
    E = np.array([], dtype=np.float64).reshape(0,d)
    Etemp = [[0] * d]
    da_s = np.array([], dtype=np.float64).reshape(0,d)
    da = np.array([], dtype=np.float64).reshape(0,d)
    dat_s = [[0] * d]
    dat = [[0] * d]
    tempas = [[0] * d]
    tempa = [[0] * d]
    for i in range(itermaxsvr):
        for j in range(d):
            Rijhelp=0
            for k in range(d):
                Rijhelp = Rijhelp + ((a_s[i][k] - a[i][k])*(dfhessian.iloc[j][k]))
            Etemp[0][j]= y.iloc[j] - Rijhelp
        E=np.vstack([E, Etemp])
        for l in range(d):
            dat_s[0][l]=min(max(gamma*(E[i][l] - eps), -1*(a_s[i][l])), (c - a_s[i][l]))
            dat[0][l]=min(max(gamma*(-(E[i][l]) - eps), -1*(a[i][l])), (c - a[i][l]))
            tempas[0][l]= a_s[i][l] + dat_s[0][l]
            tempa[0][l]= a[i][l] + dat[0][l]
        da_s=np.vstack([da_s, dat_s])
        da=np.vstack([da, dat])
        a=np.vstack([a, tempa])
        a_s=np.vstack([a_s, tempas])
        la=tempa
        la_s=tempas
        # (|da|<eps and |das|<eps ) or max iterasi
        dat_abs=max([abs(xdat) for xdat in dat[0]])
        dat_s_abs=max([abs(xdats) for xdats in dat_s[0]])
        print(dat_abs)
        print(dat_s_abs)
        if (dat_abs < eps) and (dat_s_abs < eps):
            print(time.process_time()-nstart)
            break
    print(time.process_time()-nstart)
    return la, la_s
def Predictf(a, a_s, dfhessian):
    # predict = sum ((a_s[0][k]-a[0][k]) * hessian[j][k])
    row=len(dfhessian.index)
    col=len(dfhessian.columns)
    for j in range(row):
        datax=0
        for k in range(col):
            datax= datax + ((a_s[0][k] - a[0][k])*(dfhessian.iloc[j][k]))
        if (j == 0):
            dataxm=datax
        elif (j > 0):
            dataxm=np.vstack([dataxm, datax])
    print("predict")
    print(time.process_time()-nstart)
    return dataxm
def Normalization(datain, closemax, closemin):
    dataout=(datain - closemin)/(closemax - closemin)
    return dataout
def SVRf(df, closemax, closemin, c, lamda, eps, sigma, gamma, itermaxsvr):
    result = df.assign(Day_3 = Normalization(df.Day_3, closemax, closemin), Day_2=Normalization(df.Day_2, closemax, closemin), Day_1=Normalization(df.Day_1, closemax, closemin), Actual=Normalization(df.Actual, closemax, closemin))
    X_train, X_test, y_train, y_test, d3_train, d3_test, d2_train, d2_test, d1_train, d1_test, date_train, date_test = train_test_split(result['Index'], result['Actual'], result['Day_3'], result['Day_2'], result['Day_1'], result['Date'], train_size=0.9, test_size=0.1, shuffle=False)
    distancetrain=Distancetrain(d3_train, d2_train, d1_train)
    mhessian=Hessian(distancetrain, sigma, lamda)
    a, a_s = Seqlearn(y_train, mhessian, gamma, eps, c, itermaxsvr)
    distancetest=Distancetest(d3_train, d2_train, d1_train, d3_test, d2_test, d1_test)
    testhessian=Hessian(distancetest, sigma, lamda)
    predict = Predictf(a, a_s, testhessian)
    hasilpre=pd.DataFrame()
    tgltest = date_test
    tgltest.reset_index(drop=True, inplace=True)
    hasilpre['Tanggal'] = tgltest
    hasilpre['Close'] = predict
    deresult = hasilpre.assign(Close=(hasilpre.Close * (closemax - closemin) + closemin))
    n=len(y_test)
    aktualtest = (y_test * (closemax - closemin)) + closemin
    aktualtest.reset_index(inplace=True, drop=True)
    dpredict = pd.Series(deresult['Close'], index=deresult.index)
    hasil = aktualtest - dpredict
    hasil1 = (hasil / aktualtest).abs()
    suma = hasil1.sum()
    mape = (1/n) * suma
    print("MAPE")
    print(mape)
    fitness = 1/(1+mape)
    print(fitness)
    return fitness, mape, hasilpre
Closemax=data['Close'].max()
Closemin=data['Close'].min()
print(Closemax)
print(Closemin)
day3 = data['Close'][0:((-1)-2)]
day2 = data['Close'][1:((-1)-1)]
day2.index = day2.index - 1
day1 = data['Close'][2:((-1)-0)]
day1.index = day1.index - 2
dayact = data['Close'][3:]
dayact.index = dayact.index - 3
dateact = data['Tanggal'][3:]
dateact.index = dateact.index - 3
mydata = pd.DataFrame({'Index':data['Index'][0:((-1)-2)], 'Date':dateact, 'Day_3':day3, 'Day_2':day2, 'Day_1':day1, 'Actual':dayact})
print("data proses",time.process_time()-nstart)
Lamda=0.09
C=200
Eps=0.0013
Sigma=0.11
Gamma=0.004
Itermaxsvr=1000
SVRf(mydata, Closemax, Closemin, C, Lamda, Eps, Sigma, Gamma, Itermaxsvr)
nstop=time.process_time()
print(nstop-nstart)
- Welcome to CodeReview@SE. Try and improve the title: have it tell coders not into machine learning what the code presented is to accomplish, see How do I ask a Good Question? If you can, add information about where most time is spent, i.e. a meaningful execution time profile. – greybeard, Feb 4, 2020 at 15:00
- I guess it's the Seqlearn function that makes it slow, because there's a lot of looping... – Ihsanul, Feb 4, 2020 at 15:06
- I'm not quite into pandas, numpy and machine learning. It would take me quite a long time to understand your code to, maybe, improve performance. However, I can suggest you have a look at cProfile, which once dumped into a file can be used with snakeviz. It helps to identify bottlenecks! – VincentRG, Feb 4, 2020 at 15:57
- @AlexV Yes, it's Support Vector Regression. Right, TLKM.CSV is stock market data. I predict future prices and calculate MAPE. These are the journals that I follow: Journal1 and Journal2 (the second journal is not in English, sorry). – Ihsanul, Feb 4, 2020 at 23:42
- I think it looks great... – Ihsanul, Feb 5, 2020 at 8:47
1 Answer
First and foremost: go and get yourself an IDE with an autoformatter, e.g. PyCharm, Visual Studio Code with the Python Plugin (just to name a few, there is a longer list in another post here on Code Review). This will help you to establish a consistent code style, which in turn makes it easier to read and review code. Python comes with an "official" Style Guide for Python Code (aka PEP 8) and those tools greatly help to write code that looks professional. Some aspects to take special care of:
- whitespace before and after = in assignments, e.g. distancedata = pd.DataFrame(harray)
- lower_case_with_underscore names for variables and functions, e.g. def distances_train(d3, d2, d1): ...
- writing """documentation""" for your code
Once you have that covered, I highly recommend having a look at some of the talks by Jake VanderPlas:
- Losing your Loops - Fast Numerical Computing with NumPy
- Performance Python: Seven Strategies for Optimizing Your Numerical Code
Also a highly recommended read to get going with numerical computations in Python: Python Data Science Handbook by the same person. It will take you some time to work through this, but I promise it'll be worth the effort.
A core takeaway of the material I linked to: Loops are slow in plain Python, so it's often best to avoid them as far as possible.
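As a tiny, generic illustration of that point (this snippet is not from your code, just a sketch of the idea):
import numpy as np

x = np.random.rand(100_000)

# summing squares with a plain Python loop: the interpreter pays overhead on every iteration
total = 0.0
for value in x:
    total += value * value

# the vectorized equivalent runs the loop inside numpy's compiled code and is typically orders of magnitude faster
total_vectorized = np.dot(x, x)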
I'll demonstrate that using Distancetrain:
def Distancetrain(d3, d2, d1):
    d = len(d3.index)
    harray = []
    for i in range(d):
        harray.clear()
        for j in range(d):
            harray.append(((d3.iloc[i]-d3.iloc[j])**2) + ((d2.iloc[i]-d2.iloc[j])**2) + ((d1.iloc[i]-d1.iloc[j])**2))
        if i < 1:
            distancedata = pd.DataFrame(harray)
        else:
            distancedata[i] = harray
    return distancedata
Things that make this function slow:
- nested for loops in Python
- unnecessary computations: a core principle of a distance function like the squared Euclidean distance you are using is that it is symmetric, i.e. you only have to compute either the upper or lower (triangle) half of the distance matrix
- "hand-written" distance function
- elementwise access to elements of a pandas series: pandas and numpy are optimized to apply the same operation on a lot of elements at the same time. Manually iterating over them can be costly and very slow.
- dynamically growing array(s) (see the short sketch after this list)
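To illustrate the last point with a minimal, self-contained sketch (none of these names come from your code): growing an array row by row forces numpy to copy everything built so far on each iteration, whereas collecting the rows first and stacking once does a single allocation.
import numpy as np

n = 2000

# growing the result inside the loop: every vstack copies the whole array built so far
grown = np.empty((0, n))
for _ in range(200):
    grown = np.vstack([grown, np.random.rand(1, n)])

# collecting rows in a list and stacking once avoids the repeated copies
rows = [np.random.rand(n) for _ in range(200)]
stacked = np.vstack(rows)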
So how can this be improved? Since distance computation is a very common task in all kinds of machine learning applications, there is good library support for it, namely in the scipy.spatial.distance module of scipy. You are already using numpy, pandas, and sklearn, so there is a great chance that scipy is also available to you.
Looking at the module's documentation I linked to above shows two very convenient functions: pdist and cdist. pdist is basically equivalent to what Distancetrain is supposed to do, and cdist will come in handy when thinking about improvements to Distancetest.
With this function, Distancetrain becomes very easy to implement:
from scipy.spatial.distance import pdist, squareform

def as_column(series):
    """Reshapes a pandas series to a numpy column vector"""
    return series.to_numpy(copy=False).reshape((-1, 1))

def distances_train(d3, d2, d1):
    # pdist requires the input to have a shape of (n_samples, n_dimensions)
    np_data = np.concatenate((as_column(d3), as_column(d2), as_column(d1)), axis=1)
    # squareform is used to get the full distance matrix from the half triangle I talked about earlier
    return pd.DataFrame(squareform(pdist(np_data, "sqeuclidean")))
All that reshaping, the concatenation, and the conversion back to a dataframe is basically unnecessary. I only keep them so that the output is compatible with your original code. You can use np.allclose(arr1, arr2) to see for yourself that the results are indeed identical.
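In the same spirit, cdist should let you replace Distancetest with a single call. I haven't run this against your data, so treat it as a sketch; it assumes the same (test rows, train columns) layout your original function produces:
from scipy.spatial.distance import cdist

def distances_test(d3_train, d2_train, d1_train, d3_test, d2_test, d1_test):
    # cdist computes all pairwise distances between two different sets of samples
    train = np.concatenate((as_column(d3_train), as_column(d2_train), as_column(d1_train)), axis=1)
    test = np.concatenate((as_column(d3_test), as_column(d2_test), as_column(d1_test)), axis=1)
    # rows correspond to test samples, columns to train samples, like in your Distancetest
    return pd.DataFrame(cdist(test, train, "sqeuclidean"))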
The loops that previously had to be executed by the Python interpreter are now executed in the underlying library implementation. Numerical Python libraries are usually written in C, and therefore (most of the time) much, much faster than plain Python code when it comes to loops.
An informal timing delivered the following results (average over 10 runs):
original: 15.3467 s
new: 0.0031 s
That's almost 5000x faster!
You can rewrite other parts of your code in the same fashion. It just takes some time to get used to thinking about the problem in terms of larger array and matrix operations. A tried-and-tested approach to get there is to rewrite parts of your code while keeping the old code around to check whether the results match. Sometimes they don't, but that shouldn't discourage you from looking further into it. More often than not, the rewritten version is easier to get right because it avoids convoluted loops and the like.
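For example, the nested loops in Hessian boil down to applying the RBF kernel to the whole distance matrix at once. A sketch, assuming the same inputs as your function (again, verify the results against the original):
def hessian(dfdistance, sigma, lamda):
    # exp(-d / (2*sigma^2)) + lamda^2, applied elementwise to the full matrix in one go
    kernel = np.exp(-dfdistance.to_numpy() / (2 * sigma**2)) + lamda**2
    return pd.DataFrame(kernel)
The innermost k loop in Seqlearn looks like it is just a matrix-vector product as well, i.e. something along the lines of dfhessian.to_numpy() @ (current_a_s - current_a), which should remove the biggest bottleneck mentioned in the comments.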
Maybe also have a look at numba, a just-in-time compiler for Python code. This can sometimes speed up loops significantly (see here for example). numba does not fully support everything you can do in Python or numpy, so the implementation might need some tweaking to work correctly with it.
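A minimal sketch of what a numba-compiled loop can look like (the function and names here are purely illustrative, not part of your code):
import numpy as np
from numba import njit

@njit
def pairwise_sq_distances(points):
    # plain nested loops, but compiled to machine code by numba on the first call
    n = points.shape[0]
    out = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(points.shape[1]):
                d = points[i, k] - points[j, k]
                acc += d * d
            out[i, j] = acc
    return out
The first call pays the compilation cost; subsequent calls with arrays of the same dtype run at close to C speed.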
Of course profiling is also key in that process and has been mentioned in the comments. Python's built-in cProfile module is very useful for that purpose. timeit can also be used to robustly measure the execution time of smaller pieces of code.
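For instance, a single run could be profiled like this (the output file name is arbitrary):
import cProfile
import pstats

cProfile.run(
    "SVRf(mydata, Closemax, Closemin, C, Lamda, Eps, Sigma, Gamma, Itermaxsvr)",
    "svr_profile",
)
pstats.Stats("svr_profile").sort_stats("cumulative").print_stats(20)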
- Great, I get it. I also read about Cython last night and want to try it too. – Ihsanul, Feb 5, 2020 at 23:21
- Maybe have a look at numba too, before going all the way to cython. I added a short paragraph on that to the answer. – AlexV, Feb 6, 2020 at 10:30