I coded this Support Vector Regression (SVR) myself following some equations in a journal (see here, or here (not in English)). The loss function used by the journal and the code below is mean absolute percentage error (MAPE).
I need to make it run faster because I will call this function 1600 times during evaluation. With the current code, a single run would take a couple of days or even a week.
How can I make it run faster? I'm a beginner in Python (this is my first time coding in Python).
This is the example stock market data I use: TLKM.CSV
You can see the code here: SVRpython.py or below:
import csv
import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
import random as Rand
from pandas import DataFrame
from sklearn.model_selection import train_test_split
import pdb
import time
nstart=time.process_time()
# pdb.set_trace()
# import IPython as IP
data = pd.read_csv("TLKM.csv")
def Distancetrain(d3, d2, d1):
    d=len(d3.index)
    harray=[]
    for i in range(d):
        harray.clear()
        for j in range(d):
            harray.append(((d3.iloc[i]-d3.iloc[j])**2) + ((d2.iloc[i]-d2.iloc[j])**2) + ((d1.iloc[i]-d1.iloc[j])**2))
        if i < 1:
            distancedata=pd.DataFrame(harray)
        else:
            distancedata[i]=harray
    print("distance train")
    print(time.process_time()-nstart)
    return distancedata
def Distancetest(d3train, d2train, d1train, d3test, d2test, d1test):
    dtrain=len(d3train.index)
    dtest=len(d3test.index)
    harray=[]
    for i in range(dtrain):
        harray.clear()
        for j in range(dtest):
            harray.append(((d3test.iloc[j]-d3train.iloc[i])**2) + ((d2test.iloc[j]-d2train.iloc[i])**2) + ((d1test.iloc[j]-d1train.iloc[i])**2))
        if i < 1:
            distancedata=pd.DataFrame(harray)
        else:
            distancedata[i]=harray
    print("distance test")
    print(time.process_time()-nstart)
    return distancedata
def Hessian(dfdistance, sigma, lamda):
    d=len(dfdistance.index)
    col=len(dfdistance.columns)
    hes = np.array([], dtype=np.float64).reshape(0,col)
    tampung = [[0] * col]
    sig2= 2*(sigma**2)
    lam2=lamda**2
    for i in range(d):
        for j in range(col):
            tampung[0][j]=np.exp(-1*((dfdistance.iloc[i][j])/(sig2))) + (lam2)
        hes=np.vstack([hes, tampung])
    dfhessian=pd.DataFrame(hes)
    print("hessian")
    print(time.process_time()-nstart)
    return dfhessian
def Seqlearn(y, dfhessian, gamma, eps, c, itermaxsvr):
    d=len(dfhessian.index)
    a = [[0] * d]
    a_s = [[0] * d]
    la = [[0] * d]
    la_s = [[0] * d]
    E = np.array([], dtype=np.float64).reshape(0,d)
    Etemp = [[0] * d]
    da_s = np.array([], dtype=np.float64).reshape(0,d)
    da = np.array([], dtype=np.float64).reshape(0,d)
    dat_s = [[0] * d]
    dat = [[0] * d]
    tempas = [[0] * d]
    tempa = [[0] * d]
    for i in range(itermaxsvr):
        for j in range(d):
            Rijhelp=0
            for k in range(d):
                Rijhelp = Rijhelp + ((a_s[i][k] - a[i][k])*(dfhessian.iloc[j][k]))
            Etemp[0][j]= y.iloc[j] - Rijhelp
        E=np.vstack([E, Etemp])
        for l in range(d):
            dat_s[0][l]=min(max(gamma*(E[i][l] - eps), -1*(a_s[i][l])), (c - a_s[i][l]))
            dat[0][l]=min(max(gamma*(-(E[i][l]) - eps), -1*(a[i][l])), (c - a[i][l]))
            tempas[0][l]= a_s[i][l] + dat_s[0][l]
            tempa[0][l]= a[i][l] + dat[0][l]
        da_s=np.vstack([da_s, dat_s])
        da=np.vstack([da, dat])
        a=np.vstack([a, tempa])
        a_s=np.vstack([a_s, tempas])
        la=tempa
        la_s=tempas
        # (|da|<eps and |das|<eps ) or max iterasi
        dat_abs=max([abs(xdat) for xdat in dat[0]])
        dat_s_abs=max([abs(xdats) for xdats in dat_s[0]])
        print(dat_abs)
        print(dat_s_abs)
        if (dat_abs < eps) and (dat_s_abs < eps):
            print(time.process_time()-nstart)
            break
    print(time.process_time()-nstart)
    return la, la_s
def Predictf(a, a_s, dfhessian):
    # predict = sum ((a_s[0][k]-a[0][k]) * hessian[j][k])
    row=len(dfhessian.index)
    col=len(dfhessian.columns)
    for j in range(row):
        datax=0
        for k in range(col):
            datax= datax + ((a_s[0][k] - a[0][k])*(dfhessian.iloc[j][k]))
        if (j == 0):
            dataxm=datax
        elif (j > 0):
            dataxm=np.vstack([dataxm, datax])
    print("predict")
    print(time.process_time()-nstart)
    return dataxm
def Normalization(datain, closemax, closemin):
    dataout=(datain - closemin)/(closemax - closemin)
    return dataout
def SVRf(df, closemax, closemin, c, lamda, eps, sigma, gamma, itermaxsvr):
    result = df.assign(Day_3 = Normalization(df.Day_3, closemax, closemin), Day_2=Normalization(df.Day_2, closemax, closemin), Day_1=Normalization(df.Day_1, closemax, closemin), Actual=Normalization(df.Actual, closemax, closemin))
    X_train, X_test, y_train, y_test, d3_train, d3_test, d2_train, d2_test, d1_train, d1_test, date_train, date_test = train_test_split(result['Index'], result['Actual'], result['Day_3'], result['Day_2'], result['Day_1'], result['Date'], train_size=0.9, test_size=0.1, shuffle=False)
    distancetrain=Distancetrain(d3_train, d2_train, d1_train)
    mhessian=Hessian(distancetrain, sigma, lamda)
    a, a_s = Seqlearn(y_train, mhessian, gamma, eps, c, itermaxsvr)
    distancetest=Distancetest(d3_train, d2_train, d1_train, d3_test, d2_test, d1_test)
    testhessian=Hessian(distancetest, sigma, lamda)
    predict = Predictf(a, a_s, testhessian)
    hasilpre=pd.DataFrame()
    tgltest = date_test
    tgltest.reset_index(drop=True, inplace=True)
    hasilpre['Tanggal'] = tgltest
    hasilpre['Close'] = predict
    deresult = hasilpre.assign(Close=(hasilpre.Close * (closemax - closemin) + closemin))
    n=len(y_test)
    aktualtest = (y_test * (closemax - closemin)) + closemin
    aktualtest.reset_index(inplace=True, drop=True)
    dpredict = pd.Series(deresult['Close'], index=deresult.index)
    hasil = aktualtest - dpredict
    hasil1 = (hasil / aktualtest).abs()
    suma = hasil1.sum()
    mape = (1/n) * suma
    print("MAPE")
    print(mape)
    fitness = 1/(1+mape)
    print(fitness)
    return fitness, mape, hasilpre
Closemax=data['Close'].max()
Closemin=data['Close'].min()
print(Closemax)
print(Closemin)
day3 = data['Close'][0:((-1)-2)]
day2 = data['Close'][1:((-1)-1)]
day2.index = day2.index - 1
day1 = data['Close'][2:((-1)-0)]
day1.index = day1.index - 2
dayact = data['Close'][3:]
dayact.index = dayact.index - 3
dateact = data['Tanggal'][3:]
dateact.index = dateact.index - 3
mydata = pd.DataFrame({'Index':data['Index'][0:((-1)-2)], 'Date':dateact, 'Day_3':day3, 'Day_2':day2, 'Day_1':day1, 'Actual':dayact})
print("data proses",time.process_time()-nstart)
Lamda=0.09
C=200
Eps=0.0013
Sigma=0.11
Gamma=0.004
Itermaxsvr=1000
SVRf(mydata, Closemax, Closemin, C, Lamda, Eps, Sigma, Gamma, Itermaxsvr)
nstop=time.process_time()
print(nstop-nstart)
- Welcome to CodeReview@SE. Try and improve the title: have it tell coders not into machine learning what the code presented is to accomplish, see How do I ask a Good Question? If you can, add information about where most time is spent, i.e. a meaningful execution time profile. – greybeard, Feb 4, 2020 at 15:00
- I guess it's the Seqlearn function that makes it slow, because there's a lot of looping... – Ihsanul, Feb 4, 2020 at 15:06
- I'm not quite into pandas, numpy and machine learning. It would take me quite a long time to understand your code to, maybe, improve performance. However, I can suggest you have a look at cProfile, which once dumped into a file can be used with snakeviz. It helps to identify bottlenecks! – VincentRG, Feb 4, 2020 at 15:57
- @AlexV Yes, it's Support Vector Regression. Right, TLKM.CSV is stock market data. I predict future prices and calculate MAPE. These are the journals that I follow: Journal1 and Journal2 (the second journal is not in English, sorry). – Ihsanul, Feb 4, 2020 at 23:42
- I think it looks great... – Ihsanul, Feb 5, 2020 at 8:47
1 Answer
First and foremost: go and get yourself an IDE with an autoformatter, e.g. PyCharm, Visual Studio Code with the Python Plugin (just to name a few, there is a longer list in another post here on Code Review). This will help you to establish a consistent code style, which in turn makes it easier to read and review code. Python comes with an "official" Style Guide for Python Code (aka PEP 8) and those tools greatly help to write code that looks professional. Some aspects to take special care of:
- whitespace before and after = in assignments, e.g. distancedata = pd.DataFrame(harray)
- lower_case_with_underscore names for variables and functions, e.g. def distances_train(d3, d2, d1): ...
- writing """documentation""" for your code
Once you have that covered, I highly recommend having a look at some of the talks by Jake VanderPlas:
- Losing your Loops - Fast Numerical Computing with NumPy
- Performance Python: Seven Strategies for Optimizing Your Numerical Code
Also a highly recommended read to get going with numerical computations in Python: Python Data Science Handbook by the same person. It will take you some time to work through this, but I promise it'll be worth the effort.
A core takeaway of the material I linked to: Loops are slow in plain Python, so it's often best to avoid them as far as possible.
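As a tiny, generic illustration of that point (this snippet is not from your code, just a sketch of the idea):
import numpy as np

x = np.random.rand(100_000)

# summing squares with a plain Python loop: the interpreter pays overhead on every iteration
total = 0.0
for value in x:
    total += value * value

# the vectorized equivalent runs the loop inside numpy's compiled code and is typically orders of magnitude faster
total_vectorized = np.dot(x, x)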
I'll demonstrate that using Distancetrain:
def Distancetrain(d3, d2, d1):
    d = len(d3.index)
    harray = []
    for i in range(d):
        harray.clear()
        for j in range(d):
            harray.append(((d3.iloc[i]-d3.iloc[j])**2) + ((d2.iloc[i]-d2.iloc[j])**2) + ((d1.iloc[i]-d1.iloc[j])**2))
        if i < 1:
            distancedata = pd.DataFrame(harray)
        else:
            distancedata[i] = harray
    return distancedata
Things that make this function slow:
- nested for loops in Python
- unnecessary computations: a core principle of a distance function like the squared Euclidean distance you are using is that it is symmetric, i.e. you only have to compute either the upper or lower (triangle) half of the distance matrix
- "hand-written" distance function
- elementwise access to elements of a pandas series: pandas and numpy are optimized to apply the same operation on a lot of elements at the same time. Manually iterating over them can be costly and very slow.
- dynamically growing array(s) (see the short sketch after this list)
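To illustrate the last point with a minimal, self-contained sketch (none of these names come from your code): growing an array row by row forces numpy to copy everything built so far on each iteration, whereas collecting the rows first and stacking once does a single allocation.
import numpy as np

n = 2000

# growing the result inside the loop: every vstack copies the whole array built so far
grown = np.empty((0, n))
for _ in range(200):
    grown = np.vstack([grown, np.random.rand(1, n)])

# collecting rows in a list and stacking once avoids the repeated copies
rows = [np.random.rand(n) for _ in range(200)]
stacked = np.vstack(rows)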
So how can this be improved? Since distance computation is a very common task in all kinds of machine learning applications, there is good library support for it, namely in the scipy.spatial.distance module of scipy. You are already using numpy, pandas, and sklearn, so there is a great chance that scipy is also available to you.
Looking at the module's documentation I linked to above shows two very convenient functions: pdist and cdist. pdist is basically equivalent to what Distancetrain is supposed to do, and cdist will come in handy when thinking about improvements to Distancetest.
With this function, Distancetrain becomes very easy to implement:
from scipy.spatial.distance import pdist, squareform

def as_column(series):
    """Reshapes a pandas series to a numpy column vector"""
    return series.to_numpy(copy=False).reshape((-1, 1))

def distances_train(d3, d2, d1):
    # pdist requires the input to have a shape of (n_samples, n_dimensions)
    np_data = np.concatenate((as_column(d3), as_column(d2), as_column(d1)), axis=1)
    # squareform is used to get the full distance matrix from the half triangle I talked about earlier
    return pd.DataFrame(squareform(pdist(np_data, "sqeuclidean")))
All that reshaping, the concatenation, and the conversion back to a dataframe is basically unnecessary. I only keep them so that the output is compatible with your original code. You can use np.allclose(arr1, arr2) to see for yourself that the results are indeed identical.
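In the same spirit, cdist should let you replace Distancetest with a single call. I haven't run this against your data, so treat it as a sketch; it assumes the same (test rows, train columns) layout your original function produces:
from scipy.spatial.distance import cdist

def distances_test(d3_train, d2_train, d1_train, d3_test, d2_test, d1_test):
    # cdist computes all pairwise distances between two different sets of samples
    train = np.concatenate((as_column(d3_train), as_column(d2_train), as_column(d1_train)), axis=1)
    test = np.concatenate((as_column(d3_test), as_column(d2_test), as_column(d1_test)), axis=1)
    # rows correspond to test samples, columns to train samples, like in your Distancetest
    return pd.DataFrame(cdist(test, train, "sqeuclidean"))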
The loops that previously had to be executed by the Python interpreter are now executed in the underlying library implementation. Numerical Python libraries are usually written in C, and therefore (most of the time) much, much faster than plain Python code when it comes to loops.
An informal timing delivered the following results (average over 10 runs):
original: 15.3467 s
new: 0.0031 s
That's almost 5000x faster!
You can rewrite other parts of your code in the same fashion. It just takes some time to get used to thinking about the problem in terms of larger array and matrix operations. A tried-and-tested approach to get there is to rewrite parts of your code while keeping the old code around to check whether the results match. Sometimes they don't, but that shouldn't discourage you from looking further into it. More often than not, the rewritten version is easier to get right because it avoids convoluted loops and the like.
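For example, the nested loops in Hessian boil down to applying the RBF kernel to the whole distance matrix at once. A sketch, assuming the same inputs as your function (again, verify the results against the original):
def hessian(dfdistance, sigma, lamda):
    # exp(-d / (2*sigma^2)) + lamda^2, applied elementwise to the full matrix in one go
    kernel = np.exp(-dfdistance.to_numpy() / (2 * sigma**2)) + lamda**2
    return pd.DataFrame(kernel)
The innermost k loop in Seqlearn looks like it is just a matrix-vector product as well, i.e. something along the lines of dfhessian.to_numpy() @ (current_a_s - current_a), which should remove the biggest bottleneck mentioned in the comments.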
Maybe also have a look at numba, a just-in-time compiler for Python code. This can sometimes speed up loops significantly (see here for example). numba does not fully support everything you can do in Python or numpy, so the implementation might need some tweaking to work correctly with it.
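A minimal sketch of what a numba-compiled loop can look like (the function and names here are purely illustrative, not part of your code):
import numpy as np
from numba import njit

@njit
def pairwise_sq_distances(points):
    # plain nested loops, but compiled to machine code by numba on the first call
    n = points.shape[0]
    out = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(points.shape[1]):
                d = points[i, k] - points[j, k]
                acc += d * d
            out[i, j] = acc
    return out
The first call pays the compilation cost; subsequent calls with arrays of the same dtype run at close to C speed.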
Of course profiling is also key in that process and has been mentioned in the comments. Python's built-in cProfile module is very useful for that purpose. timeit can also be used to robustly measure the execution time of smaller pieces of code.
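For instance, a single run could be profiled like this (the output file name is arbitrary):
import cProfile
import pstats

cProfile.run(
    "SVRf(mydata, Closemax, Closemin, C, Lamda, Eps, Sigma, Gamma, Itermaxsvr)",
    "svr_profile",
)
pstats.Stats("svr_profile").sort_stats("cumulative").print_stats(20)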
- Great, I get it. I also read about Cython last night and want to try it too. – Ihsanul, Feb 5, 2020 at 23:21
- Maybe have a look at numba too, before going all the way to cython. I added a short paragraph on that to the answer. – AlexV, Feb 6, 2020 at 10:30