I want to find a way to run the following code in a much faster fashion (ideally something like dataframe.apply(func) which has the fastest speed; iterating over rows/columns comes next, and there is already a 3x slowdown there). The problem is twofold: how to set this up, and how to save results in other places (an embedded function might do that). I know the pandas function for rolling window regression is already optimized to its limit, but I was wondering how to get rid of the loop and any other \$O(N^k)\$ operations I might have missed.
Any help is greatly appreciated.
import pandas as pd
import numpy as np
periods = 1000
alt_pan_fondi_prices = pd.DataFrame(np.random.randn(periods, 4), index=pd.date_range('2011-1-1', periods=periods), columns=list('ABCD'))
indu = pd.DataFrame(np.random.randn(periods, 4), index=pd.date_range('2011-1-1', periods=periods), columns=list('ABCD'))
indu.columns = list('ABCD')
# some names to be used later
cols = ['fund'] + [("bench_" + str(i)) for i in list('ABCD')]
for item in alt_pan_fondi_prices.columns.values:
    to_infer = alt_pan_fondi_prices[item].dropna()
    indu = indu.loc[to_infer.index[0]:, :].dropna()
    dfBothPrices = pd.concat([to_infer, indu], axis=1)
    dfBothPrices = dfBothPrices.fillna(method='bfill')
    dfBothReturns = dfBothPrices.pct_change()
    dfBothReturns.columns = cols
    mask = cols[1:]
    # execute the OLS model
    model = pd.ols(y=dfBothReturns['fund'], x=dfBothReturns[mask], window=20)
    # I then need to store a whole bunch of stuff (alphas / betas / rsquared / etc) but I have this part safely taken care of
1 Answer
Archaeology
`ols` isn't in the current version of Pandas, but it is in (at least) 0.10.
In version v0.19.0-415-g542c9166a6, we see
warnings.warn("The pandas.stats.ols module is deprecated and will be "
              "removed in a future version. We refer to external packages "
              "like statsmodels, see some examples here: "
              "http://www.statsmodels.org/stable/regression.html",
              FutureWarning, stacklevel=4)
It survived until Thu Feb 9 11:42:15 2017.
This function created a generic linear model object. The only curious parameter in its signature is `window`, which computes a "moving window regression". So far as I can tell, this is not covered in Pandas or scipy.stats, and certainly not in Numpy; but it is in statsmodels. A contemporary implementation should probably use that.
Performance
> a much faster fashion (ideally something like dataframe.apply(func) which has the fastest speed

No. `apply` is essentially a loop, with optional numba acceleration that is unlikely to help in your case. For the scale you demonstrated (1000 rows with an outer loop across four columns), individual calls to a rolling regressor are fine.
If you wanted to get really tricky, you could use Numpy, sliding_window_view and one call to lstsq to regress across the outer column dimension, but I consider this premature optimisation and unlikely to be worth it until 1M+ rows.
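For the curious, the NumPy idea could be sketched as follows. This assumes all funds share the same benchmark regressors and that there are no missing values, which is true of the synthetic data here but not necessarily of real price series. Since `np.linalg.lstsq` accepts a single 2-D design matrix, this sketch still loops over windows; what it removes is the loop over fund columns, because each call fits every fund at once through a multi-column right-hand side.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
periods, window = 1000, 20

funds = rng.standard_normal((periods, 4))  # dependent variables, one column per fund
bench = rng.standard_normal((periods, 4))  # regressors shared by all funds (assumption)

# Design matrix with an intercept column: shape (periods, 5).
X = np.column_stack([np.ones(periods), bench])

# sliding_window_view puts the window axis last, giving (n_windows, n_cols, window);
# transpose to (n_windows, window, n_cols) so each slice is a regular design matrix.
X_win = sliding_window_view(X, window, axis=0).transpose(0, 2, 1)
Y_win = sliding_window_view(funds, window, axis=0).transpose(0, 2, 1)

n_windows = periods - window + 1
coefs = np.empty((n_windows, X.shape[1], funds.shape[1]))
for i in range(n_windows):
    # One lstsq call per window regresses all fund columns simultaneously.
    coefs[i], *_ = np.linalg.lstsq(X_win[i], Y_win[i], rcond=None)

# coefs[i, 0, j] is the intercept and coefs[i, 1:, j] the betas
# for fund j over the window ending at row i + window - 1.
```

The windows are views onto the original arrays, so building them costs no extra memory; only the coefficient array is allocated.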