Rolling OLS algorithm in a dataframe

Question 1

I want to be able to find a solution to run the following code in a much faster fashion (ideally something like dataframe.apply(func) which has the fastest speed, just behind iterating rows/cols, and even there, there is already a 3x speed decrease). The problem is twofold: how to set this up, and how to save the results elsewhere (an embedded function might do that). I know the pandas function for rolling window regression is already optimized to its limit, but I was wondering how to get rid of the loop and any other $O(N^k)$ steps I might have missed.

Any help is greatly appreciated.

import pandas as pd
import numpy as np

periods = 1000
alt_pan_fondi_prices = pd.DataFrame(np.random.randn(periods, 4), index=pd.date_range('2011-1-1', periods=periods), columns=list('ABCD'))
indu = pd.DataFrame(np.random.randn(periods, 4), index=pd.date_range('2011-1-1', periods=periods), columns=list('ABCD'))
# some names to be used later
cols = ['fund'] + [("bench_" + str(i)) for i in list('ABCD')]
for item in alt_pan_fondi_prices.columns.values:
    to_infer = alt_pan_fondi_prices[item].dropna()
    indu = indu.loc[to_infer.index[0]:, :].dropna()
    dfBothPrices = pd.concat([to_infer, indu], axis=1)
    dfBothPrices = dfBothPrices.fillna(method='bfill')
    dfBothReturns = dfBothPrices.pct_change()
    dfBothReturns.columns = cols
    mask = cols[1:]
    # execute the OLS model
    model = pd.ols(y=dfBothReturns['fund'], x=dfBothReturns[mask], window=20)
    # I then need to store a whole bunch of stuff (alphas / betas / rsquared / etc) but I have this part safely taken care of

Question 2

Archaeology

ols isn't in the current version of Pandas, but it is in (at least) 0.10.

In version v0.19.0-415-g542c9166a6, we see

 warnings.warn("The pandas.stats.ols module is deprecated and will be "
 "removed in a future version. We refer to external packages "
 "like statsmodels, see some examples here: "
 "http://www.statsmodels.org/stable/regression.html",
 FutureWarning, stacklevel=4)

It survived until Thu Feb 9 11:42:15 2017.

Calling this function created a generic linear model object. The only curious parameter is window, which computes a "moving window regression". So far as I can tell, this is not covered in Pandas or scipy.stats, and certainly not in Numpy, but it is in statsmodels. A contemporary implementation should probably use that.

Performance

"a much faster fashion (ideally something like dataframe.apply(func) which has the fastest speed [...]"

No. apply is essentially a loop, with optional numba that is unlikely to help in your case. For the scale you demonstrated - 1000 rows with an outer loop across four columns - individual calls to a rolled regressor are fine.

If you wanted to get really tricky, you could use Numpy, sliding_window_view and one call to lstsq to regress across the outer column dimension, but I consider this premature optimisation and unlikely to be worth it until 1M+ rows.
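For the curious, here is a minimal sketch of that vectorised idea. As far as I know np.linalg.lstsq only accepts a single 2-D design matrix, so this version swaps in the normal equations with a batched np.linalg.solve to cover all windows at once; that is a substitute technique, slightly less numerically robust than lstsq. The names (y, x, Xw, coefs) and the random data are assumptions for illustration only.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
periods, window = 1000, 20
y = rng.standard_normal((periods, 4))   # hypothetical fund returns, one column per fund
x = rng.standard_normal((periods, 4))   # hypothetical benchmark returns

# Design matrix with an intercept column: shape (periods, 5).
X = np.column_stack([np.ones(periods), x])

# sliding_window_view puts the window axis last: (n_windows, columns, window).
Xw = sliding_window_view(X, window, axis=0)   # (981, 5, 20)
yw = sliding_window_view(y, window, axis=0)   # (981, 4, 20)

# Per-window normal equations: (X'X) b = X'y, for all four funds at once.
XtX = np.einsum("nit,njt->nij", Xw, Xw)   # (981, 5, 5)
Xty = np.einsum("nit,nft->nif", Xw, yw)   # (981, 5, 4)

# Batched solve over the leading window axis.
coefs = np.linalg.solve(XtX, Xty)         # (981, 5, 4): window, coefficient, fund
```

Here coefs[i, 0, f] is the intercept and coefs[i, 1:, f] the betas for fund f over rows i to i + window - 1. The windowed views share memory with the inputs, but the einsum outputs do not, so memory grows with the number of windows.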

Reinderien · Answer 1 · 2024-11-08 13:54:11Z
