\$\begingroup\$

I'm looking for a way to run the following code much faster, ideally with something like dataframe.apply(func), which as far as I know is the fastest option short of iterating over rows/columns (and even there I already see a 3x slowdown). The problem is twofold: how to set this up, and how to save results elsewhere along the way (an embedded function might do that). I know the pandas function for rolling window regression is already optimized to its limit, but I was wondering how to get rid of the loop and any other \$O(N^k)\$ costs I might have missed.

Any help is greatly appreciated.

import pandas as pd
import numpy as np
periods = 1000
alt_pan_fondi_prices = pd.DataFrame(np.random.randn(periods, 4), index=pd.date_range('2011-1-1', periods=periods), columns=list('ABCD'))
indu = pd.DataFrame(np.random.randn(periods, 4), index=pd.date_range('2011-1-1', periods=periods), columns=list('ABCD'))
indu.columns = list('ABCD')
# some names to be used later
cols = ['fund'] + [("bench_" + str(i)) for i in list('ABCD')]
for item in alt_pan_fondi_prices.columns.values:
    to_infer = alt_pan_fondi_prices[item].dropna()
    indu = indu.loc[to_infer.index[0]:, :].dropna()
    dfBothPrices = pd.concat([to_infer, indu], axis=1)
    dfBothPrices = dfBothPrices.fillna(method='bfill')
    dfBothReturns = dfBothPrices.pct_change()
    dfBothReturns.columns = cols
    mask = cols[1:]
    # execute the OLS model
    model = pd.ols(y=dfBothReturns['fund'], x=dfBothReturns[mask], window=20)
    # I then need to store a whole bunch of stuff (alphas / betas / rsquared / etc) but I have this part safely taken care of
Phrancis
asked Apr 26, 2016 at 16:21
\$\endgroup\$

1 Answer

\$\begingroup\$

Archaeology

ols isn't in the current version of Pandas, but it is in (at least) 0.10.

In version v0.19.0-415-g542c9166a6, we see

    warnings.warn("The pandas.stats.ols module is deprecated and will be "
                  "removed in a future version. We refer to external packages "
                  "like statsmodels, see some examples here: "
                  "http://www.statsmodels.org/stable/regression.html",
                  FutureWarning, stacklevel=4)

It survived until Thu Feb 9 11:42:15 2017.

Calling this function created a generic linear model object. The only curious parameter is window, which computes a "moving window regression". So far as I can tell, this is not covered in Pandas or scipy.stats, and certainly not in Numpy; but it is in statsmodels. A contemporary implementation should probably use that.

Performance

a much faster fashion (ideally something like dataframe.apply(func) which has the fastest speed

No. apply is essentially a loop, with optional numba that is unlikely to help in your case. For the scale you demonstrated - 1000 rows with an outer loop across four columns - individual calls to a rolling regressor are fine.

If you wanted to get really tricky, you could use Numpy, sliding_window_view and one call to lstsq to regress across the outer column dimension, but I consider this premature optimisation and unlikely to be worth it until 1M+ rows.
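For the curious, one way to read that idea is sketched below. It assumes no NaNs and that every fund is regressed on the same benchmark matrix; the helper name rolling_lstsq and the random inputs are hypothetical. sliding_window_view builds the windows without copying, and each lstsq call solves for all fund columns at once because lstsq accepts a two-dimensional right-hand side. There is still a Python loop over windows, since np.linalg.lstsq does not batch over a leading axis.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def rolling_lstsq(y, x, window):
    """Rolling OLS of every column of y on the columns of x plus an intercept.

    y: (n, k) array of dependent series (e.g. fund returns)
    x: (n, p) array of regressors (e.g. benchmark returns)
    Returns coefficients of shape (n - window + 1, p + 1, k);
    index 0 along axis 1 is the intercept.
    """
    n = x.shape[0]
    design = np.column_stack([np.ones(n), x])  # (n, p + 1)
    # sliding_window_view puts the window axis last, so move it to the middle:
    # (n_windows, window, columns)
    x_win = sliding_window_view(design, window, axis=0).transpose(0, 2, 1)
    y_win = sliding_window_view(y, window, axis=0).transpose(0, 2, 1)
    coefs = np.empty((x_win.shape[0], design.shape[1], y.shape[1]))
    for i in range(x_win.shape[0]):
        # One call solves the window for all k dependent columns together.
        coefs[i] = np.linalg.lstsq(x_win[i], y_win[i], rcond=None)[0]
    return coefs


rng = np.random.default_rng(0)
funds = rng.standard_normal((1000, 4))
benchmarks = rng.standard_normal((1000, 4))
coefs = rolling_lstsq(funds, benchmarks, window=20)
print(coefs.shape)  # (981, 5, 4)
```

This removes the outer loop over columns but not the loop over windows, which is part of why it is unlikely to pay off at this scale.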

answered Nov 8, 2024 at 13:54
\$\endgroup\$
