Suppose I have a single time-dependent variable $y_{t}$ (e.g. a stock price) and a few hundred independent variables $X_{it}$ with data available over the same time frame as $y_{t}$ (e.g. company revenue, total market sales, interest rates, the lagged value $y_{t-1}$, etc.). I want to identify a model for forecasting with this data.
How do I know which of the independent variables to include?
What is the problem with including all the variables?
My superficial ideas for identification are to compare AIC/BIC/$R^2$ across every single combination of variables in a simple ARIMA model and build on that (which would mean thousands of model calculations), or to run a Granger causality test for every $y_t / X_{it}$ pair. There must be an easier way, surely?
Comment (whuber, Nov 14, 2023 at 19:17): Re "every single combination": there are $2^{\text{several hundred}}$ such combinations!
1 Answer
There has been work on selecting variables using the lasso and similar methods in a time series context; see here for some pointers to literature.
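As a rough illustration of the lasso idea, here is a minimal sketch in plain NumPy. Everything in it is an assumption made for the example: the data are simulated (only columns 0 and 3 truly drive $y_t$), the helper `lasso_cd` is a hand-written coordinate-descent solver rather than a library routine, and the penalty `alpha` is picked on a time-ordered holdout instead of a shuffled one, so the model is never validated on data that precedes its training window.

```python
import numpy as np

def lasso_cd(X, y, alpha, n_iter=100):
    """Coordinate-descent lasso: min (1/(2n))||y - Xb||^2 + alpha*||b||_1.
    Assumes the columns of X are standardized and y is centered."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.copy()                      # resid = y - X @ beta
    for _ in range(n_iter):
        for j in range(p):
            # correlation of feature j with the partial residual
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new = np.sign(rho) * max(abs(rho) - alpha, 0.0) / col_sq[j]
            resid += X[:, j] * (beta[j] - new)
            beta[j] = new
    return beta

# Simulated data (an assumption): y_t depends on lagged columns 0 and 3 only.
rng = np.random.default_rng(0)
T, p = 300, 50
X = rng.standard_normal((T, p))
y = np.zeros(T)
y[1:] = 2.0 * X[:-1, 0] - 1.5 * X[:-1, 3] + 0.5 * rng.standard_normal(T - 1)

# Predict y_t from X_{t-1}; split in time order, never at random.
X_lag, target = X[:-1], y[1:]
split = 200
mu, sd = X_lag[:split].mean(axis=0), X_lag[:split].std(axis=0)
X_tr, X_va = (X_lag[:split] - mu) / sd, (X_lag[split:] - mu) / sd
y_bar = target[:split].mean()
y_tr, y_va = target[:split] - y_bar, target[split:]

# Pick the penalty with the lowest out-of-sample squared error.
best_mse, best_beta = np.inf, None
for alpha in np.geomspace(0.01, 0.5, 15):
    beta = lasso_cd(X_tr, y_tr, alpha)
    mse = np.mean((y_va - (X_va @ beta + y_bar)) ** 2)
    if mse < best_mse:
        best_mse, best_beta = mse, beta

selected = np.flatnonzero(best_beta)      # indices with nonzero coefficients
print("selected columns:", selected)
```

The variables with nonzero coefficients are the ones the lasso keeps. With real data one would also add lags of $y_t$ and of the predictors as extra columns, and probably use a rolling or expanding validation window rather than a single split.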
Alternatively, you could consider PCA to compress the candidate predictors into a small number of components before modeling.
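A minimal sketch of the PCA route, again on simulated data and with NumPy only. The three-factor data-generating process, the 90% explained-variance cutoff, and all variable names are assumptions chosen for illustration, not recommendations.

```python
import numpy as np

# Simulated predictors (an assumption): 40 series driven by 3 latent factors.
rng = np.random.default_rng(1)
T, p, n_factors = 300, 40, 3
factors = rng.standard_normal((T, n_factors))
loadings = rng.standard_normal((n_factors, p))
X = factors @ loadings + 0.3 * rng.standard_normal((T, p))

# Standardize so that no series dominates purely because of its scale.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via the singular value decomposition of the standardized data.
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
explained = s ** 2 / np.sum(s ** 2)       # variance share per component

# Keep the smallest number of components reaching 90% cumulative variance.
k = int(np.searchsorted(np.cumsum(explained), 0.90) + 1)
scores = Z @ Vt[:k].T                     # T x k matrix of component scores

print("components kept:", k, "scores shape:", scores.shape)
```

The columns of `scores` are mutually uncorrelated and can replace the original few hundred regressors in a forecasting model. Two caveats: PCA chooses components without looking at $y_t$, so high-variance components are not necessarily the predictive ones, and in a real forecasting exercise the standardization and the loadings should be estimated on the training window only.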