Consider a generic (univariate) time series model:
$$ \hat{y}(x_n, \ldots, x_{n+h-1} \mid x_0, \ldots, x_{n-1}; \boldsymbol{\theta}) $$
Where: $$ \hat{y}(x_n, \ldots, x_{n+h-1} \mid \cdot) \text{ is the forecast for the next } h \text{ time points} $$
$$ \boldsymbol{\theta} = (\theta_1, \theta_2, \ldots, \theta_k) \text{ is the vector of model parameters} $$
$$ x_0, x_1, \ldots, x_{n-1} \text{ are the } n \text{ observations available at the forecast origin} $$
Cross validation approach:
- First, we fit a model (e.g., ARIMA) to the first \$n\$ points (i.e. \$0\$ to \$n\$).
- Then, we use the model from Step 1 to forecast the next \$h\$ points (i.e. \$n+1\$ to \$n+h\$) in one shot, and record the error.
- We re-fit the same type of model (e.g., ARIMA) on points \$0\$ to \$n+h\$, forecast the next \$h\$ points again, and so on.
- In the future, we forecast using the model built on the complete set of data available to us.
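A minimal base-R sketch of the fold boundaries the steps above describe (toy values n = 20, h = 5, T = 35, chosen only for illustration; indices are shifted to R's 1-based convention):

```r
# Expanding-window folds: each fold fits on all points up to the origin t,
# then tests on the next h points. Toy values, for illustration only.
folds <- function(n, h, T) {
  origins <- seq(n, T - h, by = h)  # forecast origins: n, n+h, ..., up to T-h
  lapply(origins, function(t) {
    list(train = 1:t, test = (t + 1):(t + h))
  })
}

f <- folds(n = 20, h = 5, T = 35)
# Fold 1 trains on 1..20 and tests on 21..25; fold 2 trains on 1..25, etc.
```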
Mathematically, we can represent this idea as follows:
$$ \text{For each time step } t = n, n+h, \ldots, T-h: $$
$$ \boldsymbol{\hat{\theta}}_t = \arg \min_{\boldsymbol{\theta}} \sum_{i=t-n+1}^t L(y_i, \hat{y}(x_i \mid x_{t-n+1}, \ldots, x_{i-1}; \boldsymbol{\theta})) $$
$$ \hat{y}_{t+h} = \hat{y}(x_{t+h} \mid x_{t-n+1}, \ldots, x_t; \boldsymbol{\hat{\theta}}_t) $$
$$ e_{t+h} = y_{t+h} - \hat{y}_{t+h} $$
Here: $$ L(\cdot) \text{ is a loss function (e.g., squared error)} $$
$$ n \text{ is the size of the initial training window or rolling window} $$
$$ e_{t+h} \text{ is the forecast error} $$
$$ T \text{ is the total number of time points} $$
As we can see, in the dynamic approach, \$\boldsymbol{\hat{\theta}}_t\$ is recomputed at each time step \$t\$ using only the most recent \$n\$ observations \$(x_{t-n+1}, \ldots, x_t)\$. This is computationally more expensive, since the model-fitting process happens many times. This approach tests the performance of a "class of models" rather than a single fitted model. However, I think it is more robust, as it better assesses how the model generalizes — i.e. how reliable ARIMA will be in general on future data.
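To make the rolling scheme concrete, here is a minimal sketch (arbitrary toy values n = 6, h = 2, T = 12, 1-based indices) showing which fixed-size window each refit would use:

```r
# Rolling-window origins: at each origin t, the model is re-fit on the most
# recent n points (x_{t-n+1}, ..., x_t) and forecasts x_{t+h}.
# Toy values for illustration only.
n <- 6; h <- 2; T <- 12
origins <- seq(n, T - h, by = h)                      # t = 6, 8, 10
windows <- lapply(origins, function(t) (t - n + 1):t) # fixed-size train slices
targets <- origins + h                                # forecast target indices
# e.g. fit on 1..6 to forecast point 8, on 3..8 for 10, on 5..10 for 12
```

Unlike the expanding scheme, every window here has the same length n, so each refit sees the same amount of data.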
I tried to represent this visually:
I want to try and write an R function for this cross validation:
library(forecast)
library(ggplot2)
library(tidyverse)
library(gridExtra)
set.seed(123)
# n <- 100: length of the simulated time series (100 data points).
# h <- 5: forecast horizon; the model predicts 5 steps ahead. E.g., if the
#   current time point is t, the model forecasts the value at t+5.
# frequency <- 52: frequency of the simulated time series (weekly data).
# window_size <- 52: size of the rolling window; at each step the model is
#   trained on the most recent 52 observations. E.g., if the current time
#   point is t, the model uses the data from t-51 to t to make the forecast.
n <- 100
h <- 5
frequency <- 52
window_size <- 52
##################################
time_series <- arima.sim(model = list(ar = 0.8), n = n) + 10 + 0.5 * (1:n)
ts_data <- ts(time_series, frequency = frequency)
# Rolling window cross-validation function
rolling_cv <- function(data, h = 1, window_size = 52) {
  n <- length(data)
  forecasts <- numeric(n - window_size - h + 1)
  actual <- numeric(n - window_size - h + 1)
  for (i in 1:(n - window_size - h + 1)) {
    train <- data[i:(i + window_size - 1)]
    test <- data[i + window_size + h - 1]
    model <- auto.arima(train)
    forecast <- forecast(model, h = h)
    forecasts[i] <- forecast$mean[h]
    actual[i] <- test
  }
  return(list(forecasts = forecasts, actual = actual))
}
# Perform rolling window cross-validation
cv_results <- rolling_cv(ts_data, h = h, window_size = window_size)
errors <- cv_results$actual - cv_results$forecasts
percent_errors <- 100 * errors / cv_results$actual
plot_data <- data.frame(
  Time = (1:(length(cv_results$actual))) + window_size + h - 1,
  Actual = cv_results$actual,
  Forecast = cv_results$forecasts,
  Error = errors,
  PercentError = percent_errors
)
p1 <- ggplot(plot_data, aes(x = Time)) +
  geom_line(aes(y = Actual, color = "Actual")) +
  geom_line(aes(y = Forecast, color = "Forecast")) +
  scale_color_manual(values = c("Actual" = "blue", "Forecast" = "red")) +
  labs(title = "Actual vs Forecast", y = "Value") +
  theme_minimal()
p2 <- ggplot(plot_data, aes(x = Time, y = Error)) +
  geom_line(color = "green") +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Forecast Errors", y = "Error") +
  theme_minimal()
p3 <- ggplot(plot_data, aes(x = Time, y = PercentError)) +
  geom_line(color = "purple") +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Forecast Percent Errors", y = "Percent Error") +
  theme_minimal()
grid.arrange(p1, p2, p3, ncol = 1, heights = c(3, 3, 3))
mae <- mean(abs(errors))
rmse <- sqrt(mean(errors^2))
mape <- mean(abs(percent_errors))
cat("Mean Absolute Error:", mae, "\n")
cat("Root Mean Square Error:", rmse, "\n")
cat("Mean Absolute Percentage Error:", mape, "%\n")
Can someone please tell me if I have done this correctly?
1 Answer
Can someone please tell me if I have done this correctly?
Looks correct to me. Just a few suggestions.
On the function:
You repeat
n - window_size - h + 1
and
i + window_size - 1
a few times. Refactor them into semantic variables (e.g., n_results and offset) to reduce duplication and improve readability.
I'd avoid assignments like
forecast <- forecast(model, h = h)
Even though variables don't shadow functions in R, it lowers readability. Here I'd just skip the intermediate assignment.
rolling_cv <- function(data, h = 1, window_size = 52) {
  n_results <- length(data) - window_size - h + 1
  forecasts <- numeric(n_results)
  actual <- numeric(n_results)
  for (i in 1:n_results) {
    offset <- i + window_size - 1
    train <- data[i:offset]
    test <- data[offset + h]
    model <- auto.arima(train)
    forecasts[i] <- forecast(model, h = h)$mean[h]
    actual[i] <- test
  }
  return(list(forecasts = forecasts, actual = actual))
}
On the plot:
The legend currently shrinks the top subplot, which makes it hard(er) to compare the 3 traces. Move the legend into the plot (via legend.position) to align the x-axes.
p1 <- ggplot(plot_data, aes(x = Time)) +
  geom_line(aes(y = Actual, color = "Actual")) +
  geom_line(aes(y = Forecast, color = "Forecast")) +
  scale_color_manual(values = c("Actual" = "blue", "Forecast" = "red")) +
  labs(title = "Actual vs Forecast", y = "Value") +
  theme_minimal() +
  theme(legend.position = c(0.9, 0.3)) +
  theme(legend.box.background = element_rect(colour = "gray"))