
Consider a generic (univariate) time series model:

$$ \hat{y}(x_n, \ldots, x_{n+h-1} \mid x_0, \ldots, x_{n-1}; \boldsymbol{\theta}) $$

where:

  • \$\hat{y}\$ is the forecast for the next \$h\$ time points \$x_n, \ldots, x_{n+h-1}\$,
  • \$\boldsymbol{\theta} = (\theta_1, \theta_2, \ldots, \theta_k)\$ is the vector of model parameters,
  • \$x_0, \ldots, x_{n-1}\$ are the observations available at the forecast origin (a model of order \$p\$ uses only the \$p\$ most recent of them, \$x_{n-p}, \ldots, x_{n-1}\$).
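For instance (an illustration of mine, not tied to any particular package), an AR(\$p\$) model is one member of this family; its one-step-ahead forecast is

$$ \hat{y}(x_n \mid x_0, \ldots, x_{n-1}; \boldsymbol{\theta}) = c + \sum_{j=1}^{p} \phi_j \, x_{n-j}, \qquad \boldsymbol{\theta} = (c, \phi_1, \ldots, \phi_p), $$

with multi-step forecasts obtained by feeding earlier forecasts back in recursively.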

Cross-validation approach:

  1. Here, we fit a model (e.g., ARIMA) to the first \$n\$ points (i.e., \$x_0\$ to \$x_{n-1}\$).
  2. Then, we use the model from Step 1 to forecast the next \$h\$ points (i.e., \$x_n\$ to \$x_{n+h-1}\$) in one shot and record the errors.
  3. We re-fit the same type of model (e.g., ARIMA) to \$x_0\$ through \$x_{n+h-1}\$, have it predict the next \$h\$ points again, and so on (a minimal sketch of these steps follows the list).
  4. In the future, we will forecast using the model built on the complete set of data available to us.
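As referenced above, here is a minimal sketch of steps 1–3. The names y (the raw series) and n0 (the initial training size) are placeholders of mine, and auto.arima from the forecast package stands in for whatever model class is being validated:

library(forecast)
# Expanding-window sketch: refit every h points, forecast h ahead in one shot
errors <- c()
for (t in seq(n0, length(y) - h, by = h)) {
  fit <- auto.arima(y[1:t])                      # steps 1 & 3: refit on y[1..t]
  fc <- forecast(fit, h = h)$mean                # step 2: h-step forecast
  errors <- c(errors, y[(t + 1):(t + h)] - fc)   # record the errors
}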

Mathematically, we can represent this idea as follows:

For each time step \$t = n, n+h, \ldots, T-h\$:

$$ \boldsymbol{\hat{\theta}}_t = \arg \min_{\boldsymbol{\theta}} \sum_{i=t-n+1}^t L(y_i, \hat{y}(x_i \mid x_{t-n+1}, \ldots, x_{i-1}; \boldsymbol{\theta})) $$

$$ \hat{y}_{t+h} = \hat{y}(x_{t+h} \mid x_{t-n+1}, \ldots, x_t; \boldsymbol{\hat{\theta}}_t) $$

$$ e_{t+h} = y_{t+h} - \hat{y}_{t+h} $$

Here:

  • \$L(\cdot)\$ is a loss function (e.g., squared error),
  • \$n\$ is the size of the initial training window or rolling window,
  • \$e_{t+h}\$ is the forecast error,
  • \$T\$ is the total number of time points.
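For concreteness, a small worked instance (the numbers are illustrative and match the simulation below): with \$n = 52\$ (the rolling window, window_size in the code), \$h = 5\$, and \$T = 100\$, the forecast origins are

$$ t = 52,\ 57,\ 62,\ \ldots,\ 92, $$

giving nine \$5\$-step-ahead errors \$e_{57}, e_{62}, \ldots, e_{97}\$. (For reference, the R code below advances the origin by one step rather than by \$h\$, so it produces \$100 - 52 - 5 + 1 = 44\$ errors instead.)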

As we can see, in the dynamic approach, \$\boldsymbol{\hat{\theta}}_t\$ is recomputed at each time step \$t\$ using only the most recent \$n\$ observations \$(x_{t-n+1}, \ldots, x_t)\$. This is computationally less efficient, as the model-fitting process happens many times. This approach tests the performance of a "class of models" rather than a single fitted model. However, I think it is more robust, as it better measures how well the model class generalizes, i.e., how reliable ARIMA will be in general in the future.

I tried to represent this visually:

[figure: diagram of the rolling-window cross-validation scheme described above]

I tried to write an R function for this cross-validation:

library(forecast)
library(ggplot2)
library(tidyverse)
library(gridExtra)
set.seed(123)
# n: length of the simulated time series (100 observations).
# h: forecast horizon; the model predicts 5 steps ahead, i.e. if the
#    current time point is t, it forecasts the value at t + 5.
# frequency: frequency of the simulated series (52, e.g. weekly data).
# window_size: size of the rolling window; at each step the model is
#    trained on the most recent 52 observations, i.e. for current time
#    point t the training window runs from t - 51 to t.
n <- 100
h <- 5
frequency <- 52
window_size <- 52
##################################
# Simulate an AR(1) series (phi = 0.8) with an upward linear trend
time_series <- arima.sim(model = list(ar = 0.8), n = n) + 10 + 0.5 * (1:n)
ts_data <- ts(time_series, frequency = frequency)
# Rolling window cross-validation function
rolling_cv <- function(data, h = 1, window_size = 52) {
  n <- length(data)
  forecasts <- numeric(n - window_size - h + 1)
  actual <- numeric(n - window_size - h + 1)

  for (i in 1:(n - window_size - h + 1)) {
    train <- data[i:(i + window_size - 1)]
    test <- data[i + window_size + h - 1]

    model <- auto.arima(train)
    forecast <- forecast(model, h = h)

    forecasts[i] <- forecast$mean[h]
    actual[i] <- test
  }

  return(list(forecasts = forecasts, actual = actual))
}
# Perform rolling window cross-validation
cv_results <- rolling_cv(ts_data, h = h, window_size = window_size)
errors <- cv_results$actual - cv_results$forecasts
percent_errors <- 100 * errors / cv_results$actual

plot_data <- data.frame(
  Time = (1:length(cv_results$actual)) + window_size + h - 1,
  Actual = cv_results$actual,
  Forecast = cv_results$forecasts,
  Error = errors,
  PercentError = percent_errors
)

p1 <- ggplot(plot_data, aes(x = Time)) +
  geom_line(aes(y = Actual, color = "Actual")) +
  geom_line(aes(y = Forecast, color = "Forecast")) +
  scale_color_manual(values = c("Actual" = "blue", "Forecast" = "red")) +
  labs(title = "Actual vs Forecast", y = "Value") +
  theme_minimal()

p2 <- ggplot(plot_data, aes(x = Time, y = Error)) +
  geom_line(color = "green") +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Forecast Errors", y = "Error") +
  theme_minimal()

p3 <- ggplot(plot_data, aes(x = Time, y = PercentError)) +
  geom_line(color = "purple") +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Forecast Percent Errors", y = "Percent Error") +
  theme_minimal()

grid.arrange(p1, p2, p3, ncol = 1, heights = c(3, 3, 3))

mae <- mean(abs(errors))
rmse <- sqrt(mean(errors^2))
mape <- mean(abs(percent_errors))
cat("Mean Absolute Error:", mae, "\n")
cat("Root Mean Square Error:", rmse, "\n")
cat("Mean Absolute Percentage Error:", mape, "%\n")

[figure: grid.arrange output with the Actual vs Forecast, Forecast Errors, and Forecast Percent Errors panels]

Can someone please tell me if I have done this correctly?

tdy
asked Aug 15, 2024 at 3:54

1 Answer


Can someone please tell me if I have done this correctly?

Looks correct to me. Just a few suggestions.

On the function:

  • You repeat n - window_size - h + 1 and i + window_size - 1 a few times. Refactor them into semantic variables (e.g., n_results and offset) to reduce duplication and improve readability.

  • I'd avoid assignments like forecast <- forecast(model, h = h). Even though a variable doesn't shadow a function of the same name in R, reusing the name lowers readability. Here I'd just skip the intermediate assignment.

    rolling_cv <- function(data, h = 1, window_size = 52) {
      n_results <- length(data) - window_size - h + 1
      forecasts <- numeric(n_results)
      actual <- numeric(n_results)
      for (i in 1:n_results) {
        offset <- i + window_size - 1
        train <- data[i:offset]
        test <- data[offset + h]
        model <- auto.arima(train)
        forecasts[i] <- forecast(model, h = h)$mean[h]
        actual[i] <- test
      }
      return(list(forecasts = forecasts, actual = actual))
    }
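To illustrate the name-reuse point, a tiny demo of mine: R looks up functions and plain variables separately, so the assignment works, it just reads badly.

    mean <- mean(1:10)   # 'mean' is now the number 5.5
    mean(1:10)           # still calls base::mean and returns 5.5
    # R resolves mean(...) as a function lookup, skipping the numeric 'mean'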
    
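Separately, as a cross-check (my suggestion, not a correctness issue): the forecast package's built-in tsCV() performs the same rolling-origin evaluation, advancing the origin one step at a time like your loop. A minimal sketch, assuming the same horizon and window size:

    # tsCV() re-applies the forecast function at each origin; window = 52
    # fixes a rolling training window. The result has one column of errors
    # per horizon 1..h.
    e <- tsCV(ts_data, function(y, h) forecast(auto.arima(y), h = h),
              h = 5, window = 52)
    sqrt(mean(e[, 5]^2, na.rm = TRUE))   # RMSE at horizon 5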

On the plot:

  • The legend currently shrinks the top subplot, which makes it hard(er) to compare the 3 traces. Move the legend inside the plot area (via legend.position) so the x-axes align.

    p1 <- ggplot(plot_data, aes(x = Time)) +
      geom_line(aes(y = Actual, color = "Actual")) +
      geom_line(aes(y = Forecast, color = "Forecast")) +
      scale_color_manual(values = c("Actual" = "blue", "Forecast" = "red")) +
      labs(title = "Actual vs Forecast", y = "Value") +
      theme_minimal() +
      theme(legend.position = c(0.9, 0.3),
            legend.box.background = element_rect(colour = "gray"))
    

[figure: the three subplots with aligned x-axes and the legend inside the top panel]
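As an aside (my own suggestion, not something the original code used): the patchwork package aligns stacked ggplots by default, which sidesteps the legend/axis issue entirely:

    library(patchwork)
    # '/' stacks the plots vertically with aligned x-axes
    p1 / p2 / p3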

answered Aug 16, 2024 at 18:35