Compute rolling median absolute deviation (MAD) in R

Question 1

I am computing the rolling Median Absolute Deviation (MAD) in R for outlier detection in a time series dataset. The goal is to:

Detect outliers based on a rolling MAD.
Exclude previously detected outliers from future MAD calculations.
Ensure the rolling window adapts dynamically by using only non-outlier values.

I have a dataset that looks like this:

Date_Label	Attempts
01-04-2024	186,518
01-05-2024	202,397
01-06-2024	252,707
01-07-2024	236,194
01-08-2024	217,135
01-09-2024	240,986
01-10-2024	205,524
01-11-2024	160,624
01-12-2024	142,238
01-01-2025	193,088

library(tibble)
data <- tibble::tibble(
 Date_Label = as.Date(c("2024-04-01", "2024-05-01", "2024-06-01", "2024-07-01", 
 "2024-08-01", "2024-09-01", "2024-10-01", "2024-11-01", 
 "2024-12-01", "2025-01-01")),
 Attempts = c(186518, 202397, 252707, 236194, 217135, 240986, 205524, 
 160624, 142238, 193088)
)

To detect the outliers, I wrote this piece of code to calculate the rolling MAD:

library(dplyr)
library(zoo)
window_size <- 3
k <- 1
data %>% 
 mutate(attempts_rolling_median = rollapply(Attempts, window_size, median, fill = NA, align = "right"),
 attempts_abs_deviation = abs(Attempts - attempts_rolling_median),
 attempts_rolling_mad = rollapply(attempts_abs_deviation, window_size, median, fill = NA, align = "right", na.rm = TRUE), 
 ) %>% 
 mutate(attempts_lower_bound = attempts_rolling_median - k * attempts_rolling_mad,
 attempts_upper_bound = attempts_rolling_median + k * attempts_rolling_mad) %>% 
 mutate(attempts_anomaly = ifelse(Attempts < attempts_lower_bound, TRUE, FALSE))

I am using rollapply in R to calculate the rolling median and Median Absolute Deviation (MAD) to detect anomalies in a time series dataset.

For example, in my dataset:

Attempts = 160,624 in November is correctly flagged as an outlier.
December shows a downward trend, but I would like December values to also be flagged as outliers due to their deviation from earlier months.
When calculating the MAD for December, I want to exclude November’s outlier and instead use data from October, September, and August.

Issue: Currently, the rolling MAD for December still includes the outlier from November instead of ignoring it.

I tried modifying this approach so that when calculating the rolling MAD for a given month, it excludes previously flagged outliers from the window:

window_size <- 3
k <- 1
data %>% 
 mutate(attempts_rolling_median = rollapply(Attempts, window_size, median, fill = NA, align = "right"),
 attempts_abs_deviation = abs(Attempts - attempts_rolling_median),
 attempts_rolling_mad = rollapply(attempts_abs_deviation, window_size, median, fill = NA, align = "right", na.rm = TRUE), 
 ) %>% 
 mutate(attempts_lower_bound = attempts_rolling_median - k * attempts_rolling_mad,
 attempts_upper_bound = attempts_rolling_median + k * attempts_rolling_mad) %>% 
 mutate(attempts_anomaly = ifelse(Attempts < attempts_lower_bound, TRUE, FALSE)) %>% 
 mutate(
 attempts_cleaned = case_when(attempts_anomaly == TRUE ~ NA, 
 .default = Attempts),
 attempts_rolling_median_cleaned = rollapply(attempts_cleaned, window_size, median, fill = NA, align = "right", na.rm = TRUE)
 ) %>% 
 mutate(attempts_abs_deviation_1 = abs(attempts_cleaned - attempts_rolling_median_cleaned),
 attempts_rolling_mad_1 = rollapply(attempts_abs_deviation_1, window_size, median, fill = NA, align = "right", na.rm = TRUE), 
 ) %>% 
 mutate(attempts_lower_bound_1 = attempts_rolling_median_cleaned - k * attempts_rolling_mad_1,
 attempts_upper_bound_1 = attempts_rolling_median_cleaned + k * attempts_rolling_mad_1) %>% 
 mutate(attempts_anomaly_1 = ifelse(Attempts < attempts_lower_bound_1, TRUE, FALSE))

My current approach manually replaces outliers with NA before recalculating the rolling median. This does not feel efficient, especially for large datasets.

Question: Is there a more efficient way to calculate the rolling MAD while automatically ignoring previously detected outliers?

Any suggestions or alternative approaches would be greatly appreciated.

Comment

Your strategy of ignoring outliers from one window when computing statistics and outliers for subsequent windows is statistically questionable. It could produce effects such as a majority of the data being classified as outliers. Also, it does not prevent some data points from being included as non-outliers in some (earlier) windows but excluded as outliers from other, later windows. The latter is not necessarily a problem in itself, but it sounds like you might think you are avoiding things like that.

Answer

Is there a more efficient way to calculate the rolling MAD while automatically ignoring previously detected outliers?

Near as I can tell, that poor median() function keeps recomputing each slightly moved window from scratch. Prefer a dedicated rolling-median function, such as rollmedian from the zoo package or roll_median from the roll or RcppRoll packages. Right now window_size is "small" (3), but since a naïve maintainer might make it bigger, we wouldn't want it to be a factor in the big-O complexity. A sensible algorithm would maintain a heap, adding the newest element and removing the oldest at each iteration.
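As a rough illustration of swapping the per-window median() call for a dedicated running-median routine, here is a minimal sketch built on base R's stats::runmed(), which is a different function from the ones named above and is used here only because it needs no extra packages. The helper name trailing_median is hypothetical, and the sketch assumes an odd window_size and no missing values.

```r
# trailing_median() is a hypothetical helper, not part of any package.
# stats::runmed() returns centred running medians; a right-aligned window
# ending at position i is the same window as the one centred at i - half.
trailing_median <- function(x, window_size) {
  stopifnot(window_size %% 2 == 1, length(x) >= window_size)
  n <- length(x)
  half <- (window_size - 1) / 2
  centred <- as.numeric(runmed(x, window_size, endrule = "keep"))
  # Pad the first window_size - 1 positions, where no full window exists.
  c(rep(NA_real_, window_size - 1), centred[(half + 1):(n - half)])
}

attempts <- c(186518, 202397, 252707, 236194, 217135, 240986, 205524,
              160624, 142238, 193088)
fast <- trailing_median(attempts, 3)

# Naive reference: recompute every window's median from scratch.
naive <- c(NA, NA, sapply(3:length(attempts),
                          function(i) median(attempts[(i - 2):i])))
stopifnot(isTRUE(all.equal(fast, naive)))
```

The final check only confirms that both approaches agree on this small series; any speed difference would show up with long series and wide windows.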

pipeline

Replacing outliers with NA is a fair strategy. But it might make more sense to discard them entirely, preserving only inliers.

Consume a single row to initialize the median value, and initialize inliers to consist of just that single value.

For each new row,

test whether it is within current bounds, discarding if outlier
if inlier: append to inliers; compute new trailing window median of inliers, along with bounds

In this way it's as though an outlier event "never happened", and it cannot disturb the median value you compute.

If you do not need all inlier values at the end, there's an opportunity to use a circular buffer to save memory.
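The steps above can be sketched in base R roughly as follows. This is one possible reading, not a definitive implementation: the function name flag_outliers_sequential is hypothetical, the first window_size values are accepted unconditionally as a warm-up (an assumption, since a single seed value would give a MAD of zero), the MAD is the plain median absolute deviation of the inlier window (mad() with constant = 1), and the input is assumed to contain no missing values.

```r
# Hypothetical helper: only accepted inliers ever enter the trailing window,
# so a flagged value cannot influence later medians or MADs.
flag_outliers_sequential <- function(x, window_size = 3, k = 1) {
  n <- length(x)
  outlier <- logical(n)
  lower <- rep(NA_real_, n)
  upper <- rep(NA_real_, n)
  inliers <- numeric(n)  # preallocated store of accepted values
  n_in <- 0L             # number of inliers stored so far

  for (i in seq_len(n)) {
    # Assumption: no bounds (and no flags) until the warm-up window is full.
    if (n_in >= window_size) {
      window <- inliers[(n_in - window_size + 1L):n_in]
      centre <- median(window)
      spread <- mad(window, center = centre, constant = 1)
      lower[i] <- centre - k * spread
      upper[i] <- centre + k * spread
      outlier[i] <- x[i] < lower[i] || x[i] > upper[i]
    }
    if (!outlier[i]) {
      n_in <- n_in + 1L
      inliers[n_in] <- x[i]
    }
  }
  data.frame(value = x, lower = lower, upper = upper, outlier = outlier)
}

attempts <- c(186518, 202397, 252707, 236194, 217135, 240986, 205524,
              160624, 142238, 193088)
result <- flag_outliers_sequential(attempts, window_size = 3, k = 1)
```

On the question's data this flags both November and December, because the bounds for December are still derived from the last three accepted values. With k = 1 and a three-value window it also flags several other months, so a larger k or window is probably worth trying before relying on the flags.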

J_H · Accepted Answer · 2025-03-17 15:58:40Z
