I am computing the rolling Median Absolute Deviation (MAD) in R for outlier detection in a time series dataset. The goal is to:
- Detect outliers based on a rolling MAD.
- Exclude previously detected outliers from future MAD calculations.
- Ensure the rolling window adapts dynamically by using only non-outlier values.
I have a dataset that looks like this:
Date_Label | Attempts |
---|---|
01-04-2024 | 186,518 |
01-05-2024 | 202,397 |
01-06-2024 | 252,707 |
01-07-2024 | 236,194 |
01-08-2024 | 217,135 |
01-09-2024 | 240,986 |
01-10-2024 | 205,524 |
01-11-2024 | 160,624 |
01-12-2024 | 142,238 |
01-01-2025 | 193,088 |
library(tibble)
data <- tibble::tibble(
  Date_Label = as.Date(c("2024-04-01", "2024-05-01", "2024-06-01", "2024-07-01",
                         "2024-08-01", "2024-09-01", "2024-10-01", "2024-11-01",
                         "2024-12-01", "2025-01-01")),
  Attempts = c(186518, 202397, 252707, 236194, 217135, 240986, 205524,
               160624, 142238, 193088)
)
To detect the outliers, I wrote this piece of code to calculate the rolling MAD:
library(dplyr)
library(zoo)
window_size <- 3
k <- 1
data %>%
  mutate(attempts_rolling_median = rollapply(Attempts, window_size, median, fill = NA, align = "right"),
         attempts_abs_deviation = abs(Attempts - attempts_rolling_median),
         attempts_rolling_mad = rollapply(attempts_abs_deviation, window_size, median, fill = NA, align = "right", na.rm = TRUE),
  ) %>%
  mutate(attempts_lower_bound = attempts_rolling_median - k * attempts_rolling_mad,
         attempts_upper_bound = attempts_rolling_median + k * attempts_rolling_mad) %>%
  mutate(attempts_anomaly = ifelse(Attempts < attempts_lower_bound, TRUE, FALSE))
I am using rollapply in R to calculate the rolling median and Median Absolute Deviation (MAD) to detect anomalies in a time series dataset.
For example, in my dataset:
- Attempts = 160,624 in November is correctly flagged as an outlier.
- December shows a downward trend, but I would like December values to also be flagged as outliers due to their deviation from earlier months.
- When calculating the MAD for December, I want to exclude November’s outlier and instead use data from October, September, and August.
Issue: Currently, the rolling MAD for December still includes the outlier from November instead of ignoring it.
I tried modifying this approach so that when calculating the rolling MAD for a given month, it excludes previously flagged outliers from the window:
window_size <- 3
k <- 1
data %>%
  mutate(attempts_rolling_median = rollapply(Attempts, window_size, median, fill = NA, align = "right"),
         attempts_abs_deviation = abs(Attempts - attempts_rolling_median),
         attempts_rolling_mad = rollapply(attempts_abs_deviation, window_size, median, fill = NA, align = "right", na.rm = TRUE),
  ) %>%
  mutate(attempts_lower_bound = attempts_rolling_median - k * attempts_rolling_mad,
         attempts_upper_bound = attempts_rolling_median + k * attempts_rolling_mad) %>%
  mutate(attempts_anomaly = ifelse(Attempts < attempts_lower_bound, TRUE, FALSE)) %>%
  mutate(
    attempts_cleaned = case_when(attempts_anomaly == TRUE ~ NA,
                                 .default = Attempts),
    attempts_rolling_median_cleaned = rollapply(attempts_cleaned, window_size, median, fill = NA, align = "right", na.rm = TRUE)
  ) %>%
  mutate(attempts_abs_deviation_1 = abs(attempts_cleaned - attempts_rolling_median_cleaned),
         attempts_rolling_mad_1 = rollapply(attempts_abs_deviation_1, window_size, median, fill = NA, align = "right", na.rm = TRUE),
  ) %>%
  mutate(attempts_lower_bound_1 = attempts_rolling_median_cleaned - k * attempts_rolling_mad_1,
         attempts_upper_bound_1 = attempts_rolling_median_cleaned + k * attempts_rolling_mad_1) %>%
  mutate(attempts_anomaly_1 = ifelse(Attempts < attempts_lower_bound_1, TRUE, FALSE))
My current approach manually replaces outliers with NA before recalculating the rolling median. This does not feel efficient, especially for large datasets.
Question: Is there a more efficient way to calculate the rolling MAD while automatically ignoring previously detected outliers?
Any suggestions or alternative approaches would be greatly appreciated.
- Your strategy of ignoring outliers from one window in computing statistics and outliers for subsequent windows is statistically questionable. It could produce effects such as a majority of all the data being assigned to be outliers. Also, it does not prevent the possibility that some data points are included as non-outliers in some (earlier) windows, but excluded as outliers from other, later windows. The latter is not necessarily a problem in itself, but it sounds like you might think you are avoiding things like that. – John Bollinger, commented Mar 18, 2025 at 4:03
1 Answer
Is there a more efficient way to calculate the rolling MAD while automatically ignoring previously detected outliers?
Near as I can tell, that poor median() function keeps considering each slightly moved window from scratch. Prefer a dedicated rolling function, such as zoo's rollmedian() (documented alongside rollmean()) or roll_median() from the roll package.
Right now window_size is "small", 3, but since a naïve maintainer might make it bigger, we wouldn't want it to be a factor in the big-Oh complexity. A sensible algorithm would maintain a heap, adding a new element and removing the oldest at each iteration.
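As a hedged illustration of the point above: zoo, which the question already loads, ships a specialised rollmedian(). The sketch below assumes an odd window_size (rollmedian() requires one) and input without NA values. It only checks that both calls agree on the question's data; it does not benchmark them, and the names via_rollapply and via_rollmedian are made up for this example.

```r
library(zoo)

attempts <- c(186518, 202397, 252707, 236194, 217135,
              240986, 205524, 160624, 142238, 193088)
window_size <- 3  # assumption: odd width, as rollmedian() requires

# Generic version from the question: calls median() afresh on every window.
via_rollapply <- rollapply(attempts, window_size, median,
                           fill = NA, align = "right")

# Specialised running median from the same package.
via_rollmedian <- rollmedian(attempts, window_size,
                             fill = NA, align = "right")

# Both should yield the same trailing medians (NA for the first two rows).
stopifnot(isTRUE(all.equal(as.numeric(via_rollapply),
                           as.numeric(via_rollmedian))))
```

Note that this swap only covers the first pass. Once outliers have been replaced with NA, check the zoo documentation for how rollmedian() treats missing values before relying on it there.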
pipeline
Replacing outliers with NA is a fair strategy. But it might make more sense to discard them entirely, preserving only inliers.
Consume a single row to initialize the median value, and initialize inliers to consist of just that single value.
For each new row,
- test whether it is within current bounds, discarding if outlier
- if inlier: append to inliers; compute new trailing window median of inliers, along with bounds
In this way it's as though an outlier event "never happened", and it cannot disturb the median value you compute.
If you do not need all inlier values at the end, there's an opportunity to use a circular buffer to save memory.
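The steps above can be sketched in base R roughly as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: flag_outliers_sequential is a hypothetical helper name, the input is assumed to contain no NA values, and the spread is the plain (unscaled) MAD of the last window_size inliers. One deliberate deviation from the description: the first window_size rows are accepted unconditionally as a warm-up, because with a single seed value the MAD is zero and every later value that differs from it would be rejected. Only the most recent inliers are kept, which plays the role of the circular buffer.

```r
# Hypothetical helper, not part of any package.
# Assumes `x` is numeric with no NA values.
flag_outliers_sequential <- function(x, window_size = 3, k = 1) {
  n <- length(x)
  lower <- rep(NA_real_, n)
  upper <- rep(NA_real_, n)
  is_outlier <- logical(n)
  recent <- numeric(0)  # the last `window_size` inliers seen so far

  for (i in seq_len(n)) {
    # Test only once the warm-up has filled the buffer.
    if (length(recent) == window_size) {
      med <- median(recent)
      spread <- median(abs(recent - med))  # unscaled MAD of the buffer
      lower[i] <- med - k * spread
      upper[i] <- med + k * spread
      is_outlier[i] <- x[i] < lower[i] || x[i] > upper[i]
    }
    # Outliers never enter the buffer, so they cannot shift later bounds.
    if (!is_outlier[i]) {
      recent <- tail(c(recent, x[i]), window_size)
    }
  }

  data.frame(value = x, lower = lower, upper = upper, is_outlier = is_outlier)
}

attempts <- c(186518, 202397, 252707, 236194, 217135,
              240986, 205524, 160624, 142238, 193088)
result <- flag_outliers_sequential(attempts, window_size = 3, k = 1)
```

Each row costs a constant amount of work for a fixed window, and no intermediate columns are needed. With k = 1 and a window of 3 the bounds are tight: on the question's ten values this two-sided version flags rows 4, 6, 8, 9 and 10, which is the kind of effect the comment under the question warns about. A larger k or window is worth considering.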