I've written the following code to check if a value in a data frame changes. I'm looking at the last 5 values. If there was no change at all I want my code to return 1, if a single one (or multiple) of the last 5 are different to the value that is being checked return 0. Finally I want the returned values in a new column in my data frame.
Here's my code so far. It works but I think there is a nicer (and more clean) way to do it.
mydata <- data.frame("id" = 1:100, "ta" = c(sample(x = c(-5:20), size = 94, replace = T), rep(1,6))) # include a repetition to check if code works
nums <- mydata$id # create a dummy for iteration
qc_dummy <- vector(mode = "list", length = length(nums)) # create a dummy vector for the values computed in the for loop
for(i in 1:length(nums)) {
qc_dummy[[i]] <- ifelse(mydata[nums[i], 2] - mydata[nums[i-1], 2] == 0,
ifelse(mydata[nums[i], 2] - mydata[nums[i-2], 2] == 0,
ifelse(mydata[nums[i], 2] - mydata[nums[i-3], 2] == 0,
ifelse(mydata[nums[i], 2] - mydata[nums[i-4], 2] == 0,
ifelse(mydata[nums[i], 2] - mydata[nums[i-5], 2] == 0, 1, 0) ,0), 0) ,0) ,0)
}
mydata$qc1 <- as.vector(c(0,unlist(qc_dummy))) # first value of list is skipped by unlist (logi(0)) -> add 0
3 Answers 3
I reduced the example data, for easier viewing
# new example data:
mydata <- data.frame(ta = 1:13)
mydata[2:3, 1] <- 1L
mydata[6:12, 1] <- 2L
n <- 3 # how many equal values we need
require(data.table)
setDT(mydata) # convert to data.table
mydata
mydata[, mathcPrev := fifelse((ta - shift(ta, 1)) == 0L, T, F, F)]
mydata[, g := cumsum(!mathcPrev)] # grouping value, if value has changed
mydata[, count := cumsum(mathcPrev), by = g]
mydata[, qc2 := fifelse(count >= n, 1L, 0L)]
mydata
# ta mathcPrev g count qc2
# 1: 1 FALSE 1 0 0
# 2: 1 TRUE 1 1 0
# 3: 1 TRUE 1 2 0
# 4: 4 FALSE 2 0 0
# 5: 5 FALSE 3 0 0
# 6: 2 FALSE 4 0 0
# 7: 2 TRUE 4 1 0
# 8: 2 TRUE 4 2 0
# 9: 2 TRUE 4 3 1
# 10: 2 TRUE 4 4 1
# 11: 2 TRUE 4 5 1
# 12: 2 TRUE 4 6 1
# 13: 13 FALSE 5 0 0
So, the idea is to create index mathcPrev
, that shows if this value matches previous, and then we can count how many equal values we have in a row.
To my understanding you want a new column qc1
that takes value 1 if the current element matches the previous 5 elements and takes value 0 otherwise.
This feels like a great application of run-length encoding. I'll borrow the great example data from @minem:
mydata <- data.frame(ta = 1:13)
mydata[2:3, 1] <- 1L
mydata[6:12, 1] <- 2L
mydata$ta
# [1] 1 1 1 4 5 2 2 2 2 2 2 2 13
The run-length encoding tells us how many times each value is repeated in a row:
rle(mydata$ta)
# Run Length Encoding
# lengths: int [1:5] 3 1 1 7 1
# values : int [1:5] 1 4 5 2 13
We read from this output that we have 5 runs: 1 repeated 3 times, 4 repeated 1 time, 5 repeated 1 time, 2 repeated 7 times, and 13 repeated 1 time. For each run, we know the first 5 values won't be preceded by 5 identical elements (0
in the output), while elements 6 and onward will (1
in the output). So the number of 0
s at the beginning of each run is:
with(rle(mydata$ta), pmin(lengths, 5))
# [1] 3 1 1 5 1
And the number of 1
s at the end of each run is:
with(rle(mydata$ta), pmax(lengths-5, 0))
# [1] 0 0 0 2 0
So we just need to interleave these two vectors within a call to rep
to yield your eventual one-liner for this operation:
mydata$qc1 <- with(rle(mydata$ta),
rep(rep(0:1, length(values)), c(rbind(pmin(lengths, 5), pmax(lengths-5, 0)))))
mydata
# ta qc1
# 1 1 0
# 2 1 0
# 3 1 0
# 4 4 0
# 5 5 0
# 6 2 0
# 7 2 0
# 8 2 0
# 9 2 0
# 10 2 0
# 11 2 1
# 12 2 1
# 13 13 0
If you were planning to do this with a bunch of different window sizes, then a function would make the most sense, which would take the window size as an argument. Here I'll split up the calculation into smaller pieces for readability:
window.repeat <- function(vals, window.size) {
r <- rle(vals)
num.run <- length(r$values)
run.0s <- with(r, pmin(lengths, window.size))
run.1s <- with(r, pmax(lengths-window.size, 0))
rep(rep(0:1, num.run), c(rbind(run.0s, run.1s)))
}
Now we could, for instance, label each element with whether it had 2 or more repeats before it:
mydata$qc1 <- window.repeat(mydata$ta, 2)
mydata
# ta qc1
# 1 1 0
# 2 1 0
# 3 1 1
# 4 4 0
# 5 5 0
# 6 2 0
# 7 2 0
# 8 2 1
# 9 2 1
# 10 2 1
# 11 2 1
# 12 2 1
# 13 13 0
If you are willing to use additional packages, then this could be cleanly handled by performing a rolling apply on your vector. For instance, you could compute if the rolling minimum of your vector with window length 6 equals the rolling maximum with the same window length:
library(RcppRoll)
as.numeric(roll_min(mydata$ta, 6) == roll_max(mydata$ta, 6))
# [1] 0 0 0 0 0 1 1 0
All we need to do is add 0 for the first 5 elements (which have been removed from this calculation), yielding our one-liner:
mydata$qc1 <- c(rep(0, 5), roll_min(mydata$ta, 6) == roll_max(mydata$ta, 6))
mydata
# ta qc1
# 1 1 0
# 2 1 0
# 3 1 0
# 4 4 0
# 5 5 0
# 6 2 0
# 7 2 0
# 8 2 0
# 9 2 0
# 10 2 0
# 11 2 1
# 12 2 1
# 13 13 0
You could also wrap this into a function to allow variable window sizes:
window.repeat <- function(vals, window.size) {
c(rep(0, window.size), roll_min(vals, window.size+1) == roll_max(vals, window.size+1))
}