I am trying to fill values based on group, in my case id
. I would like to fill the missing values according to the available date
info for each id
.
id date 1 1 23-04 2 1 23-04 3 1 <NA> 4 1 <NA> 5 2 24-04 6 2 <NA> 7 2 <NA> 8 2 <NA> 9 3 23-04 10 3 <NA> 11 3 <NA> 12 3 <NA> 13 4 <NA> 14 4 <NA> 15 4 <NA> 16 4 <NA>
What I need is:
id date 1 1 23-04 2 1 23-04 3 1 23-04 4 1 23-04 5 2 24-04 6 2 24-04 7 2 24-04 8 2 24-04 9 3 23-04 10 3 23-04 11 3 23-04 12 3 23-04 13 4 <NA> 14 4 <NA> 15 4 <NA> 16 4 <NA>
I figured out a loop, but I would like to avoid it because my data has 23 millions rows:
for(i in 2:nrow(dta)){
if(dta$id[i-1] == dta$id[i])
{
dta$date[i] = dta$date[i-1]
}
}
I cannot figure out how to translate this into dplyr syntax:
dta = structure(list(id = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L), .Label = c("1", "2", "3", "4"
), class = "factor"), date = structure(c(1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 1L, 1L, 1L, 1L, NA, NA, NA, NA), .Label = c("23-04",
"24-04"), class = "factor")), .Names = c("id", "date"), row.names = c(NA,
-16L), class = "data.frame")
2 Answers 2
In R this is usually solved using the na.locf
(Last Observation Carried Forward) function from the zoo package.
See also here:
- Fill in NA based on the last non-NA value for each group in R
- Using dplyr window-functions to make trailing values
# test data
x <- read.table(text="id;date
1;23-04
1;23-04
1;NA
1;NA
2;24-04
2;NA
2;NA
2;NA
3;23-04
3;NA
3;NA
3;NA
4;NA
4;NA
4;NA
4;NA", header=TRUE, sep=";")
library("dplyr")
library("zoo")
x %>% group_by(id) %>% transmute(date=na.locf(date, na.rm=FALSE))
Source: local data frame [16 x 2]
Groups: id
id date
1 1 23-04
2 1 23-04
3 1 23-04
4 1 23-04
5 2 24-04
6 2 24-04
7 2 24-04
8 2 24-04
9 3 23-04
10 3 23-04
11 3 23-04
12 3 23-04
13 4 NA
14 4 NA
15 4 NA
16 4 NA
Another option are rolling self-joins supported by the data.table package (see here).
-
\$\begingroup\$ interesting I will try it on the whole sample \$\endgroup\$giac– giac2015年07月27日 15:52:08 +00:00Commented Jul 27, 2015 at 15:52
-
\$\begingroup\$ It would be great to see how we can do it with data.table and nomatch=roll or roll=Inf. I'm not able to find any example. \$\endgroup\$skan– skan2017年07月11日 10:03:24 +00:00Commented Jul 11, 2017 at 10:03
There is no need for packages. This is easily and efficiently done with basic syntax.
# Create data.table
dtk <- data.table( var=c(NA,NA,1,NA,NA,2,3,NA,NA,NA,NA,4,5,NA,NA,NA,NA,NA,6,7,NA,NA), group=c("A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","B","B","B","B"))
# Backward fill
dtk[,order:= (1:.N)];
setorder(dtk,-order);
dtk[, var_backward_fill := var[1], by= .( group , cumsum(!is.na(var))) ];
setorder(dtk, order);
dtk[,order:= NULL];
# Forward fill
dtk[, var_forward_fill := var[1], by= .( group , cumsum(!is.na(var))) ]