Merge dataframes on timestamps and time intervals using data.table in R

Question 1

I have two dataframes (x and y) that I need to join, conditional on the timestamp in x falling within the time interval of two columns in y. I've accomplished this using data.table::foverlaps() by adapting some of the code in this stackexchange question), but in order to get it to work on my data I had to set the key of data.table x, which according to the documentation of foverlaps(), isn't supposed to be necessary.

Am I using this foverlaps()incorrectly? If so, new ideas for how to accomplish this data merging task are welcome.

library(data.table)

Data snippets

x <- structure(list(TagID = c(20161L, 23055L, 45428L, 2627L), DateTimePST = structure(c(1415481096, 
1380768444, 1474560076, 1511384035), class = c("POSIXct", "POSIXt"
), tzone = "Pacific/Pitcairn"), Receiver = c(102140L, 112568L, 
112568L, 121907L)), class = "data.frame", row.names = c(NA, -4L
))
y <- structure(list(Station = c("YBAAG4", "YBAAG4", "YBCSSW", "YBCSSW", 
"YBBCD", "YBAWW"), Receiver = c(102140L, 102140L, 112568L, 112568L, 
121907L, 121907L), Start = structure(c(1464979020, 1409256300, 
1369945920, 1470761034, 1505494980, 1409246700), class = c("POSIXct", 
"POSIXt"), tzone = "Pacific/Pitcairn"), End = structure(c(1473357300, 
1421878500, 1382638020, 1479838293, 1513282440, 1421871360), class = c("POSIXct", 
"POSIXt"), tzone = "Pacific/Pitcairn")), class = "data.frame", row.names = c(NA, 
-6L))
# preview data
> x
 TagID DateTimePST Receiver
1 20161 2014年11月08日 13:11:36 102140
2 23055 2013年10月02日 18:47:24 112568
3 45428 2016年09月22日 08:01:16 112568
4 2627 2017年11月22日 12:53:55 121907
> y
 Station Receiver Start End
1 YBAAG4 102140 2016年06月03日 10:37:00 2016年09月08日 09:55:00
2 YBAAG4 102140 2014年08月28日 12:05:00 2015年01月21日 14:15:00
3 YBCSSW 112568 2013年05月30日 12:32:00 2013年10月24日 10:07:00
4 YBCSSW 112568 2016年08月09日 08:43:54 2016年11月22日 10:11:33
5 YBBCD 121907 2017年09月15日 09:03:00 2017年12月14日 12:14:00
6 YBAWW 121907 2014年08月28日 09:25:00 2015年01月21日 12:16:00

Because some Receiver numbers are associated with more than one Station, it is important to merge these two datasets on the timestamp (DateTimePST), not on the receiver number.

Prep data

x <- as.data.table(x); x$Start = x$DateTimePST; x$End = x$DateTimePST # needs these start and end columns, otherwise foverlaps throws an error
y <- as.data.table(y)
# set keys: if I omit setting the key on x, forverlap() throws an error
setkey(y, Start, End); setkey(x, Start, End)

`foverlaps()` join, then discard incorrect `Receiver` pairs

result <- data.frame(foverlaps(x, y, type = "within"))
result <- result[result$Receiver == result$i.Receiver, ] # this filters down to the correct receiver matches

Clean up results

rm_col <- c("i.Receiver", "i.Start", "i.End")
result <- result[ , !(colnames(result) %in% rm_col)]
result
Station Receiver Start End TagID DateTimePST
1 YBCSSW 112568 2013年05月30日 12:32:00 2013年10月24日 10:07:00 23055 2013年10月02日 18:47:24
3 YBAAG4 102140 2014年08月28日 12:05:00 2015年01月21日 14:15:00 20161 2014年11月08日 13:11:36
4 YBCSSW 112568 2016年08月09日 08:43:54 2016年11月22日 10:11:33 45428 2016年09月22日 08:01:16
5 YBBCD 121907 2017年09月15日 09:03:00 2017年12月14日 12:14:00 2627 2017年11月22日 12:53:55

Wrap in a function:

getStation <- function(x, y) {
 x <- as.data.table(x); x$Start = x$DateTimePST; x$End = x$DateTimePST
 y <- as.data.table(y)
 setkey(y, Start, End); setkey(x, Start, End)
 result <- data.frame(foverlaps(x, y, type = "within"))
 result <- result[result$Receiver == result$i.Receiver, ]
 result <- result[ , !(colnames(result) %in% c("i.Receiver", "i.Start", "i.End"))]
}
res <- getStation(x, y)

Question 2

You'll need to use by.x to get around setting x's key.

Also, note that foverlaps can merge on any number of keys, and then do the overlaps only on the final two. What that means here is that you can do the x.Receiver == i.Receiver filter up-front, which also simplifies the clean-up step.

x = as.data.table(x)
x[ , end := DateTimePST]
y = as.data.table(y)
setkey(y, Receiver, Start, End)
result = foverlaps(x, y, by.x = c('Receiver', 'DateTimePST', 'end'), type = 'within')
result[ , end := NULL]
setDF(result)

I do find it a bit awkward/strange that you've got to define end when by.x = c('Receiver', 'DateTimePST', 'DateTimePST') would appear to be fine, and have filed a feature request.

If you don't need DateTimePST in your output, you might find the non-equi-join version more compact / easier to read:

y[x, on = .(Receiver == Receiver, Start < DateTimePST, End > DateTimePST)]
# Station Receiver Start End
# 1: YBAAG4 102140 2014年11月08日 13:11:36 2014年11月08日 13:11:36
# 2: YBCSSW 112568 2013年10月02日 18:47:24 2013年10月02日 18:47:24
# 3: YBCSSW 112568 2016年09月22日 08:01:16 2016年09月22日 08:01:16
# 4: YBBCD 121907 2017年11月22日 12:53:55 2017年11月22日 12:53:55
# TagID
# 1: 20161
# 2: 23055
# 3: 45428
# 4: 2627

Note the order y[x -- we're using x to "look up" rows of y.

If you do need DateTimePST, it's just a bit of an ugly extension:

y[x, c(.SD, list(DateTimePST=DateTimePST)),
 on = .(Receiver == Receiver, Start < DateTimePST, End > DateTimePST)]

Finally, we can think of this as adding columns to x by a non-equi-join with y by reversing the order:

x[y, `:=`(Start = i.Start, End = i.End, Station = i.Station),
 on = .(Receiver == Receiver, DateTimePST > Start, DateTimePST < End)]

Question 3

Super helpful options, thank you! I do need DateTimePST, so the last two methods were both good ones. One note on the "ugly extension" one, that only works for me when I set x$Start = x$DateTimePST, otherwise I get an error: "Error in [.data.table(y, x, c(.SD, list(DateTimePST = DateTimePST)), : column(s) not found: Start" . I'm still learning data.table though

MichaelChirico MichaelChirico 1764 bronze badges · Accepted Answer · 2019-07-23 00:21:37Z

You'll need to use by.x to get around setting x's key.

Also, note that foverlaps can merge on any number of keys, and then do the overlaps only on the final two. What that means here is that you can do the x.Receiver == i.Receiver filter up-front, which also simplifies the clean-up step.

x = as.data.table(x)
x[ , end := DateTimePST]
y = as.data.table(y)
setkey(y, Receiver, Start, End)
result = foverlaps(x, y, by.x = c('Receiver', 'DateTimePST', 'end'), type = 'within')
result[ , end := NULL]
setDF(result)

I do find it a bit awkward/strange that you've got to define end when by.x = c('Receiver', 'DateTimePST', 'DateTimePST') would appear to be fine, and have filed a feature request.

If you don't need DateTimePST in your output, you might find the non-equi-join version more compact / easier to read:

y[x, on = .(Receiver == Receiver, Start < DateTimePST, End > DateTimePST)]
# Station Receiver Start End
# 1: YBAAG4 102140 2014年11月08日 13:11:36 2014年11月08日 13:11:36
# 2: YBCSSW 112568 2013年10月02日 18:47:24 2013年10月02日 18:47:24
# 3: YBCSSW 112568 2016年09月22日 08:01:16 2016年09月22日 08:01:16
# 4: YBBCD 121907 2017年11月22日 12:53:55 2017年11月22日 12:53:55
# TagID
# 1: 20161
# 2: 23055
# 3: 45428
# 4: 2627

Note the order y[x -- we're using x to "look up" rows of y.

If you do need DateTimePST, it's just a bit of an ugly extension:

y[x, c(.SD, list(DateTimePST=DateTimePST)),
 on = .(Receiver == Receiver, Start < DateTimePST, End > DateTimePST)]

Finally, we can think of this as adding columns to x by a non-equi-join with y by reversing the order:

x[y, `:=`(Start = i.Start, End = i.End, Station = i.Station),
 on = .(Receiver == Receiver, DateTimePST > Start, DateTimePST < End)]

Super helpful options, thank you! I do need DateTimePST, so the last two methods were both good ones. One note on the "ugly extension" one, that only works for me when I set x$Start = x$DateTimePST, otherwise I get an error: "Error in [.data.table(y, x, c(.SD, list(DateTimePST = DateTimePST)), : column(s) not found: Start" . I'm still learning data.table though

Stack Exchange Network

Merge dataframes on timestamps and time intervals using data.table in R

Data snippets

Prep data

`foverlaps()` join, then discard incorrect `Receiver` pairs

Clean up results

Wrap in a function:

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Merge dataframes on timestamps and time intervals using data.table in R

Data snippets

Prep data

foverlaps() join, then discard incorrect Receiver pairs

Clean up results

Wrap in a function:

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions

`foverlaps()` join, then discard incorrect `Receiver` pairs