I have two dataframes
(x
and y
) that I need to join, conditional on the timestamp in x
falling within the time interval of two columns in y
. I've accomplished this using data.table::foverlaps()
by adapting some of the code in this stackexchange question), but in order to get it to work on my data I had to set the key
of data.table
x
, which according to the documentation of foverlaps()
, isn't supposed to be necessary.
Am I using this foverlaps()
incorrectly? If so, new ideas for how to accomplish this data merging task are welcome.
library(data.table)
Data snippets
x <- structure(list(TagID = c(20161L, 23055L, 45428L, 2627L), DateTimePST = structure(c(1415481096,
1380768444, 1474560076, 1511384035), class = c("POSIXct", "POSIXt"
), tzone = "Pacific/Pitcairn"), Receiver = c(102140L, 112568L,
112568L, 121907L)), class = "data.frame", row.names = c(NA, -4L
))
y <- structure(list(Station = c("YBAAG4", "YBAAG4", "YBCSSW", "YBCSSW",
"YBBCD", "YBAWW"), Receiver = c(102140L, 102140L, 112568L, 112568L,
121907L, 121907L), Start = structure(c(1464979020, 1409256300,
1369945920, 1470761034, 1505494980, 1409246700), class = c("POSIXct",
"POSIXt"), tzone = "Pacific/Pitcairn"), End = structure(c(1473357300,
1421878500, 1382638020, 1479838293, 1513282440, 1421871360), class = c("POSIXct",
"POSIXt"), tzone = "Pacific/Pitcairn")), class = "data.frame", row.names = c(NA,
-6L))
# preview data
> x
TagID DateTimePST Receiver
1 20161 2014年11月08日 13:11:36 102140
2 23055 2013年10月02日 18:47:24 112568
3 45428 2016年09月22日 08:01:16 112568
4 2627 2017年11月22日 12:53:55 121907
> y
Station Receiver Start End
1 YBAAG4 102140 2016年06月03日 10:37:00 2016年09月08日 09:55:00
2 YBAAG4 102140 2014年08月28日 12:05:00 2015年01月21日 14:15:00
3 YBCSSW 112568 2013年05月30日 12:32:00 2013年10月24日 10:07:00
4 YBCSSW 112568 2016年08月09日 08:43:54 2016年11月22日 10:11:33
5 YBBCD 121907 2017年09月15日 09:03:00 2017年12月14日 12:14:00
6 YBAWW 121907 2014年08月28日 09:25:00 2015年01月21日 12:16:00
Because some Receiver
numbers are associated with more than one Station
, it is important to merge these two datasets on the timestamp (DateTimePST
), not on the receiver number.
Prep data
x <- as.data.table(x); x$Start = x$DateTimePST; x$End = x$DateTimePST # needs these start and end columns, otherwise foverlaps throws an error
y <- as.data.table(y)
# set keys: if I omit setting the key on x, forverlap() throws an error
setkey(y, Start, End); setkey(x, Start, End)
foverlaps()
join, then discard incorrect Receiver
pairs
result <- data.frame(foverlaps(x, y, type = "within"))
result <- result[result$Receiver == result$i.Receiver, ] # this filters down to the correct receiver matches
Clean up results
rm_col <- c("i.Receiver", "i.Start", "i.End")
result <- result[ , !(colnames(result) %in% rm_col)]
result
Station Receiver Start End TagID DateTimePST
1 YBCSSW 112568 2013年05月30日 12:32:00 2013年10月24日 10:07:00 23055 2013年10月02日 18:47:24
3 YBAAG4 102140 2014年08月28日 12:05:00 2015年01月21日 14:15:00 20161 2014年11月08日 13:11:36
4 YBCSSW 112568 2016年08月09日 08:43:54 2016年11月22日 10:11:33 45428 2016年09月22日 08:01:16
5 YBBCD 121907 2017年09月15日 09:03:00 2017年12月14日 12:14:00 2627 2017年11月22日 12:53:55
Wrap in a function:
getStation <- function(x, y) {
x <- as.data.table(x); x$Start = x$DateTimePST; x$End = x$DateTimePST
y <- as.data.table(y)
setkey(y, Start, End); setkey(x, Start, End)
result <- data.frame(foverlaps(x, y, type = "within"))
result <- result[result$Receiver == result$i.Receiver, ]
result <- result[ , !(colnames(result) %in% c("i.Receiver", "i.Start", "i.End"))]
}
res <- getStation(x, y)
1 Answer 1
You'll need to use by.x
to get around setting x
's key.
Also, note that foverlaps
can merge on any number of keys, and then do the overlaps only on the final two. What that means here is that you can do the x.Receiver == i.Receiver
filter up-front, which also simplifies the clean-up step.
x = as.data.table(x)
x[ , end := DateTimePST]
y = as.data.table(y)
setkey(y, Receiver, Start, End)
result = foverlaps(x, y, by.x = c('Receiver', 'DateTimePST', 'end'), type = 'within')
result[ , end := NULL]
setDF(result)
I do find it a bit awkward/strange that you've got to define end
when by.x = c('Receiver', 'DateTimePST', 'DateTimePST')
would appear to be fine, and have filed a feature request.
If you don't need DateTimePST
in your output, you might find the non-equi-join version more compact / easier to read:
y[x, on = .(Receiver == Receiver, Start < DateTimePST, End > DateTimePST)]
# Station Receiver Start End
# 1: YBAAG4 102140 2014年11月08日 13:11:36 2014年11月08日 13:11:36
# 2: YBCSSW 112568 2013年10月02日 18:47:24 2013年10月02日 18:47:24
# 3: YBCSSW 112568 2016年09月22日 08:01:16 2016年09月22日 08:01:16
# 4: YBBCD 121907 2017年11月22日 12:53:55 2017年11月22日 12:53:55
# TagID
# 1: 20161
# 2: 23055
# 3: 45428
# 4: 2627
Note the order y[x
-- we're using x
to "look up" rows of y
.
If you do need DateTimePST
, it's just a bit of an ugly extension:
y[x, c(.SD, list(DateTimePST=DateTimePST)),
on = .(Receiver == Receiver, Start < DateTimePST, End > DateTimePST)]
Finally, we can think of this as adding columns to x
by a non-equi-join with y
by reversing the order:
x[y, `:=`(Start = i.Start, End = i.End, Station = i.Station),
on = .(Receiver == Receiver, DateTimePST > Start, DateTimePST < End)]
-
\$\begingroup\$ Super helpful options, thank you! I do need
DateTimePST
, so the last two methods were both good ones. One note on the "ugly extension" one, that only works for me when I setx$Start = x$DateTimePST
, otherwise I get an error: "Error in
[.data.table(y, x, c(.SD, list(DateTimePST = DateTimePST)), : column(s) not found: Start
" . I'm still learningdata.table
though \$\endgroup\$Von– Von2019年07月23日 05:01:11 +00:00Commented Jul 23, 2019 at 5:01