I have three data frames A, B and C:
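set.seed(0)
N <- 5
# cbind() yields a character matrix whenever the ids are included, so the columns of A and C start out as character
A <- data.frame(cbind(date = c(2, 3, 5, 1), x = NA, id = sample(letters[1:2], 4, replace = TRUE)), stringsAsFactors = FALSE)
B <- data.frame(cbind(date = 1:N, y = runif(N)), stringsAsFactors = FALSE)
C <- data.frame(cbind(date = 1:N, z = 100 + sample(N), id = rep(letters[1:2], N)), stringsAsFactors = FALSE)
C$z <- as.numeric(C$z)  # z has to be numeric for the product computed later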
and they look like this:
A
B
C
> A
  date    x id
1    2 <NA>  b
2    3 <NA>  a
3    5 <NA>  a
4    1 <NA>  b
> B
  date      y
1    1 0.9082
2    2 0.2017
3    3 0.8984
4    4 0.9447
5    5 0.6608
> C
   date   z id
1     1 104  a
2     2 101  b
3     3 105  a
4     4 103  b
5     5 102  a
6     1 104  b
7     2 101  a
8     3 105  b
9     4 103  a
10    5 102  b
I would like to fill in A$x with a function of y and z, let's say, for instance, the product B$y*C$z for the corresponding dates and ids, like this:
for (i in seq_along(A$x)) {
  A$x[i] <- B$y[A$date[i] == B$date] * C$z[A$date[i] == C$date & A$id[i] == C$id]
}
> A
  date                x id
1    2  20.369875034783  b
2    3 94.3309169216082  a
3    5 67.4013748336583  a
4    1 94.4536101594567  b
This is a very bad idea for a data set with many elements (obviously), as it is slow. I also tried with match() and which(), but there isn't any significant speed-up, I believe. Maybe I could use dcast(), after merging everything into one data frame, but I would prefer not to merge the data frames at all (if this can be avoided).
Is it possible to do it more efficiently?
1 Answer
Although you mention that you don't want to merge the data frames, your for loop could be replaced with this single line using merge:
with(merge(A, merge(B, C)), data.frame(date, x=y * z, id))
Given your example of A, B and C, this returns a data frame (with rows sorted by date rather than in A's original order):
  date        x id
1    1 94.45361  b
2    2 20.36988  b
3    3 94.33092  a
4    5 67.40137  a
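To make the one-liner easier to follow, it can be unpacked into its two joins. This is only a sketch: the tables are hand-built copies of the example with y rounded to four digits, they are created with data.frame() directly (so date and z are numeric from the start, which is an assumption that differs from the cbind() construction in the question), and the names BC, ABC and res are made up for illustration.

```r
# Hand-built copies of the example tables (y rounded as printed in the question).
A <- data.frame(date = c(2, 3, 5, 1), x = NA_real_, id = c("b", "a", "a", "b"),
                stringsAsFactors = FALSE)
B <- data.frame(date = 1:5, y = c(0.9082, 0.2017, 0.8984, 0.9447, 0.6608))
C <- data.frame(date = rep(1:5, 2), z = rep(c(104, 101, 105, 103, 102), 2),
                id = rep(c("a", "b"), 5), stringsAsFactors = FALSE)

# Step 1: attach y to every row of C; date is the only shared column.
BC <- merge(B, C, by = "date")

# Step 2: keep only the (date, id) pairs that occur in A.
# A's all-NA x column is left out so it cannot clash with the new x.
ABC <- merge(A[c("date", "id")], BC, by = c("date", "id"))

# Step 3: compute the product; merge() has sorted the rows by the join keys.
res <- data.frame(date = ABC$date, x = ABC$y * ABC$z, id = ABC$id,
                  stringsAsFactors = FALSE)
res
```

Naming the by columns explicitly is optional here, since merge() defaults to the columns the two inputs share, but it makes the join keys visible.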
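The problem with the for loop is that every iteration rescans B and C to find the matching rows, so the total work grows with the number of rows in A times the number of rows in B and C; this is why such loops are discouraged in R. merge does the key matching in one vectorized pass and should be much faster. I don't think you can get around merging, because what your logic describes is in fact a merge (a join on date and id).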
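If A has to keep its original row order, the same join can also be sketched with match() on a composite key. This is an alternative way of writing the join, not the merge() call above: the "|" separator is an assumption (it must not occur inside any id), the tables are hand-built copies of the example with rounded y, and key_A and key_C are illustrative names.

```r
# Hand-built copies of the example tables (y rounded as printed in the question).
A <- data.frame(date = c(2, 3, 5, 1), x = NA_real_, id = c("b", "a", "a", "b"),
                stringsAsFactors = FALSE)
B <- data.frame(date = 1:5, y = c(0.9082, 0.2017, 0.8984, 0.9447, 0.6608))
C <- data.frame(date = rep(1:5, 2), z = rep(c(104, 101, 105, 103, 102), 2),
                id = rep(c("a", "b"), 5), stringsAsFactors = FALSE)

# Look up y by date: match() gives, for each A$date, its row position in B.
y <- B$y[match(A$date, B$date)]

# Look up z by the (date, id) pair through a pasted composite key.
key_A <- paste(A$date, A$id, sep = "|")
key_C <- paste(C$date, C$id, sep = "|")
z <- C$z[match(key_A, key_C)]

# Same products as the loop, in A's original row order.
A$x <- y * z
A
```

A key in A with no counterpart in B or C yields NA from match(), so the corresponding x stays NA instead of raising an error as the loop would.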