I would like to speed up the following code in R. This is a loop to define new individuals (column newID) when the variable change is equal to 1. Any idea on how to improve the loop will be greatly appreciated.
Here is the code:
## Build the data frame
dat <- expand.grid(x = 1:1000, ID = as.character(seq(0, 4000, 1)))
dat$change <- 0
dat[which(dat$x == 1), c("change")] <- 1
dat[which(dat$x == 300), c("change")] <- 1
dat[which(dat$x == 700), c("change")] <- 1
dat[1, c("change")] <- 0
## Add a column "newID"
dat$newID <- NA
index <- c(1, which(dat$change == 1), nrow(dat))
j <- 1
i <- 1
system.time(while (j < length(index)){
  print(paste(j, "/", length(index), sep = " "))
  i <- ifelse((j > 1) && (dat[index[j], c("ID")] != dat[index[j - 1], c("ID")]), 1, i)
  ## print(i)
  if(j == length(index) - 1){
    dat[seq(index[j], index[j + 1], by = 1), c("newID")] <- paste("Ind ", dat[index[j], c("ID")], "|", i, sep="")
  } else{
    dat[seq(index[j], index[j + 1] - 1, by = 1), c("newID")] <- paste("Ind ", dat[index[j], c("ID")], "|", i, sep="")
  }
  j <- j + 1
  i <- i + 1
})
## summary(dat)
Here is an example:
The input data frame has 3 columns. In particular, ID is the ID number of each individual and change takes the value of 1 when the individual is renewed.
x ID change
1 1 0 0
2 2 0 0
3 3 0 0
4 4 0 0
5 5 0 1
6 6 0 0
7 7 0 1
8 8 0 0
9 9 0 0
10 10 0 0
11 1 1 1
12 2 1 0
13 3 1 0
14 4 1 0
15 5 1 1
16 6 1 0
17 7 1 1
18 8 1 0
19 9 1 0
20 10 1 0
21 1 2 1
22 2 2 0
23 3 2 0
24 4 2 0
25 5 2 1
26 6 2 0
27 7 2 1
28 8 2 0
29 9 2 0
30 10 2 0
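For experimenting, the small example above can be rebuilt in a few lines. This is only a sketch: it assumes the same construction as the full data set, scaled down to 10 values of x and 3 IDs, with renewals at x = 5 and x = 7. The name `toy` is illustrative and does not appear in the original code.

```r
# Scaled-down version of the data: 10 rows per individual, 3 individuals
toy <- expand.grid(x = 1:10, ID = as.character(0:2))

# Flag a renewal at the first row of each individual and at x = 5 and x = 7
toy$change <- as.integer(toy$x %in% c(1, 5, 7))

# As in the full example, the very first row is not flagged
toy$change[1] <- 0L

head(toy, 12)
```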
The variable newID is created as follows: each time change is equal to 1, a counter is incremented, and newID combines the old ID number with the current value of that counter. Thus, in the example, the expected result is:
x ID change newID
1 1 0 0 Ind 0|1
2 2 0 0 Ind 0|1
3 3 0 0 Ind 0|1
4 4 0 0 Ind 0|1
5 5 0 1 Ind 0|2
6 6 0 0 Ind 0|2
7 7 0 1 Ind 0|3
8 8 0 0 Ind 0|3
9 9 0 0 Ind 0|3
10 10 0 0 Ind 0|3
11 1 1 1 Ind 1|1
12 2 1 0 Ind 1|1
13 3 1 0 Ind 1|1
14 4 1 0 Ind 1|1
15 5 1 1 Ind 1|2
16 6 1 0 Ind 1|2
17 7 1 1 Ind 1|3
18 8 1 0 Ind 1|3
19 9 1 0 Ind 1|3
20 10 1 0 Ind 1|3
21 1 2 1 Ind 2|1
22 2 2 0 Ind 2|1
23 3 2 0 Ind 2|1
24 4 2 0 Ind 2|1
25 5 2 1 Ind 2|2
26 6 2 0 Ind 2|2
27 7 2 1 Ind 2|3
28 8 2 0 Ind 2|3
29 9 2 0 Ind 2|3
30 10 2 0 Ind 2|3
1 Answer
Do not use explicit loops unless you absolutely have to
Many functions in R are vectorized and your code will be much faster if you leverage this instead of writing your own loops.
For example, you can compute the first part of your newID with a single paste() call:
dat$newID <- paste("Ind", dat$ID)
x ID change newID
1 1 0 0 Ind 0
2 2 0 0 Ind 0
3 3 0 0 Ind 0
4 4 0 0 Ind 0
5 5 0 1 Ind 0
6 6 0 0 Ind 0
7 7 0 1 Ind 0
8 8 0 0 Ind 0
9 9 0 0 Ind 0
10 10 0 0 Ind 0
11 1 1 1 Ind 1
12 2 1 0 Ind 1
13 3 1 0 Ind 1
14 4 1 0 Ind 1
15 5 1 1 Ind 1
16 6 1 0 Ind 1
17 7 1 1 Ind 1
18 8 1 0 Ind 1
19 9 1 0 Ind 1
...
The second part of your newID is simply the cumulative sum of change. The trickiest part is resetting the counter each time the ID changes. One way to do this is the function by, which applies a given function to each group of rows defined by a grouping variable (here ID):
by(dat, dat$ID, function(x) {
  cumsum(x$change)
})
dat$ID: 0
[1] 0 0 0 0 1 1 2 2 2 2
---------------------------------------------------------------------------------------------
dat$ID: 1
[1] 1 1 1 1 2 2 3 3 3 3
---------------------------------------------------------------------------------------------
dat$ID: 2
[1] 1 1 1 1 2 2 3 3 3 3
The two issues here are:
- by returns a list, so we have to use unlist() before putting the result in a data.frame column:
dat$newID <- unlist(by(dat, dat$ID, function(x) {
  cumsum(x$change)
}))
- because change doesn't start with 1, the values for the first ID are shifted by one. We can fix this by changing the first value manually.
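One caveat worth making explicit: unlist(by(...)) concatenates the per-group results group by group, so it only lines up with the rows when each ID forms one contiguous block, in the same order as the grouping levels. That holds for the data built with expand.grid above. If the row order cannot be guaranteed, base R's ave() is one alternative (it is not part of the approach described here) that returns the values in the original row order. A minimal sketch on made-up data; `d`, `grouped` and `aligned` are hypothetical names:

```r
# Hypothetical data in which the rows of the two IDs are interleaved
d <- data.frame(ID     = c("a", "b", "a", "b", "a", "b"),
                change = c(1,   1,   0,   1,   1,   0))

# by() + unlist(): all values for "a" first, then all values for "b"
grouped <- unlist(by(d, d$ID, function(g) cumsum(g$change)))

# ave(): per-group cumulative sum, returned in the original row order
aligned <- ave(d$change, d$ID, FUN = cumsum)

unname(grouped)  # 1 1 2 1 2 2  (grouped by ID, not aligned with the rows)
aligned          # 1 1 1 2 2 2  (aligned with the rows of d)
```

With sorted data such as the expand.grid output, both calls give the same vector, so the choice only matters when the input order is uncertain.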
Put everything together
dat$change[1] <- 1
dat$newID <- unlist(by(dat, dat$ID, function(x) {
  cumsum(x$change)
}))
dat$newID <- paste0("Ind ", dat$ID, "|", dat$newID)
And you get exactly the output you were asking for, without any explicit loops!
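As a quick sanity check, the steps above can be run on the 30-row example from the question and compared with the expected newID values listed there. This is a sketch that assumes the scaled-down data is built the same way as the full data set; `toy`, `counter` and `expected` are illustrative names.

```r
# 30-row example from the question: 10 rows per individual, 3 individuals
toy <- expand.grid(x = 1:10, ID = as.character(0:2))

# Renewals at x = 1, 5 and 7; the first row keeps change = 1,
# which corresponds to the manual fix dat$change[1] <- 1
toy$change <- as.integer(toy$x %in% c(1, 5, 7))

# Vectorized solution: per-ID cumulative sum, then a single paste0() call
counter <- unlist(by(toy, toy$ID, function(g) cumsum(g$change)))
toy$newID <- paste0("Ind ", toy$ID, "|", counter)

# Expected result from the question: blocks of 4, 2 and 4 rows per ID
expected <- paste0("Ind ", rep(0:2, each = 10), "|",
                   rep(rep(1:3, times = c(4, 2, 4)), times = 3))

# Stops with an error if the two versions differ
stopifnot(identical(toy$newID, expected))
```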