Speed up a while loop that conditionally creates a new variable in R

Question 1

I would like to speed up the following code in R. This is a loop to define new individuals (column newID) when the variable change is equal to 1. Any idea on how to improve the loop will be greatly appreciated.

Here is the code:

## Build the data frame 
dat <- expand.grid(x = 1:1000, ID = as.character(seq(0, 4000, 1)))
dat$change <- 0
dat[which(dat$x == 1), c("change")] <- 1
dat[which(dat$x == 300), c("change")] <- 1
dat[which(dat$x == 700), c("change")] <- 1
dat[1, c("change")] <- 0
## Add a column "newID"
dat$newID <- NA
index <- c(1, which(dat$change == 1), nrow(dat))
j <- 1
i <- 1
system.time(while (j < length(index)) {
  print(paste(j, "/", length(index), sep = " "))
  i <- ifelse((j > 1) && (dat[index[j], c("ID")] != dat[index[j - 1], c("ID")]), 1, i)
  ## print(i)
  if (j == length(index) - 1) {
    dat[seq(index[j], index[j + 1], by = 1), c("newID")] <- paste("Ind ", dat[index[j], c("ID")], "|", i, sep = "")
  } else {
    dat[seq(index[j], index[j + 1] - 1, by = 1), c("newID")] <- paste("Ind ", dat[index[j], c("ID")], "|", i, sep = "")
  }
  j <- j + 1
  i <- i + 1
})
## summary(dat)

Here is an example:

The input data frame has 3 columns. In particular, ID is the ID number of each individual and change takes the value of 1 when the individual is renewed.

    x ID change
1   1  0      0
2   2  0      0
3   3  0      0
4   4  0      0
5   5  0      1
6   6  0      0
7   7  0      1
8   8  0      0
9   9  0      0
10 10  0      0
11  1  1      1
12  2  1      0
13  3  1      0
14  4  1      0
15  5  1      1
16  6  1      0
17  7  1      1
18  8  1      0
19  9  1      0
20 10  1      0
21  1  2      1
22  2  2      0
23  3  2      0
24  4  2      0
25  5  2      1
26  6  2      0
27  7  2      1
28  8  2      0
29  9  2      0
30 10  2      0
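
A minimal sketch of how the example input above could be generated, assuming a scaled-down version of the build code. The sizes (10 values of x, IDs 0 to 2) and the change points (x = 1, 5 and 7) are inferred from the table and are not stated in the original code:

```r
# Scaled-down build of the example data (sizes and change points are assumptions)
dat <- expand.grid(x = 1:10, ID = as.character(0:2))
dat$change <- 0
dat$change[dat$x %in% c(1, 5, 7)] <- 1
dat$change[1] <- 0  # the very first row is not flagged, as in the full-size code

head(dat, 12)
```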

The variable newID is created as follows: each time change is equal to 1, a counter is incremented, and newID combines the old ID number with that counter. The counter restarts at 1 for each ID. Thus, in the example, the expected result is:

    x ID change   newID
1   1  0      0 Ind 0|1
2   2  0      0 Ind 0|1
3   3  0      0 Ind 0|1
4   4  0      0 Ind 0|1
5   5  0      1 Ind 0|2
6   6  0      0 Ind 0|2
7   7  0      1 Ind 0|3
8   8  0      0 Ind 0|3
9   9  0      0 Ind 0|3
10 10  0      0 Ind 0|3
11  1  1      1 Ind 1|1
12  2  1      0 Ind 1|1
13  3  1      0 Ind 1|1
14  4  1      0 Ind 1|1
15  5  1      1 Ind 1|2
16  6  1      0 Ind 1|2
17  7  1      1 Ind 1|3
18  8  1      0 Ind 1|3
19  9  1      0 Ind 1|3
20 10  1      0 Ind 1|3
21  1  2      1 Ind 2|1
22  2  2      0 Ind 2|1
23  3  2      0 Ind 2|1
24  4  2      0 Ind 2|1
25  5  2      1 Ind 2|2
26  6  2      0 Ind 2|2
27  7  2      1 Ind 2|3
28  8  2      0 Ind 2|3
29  9  2      0 Ind 2|3
30 10  2      0 Ind 2|3

Answer

Do not use explicit loops unless you absolutely have to

Many functions in R are vectorized, and your code will be much faster if you leverage this instead of writing your own loops.

For example, you can compute the first part of your newID with a single paste() call:

dat$newID <- paste("Ind", dat$ID)

    x ID change newID
1   1  0      0 Ind 0
2   2  0      0 Ind 0
3   3  0      0 Ind 0
4   4  0      0 Ind 0
5   5  0      1 Ind 0
6   6  0      0 Ind 0
7   7  0      1 Ind 0
8   8  0      0 Ind 0
9   9  0      0 Ind 0
10 10  0      0 Ind 0
11  1  1      1 Ind 1
12  2  1      0 Ind 1
13  3  1      0 Ind 1
14  4  1      0 Ind 1
15  5  1      1 Ind 1
16  6  1      0 Ind 1
17  7  1      1 Ind 1
18  8  1      0 Ind 1
19  9  1      0 Ind 1
...

The second part of your `newID` is simply the cumulative sum of `change`

The trickiest part here is resetting the counter each time the ID changes. One way to do this is with the function `by()`, which applies a given function to each group of rows defined by a grouping variable (here `ID`):

by(dat, dat$ID, function(x) {
 cumsum(x$change)
})

dat$ID: 0
 [1] 0 0 0 0 1 1 2 2 2 2
--------------------------------------------------------------------------------------------- 
dat$ID: 1
 [1] 1 1 1 1 2 2 3 3 3 3
--------------------------------------------------------------------------------------------- 
dat$ID: 2
 [1] 1 1 1 1 2 2 3 3 3 3

The two issues here are:

`by()` returns a list, so we have to call `unlist()` on the result before assigning it to a data frame column:

dat$newID <- unlist(by(dat, dat$ID, function(x) {
 cumsum(x$change)
}))

Because `change` does not start with 1, the values for the first ID are shifted by one. We can fix this by setting the first value manually.
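
A caveat worth noting: `unlist(by(...))` returns the values grouped by the levels of `ID`, so the assignment only lines up with the rows when the data frame is already sorted by `ID`, as it is here. If that cannot be guaranteed, base R's `ave()` is one possible substitute (not the approach used in this answer), since it returns its result in the original row order. A minimal sketch on a made-up toy data frame `d`:

```r
# Hypothetical toy data with interleaved IDs (not sorted by ID)
d <- data.frame(ID = c("a", "b", "a", "b", "a"),
                change = c(1, 1, 0, 1, 1))

# ave() applies FUN within each group and keeps the original row order
d$counter <- ave(d$change, d$ID, FUN = cumsum)

# unlist(by(...)) returns all of group "a" first, then all of group "b"
grouped <- unlist(by(d, d$ID, function(g) cumsum(g$change)), use.names = FALSE)

d$counter  # 1 1 1 2 2  (row order)
grouped    # 1 1 2 1 2  (group order)
```

On data that is already sorted by the grouping variable, both approaches give the same vector.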

Put everything together

dat$change[1] <- 1
dat$newID <- unlist(by(dat, dat$ID, function(x) {
 cumsum(x$change)
}))
dat$newID <- paste0("Ind ", dat$ID, "|", dat$newID)

And you get exactly the output you were asking for, without any explicit loops!
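
As a quick check, the steps above can be wrapped in a small function and run on the 30-row example from the question. This is only a sketch: the function name `add_new_id` and the scaled-down data are assumptions, and the function works on a local copy so that the original `change` column is left untouched. It also assumes the rows are sorted by `ID`.

```r
# Hypothetical helper: returns dat with a newID column added
add_new_id <- function(dat) {
  tmp <- dat
  tmp$change[1] <- 1  # the first row starts the first individual
  # Per-ID cumulative sum of change; relies on rows being sorted by ID
  counter <- unlist(by(tmp, tmp$ID, function(g) cumsum(g$change)),
                    use.names = FALSE)
  dat$newID <- paste0("Ind ", dat$ID, "|", counter)
  dat
}

# Scaled-down example data (sizes inferred from the question's table)
dat <- expand.grid(x = 1:10, ID = as.character(0:2))
dat$change <- as.numeric(dat$x %in% c(1, 5, 7))
dat$change[1] <- 0

res <- add_new_id(dat)
res$newID[c(1, 5, 7, 11, 30)]
# "Ind 0|1" "Ind 0|2" "Ind 0|3" "Ind 1|1" "Ind 2|3"
```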

Droplet · Accepted Answer · 2019-10-30 08:55:57Z
