I have a long data.table, payoff_mat. I would like to select the rows containing the maximum value of column X by group Y. There are several posts on this topic, and I think each individual step is already well optimized.
The twist is that I would like to repeat this process several times, and I think that part could be done better. This is the hot spot of my script.
This is how it works:
- Subset the rows containing the maximum value of column 3 by group (permutation of column 4-5).
- Subset again by the maximum of column 2 by group (permutation of column 4).
- Subset again by the maximum of column 1.
library(data.table)
library(iterpc)
set.seed(1)
I <- iterpc(table(rnorm(5)), 3, ordered = TRUE, replace = FALSE)
support_n <- getnext(I, d = 180)          # not real size; larger in practice
payoff_n <- matrix(rnorm(180), ncol = 3)  # not real size; larger in practice
payoff_mat <- cbind(payoff_n, support_n)
payoff_mat <- data.table(payoff_mat)
# My current solution: eval(parse(text = ...)) is verbose, but it seems to be the standard syntax for data.table in these cases.
cn <- colnames(payoff_mat)  # need column names; data.table does not like column numbers
step_1 <- payoff_mat[payoff_mat[, .I[which.max(eval(parse(text = cn[3])))], by = eval(cn[4:5])][, V1]]
step_2 <- step_1[step_1[, .I[which.max(eval(parse(text = cn[2])))], by = eval(cn[4])][, V1]]
step_3 <- step_2[, .SD[which.max(eval(parse(text = cn[1])))]]
My intuition suggests I lose a lot of time writing each intermediate result to memory, but I am happy with any improvement.
I have tried calling it as a single command, but have not figured out the correct technique for sequential [] calls. Below is my current best attempt, but it is clearly not equivalent: each inner payoff_mat[...] still computes row numbers against the full table, not against the previous subset.
payoff_mat[payoff_mat[, .I[which.max(eval(parse(text = cn[3])))], by = eval(cn[4:5])][, V1]
          ][payoff_mat[, .I[which.max(eval(parse(text = cn[2])))], by = eval(cn[4])][, V1]
          ][, .SD[which.max(eval(parse(text = cn[1])))]]
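(For completeness: a .SD-based version does fit in a single chained expression, since each [] operates on the result of the previous one; I avoid it because .SD[which.max(...)] is generally reported to be slower than the .I self-join on large tables. A minimal sketch on toy data with the same column layout as the example above:)

```r
library(data.table)
set.seed(1)
# Toy stand-in for payoff_mat: 3 payoff columns, 2 grouping columns
dt <- data.table(V1 = rnorm(12), V2 = rnorm(12), V3 = rnorm(12),
                 V4 = rep(1:2, each = 6), V5 = rep(1:3, 4))

# Each [] subsets the result of the previous one, so the three
# reductions chain naturally in a single expression:
one_call <- dt[, .SD[which.max(V3)], by = .(V4, V5)
             ][, .SD[which.max(V2)], by = V4
             ][which.max(V1)]
```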
1 Answer
I am not a data.table expert, so I can't tell whether there is a faster approach, though https://stackoverflow.com/q/16573995/1201032 suggests you are already using something efficient at each step.
The changes I suggest will, however, make your code a lot easier to read and maintain. First, define a function that can be applied at each step:
max_subset <- function(dt, var, by)
  dt[dt[, .I[which.max(get(var))], by = by][, V1]]
Then, you can make your successive calls as follows:
step <- payoff_mat
step <- max_subset(step, "V3", c("V4", "V5"))
step <- max_subset(step, "V2", c("V4"))
step <- max_subset(step, "V1", c())
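If you prefer to express the three steps as data rather than as three statements, a Reduce-based fold works too. This is a sketch, not from the original answer; it reproduces the max_subset helper so the snippet is self-contained, and uses toy data in place of the real payoff_mat:

```r
library(data.table)

# Same helper as above, reproduced for self-containment
max_subset <- function(dt, var, by)
  dt[dt[, .I[which.max(get(var))], by = by][, V1]]

set.seed(1)
dt <- data.table(V1 = rnorm(12), V2 = rnorm(12), V3 = rnorm(12),
                 V4 = rep(1:2, each = 6), V5 = rep(1:3, 4))

# Each element names the target column and its grouping
# (NULL = no grouping, for the final step)
steps <- list(list(var = "V3", by = c("V4", "V5")),
              list(var = "V2", by = "V4"),
              list(var = "V1", by = NULL))

step3 <- Reduce(function(d, s) max_subset(d, s$var, s$by), steps, init = dt)
```

Whether this is clearer than three explicit calls is a matter of taste; it does make it trivial to add or reorder reduction steps.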
Or using the magrittr package:
library(magrittr)
payoff_mat %>% max_subset("V3", c("V4", "V5")) %>%
max_subset("V2", c("V4")) %>%
max_subset("V1", c()) -> step3
Note that assigning intermediate outputs to a variable in the first approach (step <- ...) should not add any significant time (feel free to test). And your code will be a lot more readable than using nested calls: max_subset(max_subset(max_subset(...), ...), ...).
- \$\begingroup\$ Thank you for the answer. I think I would like to get it in one call. This is just a minimum working example; the intermediate output is actually very large (something like 8+ GB of RAM for the first step), so I think writing it to memory is probably costly. \$\endgroup\$ – RegressForward, Sep 24, 2017 at 19:35
- \$\begingroup\$ Would you please test both approaches and circle back? I don't think your assumption that y <- f(g(h(x))) is less costly than y <- x; y <- h(y); y <- g(y); y <- f(y) is correct. When you do y <- f(g(h(x))), h(x) has to be computed in memory first. Assigning that memory space to a symbol (y) is not going to be significant memory-wise or time-wise. \$\endgroup\$ – flodel, Sep 24, 2017 at 21:15
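The equivalence half of that claim is easy to check directly. A minimal sketch with hypothetical toy functions (not the question's pipeline); timings would need system.time() or the microbenchmark package on realistically sized data:

```r
f <- function(x) x + 1
g <- function(x) x * 2
h <- function(x) x - 3

x <- 1:10

# Nested single call
nested <- f(g(h(x)))

# Stepwise with an intermediate variable; each assignment only
# rebinds the name, it does not copy the result again
stepwise <- x
stepwise <- h(stepwise)
stepwise <- g(stepwise)
stepwise <- f(stepwise)
```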