I have a long data.table, payoff_mat. I would like to select the rows containing the maximum value of column X by group Y. There are several posts on this topic, and I think each individual step is already well optimized.
The twist is that I would like to repeat this process several times, and I think that part could be done better. This is the hot spot of my script.
This is how it works:
- Subset the rows containing the maximum value of column 3 by group (permutation of column 4-5).
- Subset again by the maximum of column 2 by group (permutation of column 4).
- Subset again by the maximum of column 1.
library(data.table)
library(iterpc)
set.seed(1)
I <- iterpc(table(rnorm(5)), 3, ordered = TRUE, replace = FALSE)
support_n <- getnext(I, d = 180)          # not real size; larger in practice
payoff_n <- matrix(rnorm(180), ncol = 3)  # not real size; larger in practice
payoff_mat <- cbind(payoff_n, support_n)
payoff_mat <- data.table(payoff_mat)
# My current solution: eval(parse(text = ...)) is verbose, but it seems to be the standard syntax for data.table in these cases.
cn <- colnames(payoff_mat)  # need column names; data.table does not like column numbers
step_1 <- payoff_mat[payoff_mat[, .I[which.max(eval(parse(text = cn[3])))], by = eval(cn[4:5])][, V1]]
step_2 <- step_1[step_1[, .I[which.max(eval(parse(text = cn[2])))], by = eval(cn[4])][, V1]]
step_3 <- step_2[, .SD[which.max(eval(parse(text = cn[1])))]]
My intuition suggests I lose a lot of time writing each intermediate result to memory, but I am happy with any improvement.
I have tried calling it as a single command, but have not figured out the correct technique for sequential [] calls. Below is my current best attempt, but it is clearly not equivalent: each inner payoff_mat[...] still computes row numbers against the full table, not against the previous subset.
payoff_mat[payoff_mat[, .I[which.max(eval(parse(text = cn[3])))], by = eval(cn[4:5])][, V1]
          ][payoff_mat[, .I[which.max(eval(parse(text = cn[2])))], by = eval(cn[4])][, V1]
          ][, .SD[which.max(eval(parse(text = cn[1])))]]
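(For completeness: a .SD-based version does fit in a single chained expression, since each [] operates on the result of the previous one; I avoid it because .SD[which.max(...)] is generally reported to be slower than the .I self-join on large tables. A minimal sketch on toy data with the same column layout as the example above:)

```r
library(data.table)
set.seed(1)
# Toy stand-in for payoff_mat: 3 payoff columns, 2 grouping columns
dt <- data.table(V1 = rnorm(12), V2 = rnorm(12), V3 = rnorm(12),
                 V4 = rep(1:2, each = 6), V5 = rep(1:3, 4))

# Each [] subsets the result of the previous one, so the three
# reductions chain naturally in a single expression:
one_call <- dt[, .SD[which.max(V3)], by = .(V4, V5)
             ][, .SD[which.max(V2)], by = V4
             ][which.max(V1)]
```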
1 Answer
I am not a data.table expert, so I can't tell whether there is a faster approach, though https://stackoverflow.com/q/16573995/1201032 suggests you are already using something efficient at each step.
The changes I suggest will, however, make your code a lot easier to read and maintain. First, define a function that can be applied at each step:
max_subset <- function(dt, var, by)
  dt[dt[, .I[which.max(get(var))], by = by][, V1]]
Then, you can make your successive calls as follows:
step <- payoff_mat
step <- max_subset(step, "V3", c("V4", "V5"))
step <- max_subset(step, "V2", c("V4"))
step <- max_subset(step, "V1", c())
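If you prefer to express the three steps as data rather than as three statements, a Reduce-based fold works too. This is a sketch, not from the original answer; it reproduces the max_subset helper so the snippet is self-contained, and uses toy data in place of the real payoff_mat:

```r
library(data.table)

# Same helper as above, reproduced for self-containment
max_subset <- function(dt, var, by)
  dt[dt[, .I[which.max(get(var))], by = by][, V1]]

set.seed(1)
dt <- data.table(V1 = rnorm(12), V2 = rnorm(12), V3 = rnorm(12),
                 V4 = rep(1:2, each = 6), V5 = rep(1:3, 4))

# Each element names the target column and its grouping
# (NULL = no grouping, for the final step)
steps <- list(list(var = "V3", by = c("V4", "V5")),
              list(var = "V2", by = "V4"),
              list(var = "V1", by = NULL))

step3 <- Reduce(function(d, s) max_subset(d, s$var, s$by), steps, init = dt)
```

Whether this is clearer than three explicit calls is a matter of taste; it does make it trivial to add or reorder reduction steps.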
Or using the magrittr package:
library(magrittr)
payoff_mat %>% max_subset("V3", c("V4", "V5")) %>%
max_subset("V2", c("V4")) %>%
max_subset("V1", c()) -> step3
Note that assigning intermediate outputs to a variable in the first approach (step <- ...) should not add any significant time (feel free to test). And your code will be a lot more readable than using nested calls: max_subset(max_subset(max_subset(...), ...), ...).
- \$\begingroup\$ Thank you for the answer. I think I would like to get it in one call. This is just a minimum working example; the intermediate output is actually very large (something like 8+ GB of RAM for the first step), so I think writing it to memory is probably costly. \$\endgroup\$ – RegressForward, Sep 24, 2017 at 19:35
- \$\begingroup\$ Would you please test both approaches and circle back? I don't think your assumption that y <- f(g(h(x))) is less costly than y <- x; y <- h(y); y <- g(y); y <- f(y) is correct. When you do y <- f(g(h(x))), h(x) has to be computed in memory first. Assigning that memory space to a symbol (y) is not going to be significant memory-wise or time-wise. \$\endgroup\$ – flodel, Sep 24, 2017 at 21:15
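The equivalence half of that claim is easy to check directly. A minimal sketch with hypothetical toy functions (not the question's pipeline); timings would need system.time() or the microbenchmark package on realistically sized data:

```r
f <- function(x) x + 1
g <- function(x) x * 2
h <- function(x) x - 3

x <- 1:10

# Nested single call
nested <- f(g(h(x)))

# Stepwise with an intermediate variable; each assignment only
# rebinds the name, it does not copy the result again
stepwise <- x
stepwise <- h(stepwise)
stepwise <- g(stepwise)
stepwise <- f(stepwise)
```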