1
\$\begingroup\$

I was trying to see if data.table could speed up a gsub pattern matching function over a list.

Data for reprex. It's a list of 3 data frames with some asterisks placed here and there. Each data frame is 6500 rows, 2 columns, and generally representative of my actual data. My data does have multiple columns per data frame that need to be looped over, which is why I'm using the mapply.

library(data.table)
library(microbenchmark)
 df1 <- data.frame(name = rep(LETTERS, 250), code = rep(letters, 250), stringsAsFactors = FALSE)
df1$name[df1$name == "D" | df1$name == "F" | df1$name == "L"] <- "foo*"
df1$code[df1$code == "d" | df1$code == "f" | df1$code == "l"] <- "*foo"
df2 <- data.frame(name = rep(LETTERS, 250), code = rep(letters, 250), stringsAsFactors = FALSE)
df2$name[df2$name == "A" | df2$name == "R" | df2$name == "T"] <- "foo*"
df2$code[df2$code == "a" | df2$code == "r" | df2$code == "t"] <- "*foo*"
df3 <- data.frame(name = rep(LETTERS, 250), code = rep(letters, 250), stringsAsFactors = FALSE)
df3$name[df3$name == "C" | df3$name == "Q" | df3$name == "W"] <- "foo*"
df3$code[df3$code == "c" | df3$code == "q" | df3$code == "w"] <- "*f*oo"
df <- list(df1, df2, df3)
dt <- lapply(df, as.data.table)

In this example, I am trying to remove any * symbols from character strings. First function was just using an mapply and gsub. It deletes any *, looping over elements. Second was an attempt to do it using the data.table library.

mapply.remove.asterisk = function(x){
 df2 <- data.frame(mapply(gsub, "\\*", "", x, perl = TRUE))
 colnames(df2) <- colnames(x)
}
dt.remove.asterisk = function (x) {
 x[, lapply(.SD, function(x) gsub("\\*", "", x, perl = TRUE))]
}

Testing them out doesn't show a big difference, but the mapply is slightly slower.

mapgsubtest = function(x) {
 df.test <- lapply(x, mapply.remove.asterisk)
}
dtgsubtest = function(x) {
 dt.test <- lapply(x, dt.remove.asterisk)
}
microbenchmark(mapgsubtest(df), dtgsubtest(dt), neval = 100)
Unit: nanoseconds
 expr min lq mean median uq max neval
 mapgsubtest(df) 7161991 7388846 7780101.83 7483794 7651907 27860732 100
 dtgsubtest(dt) 6759663 6991926 7181127.95 7109710 7275418 10102686 100
 neval 0 0 12.26 0 1 902 100

Is there something I'm doing within data.table that could be improved? I tried to see if a few things sped everything up, like having * only at the end of strings (only foo*), using an end of string regex anchor $, and setting an index key. Nothing changed noticeably.

asked Oct 19, 2018 at 16:34
\$\endgroup\$
2
  • 1
    \$\begingroup\$ I'm curious why you need to make your code faster (6500 rows and your current computation times seem pretty small), could you elaborate? That being said, using gsub with the fixed = TRUE option (for which the first argument should be changed to "*") will give you that x10 improvement you are probably after. Also note that when using microbenchmark, you should be using times = 100 rather than neval = 100. \$\endgroup\$ Commented Oct 23, 2018 at 2:15
  • \$\begingroup\$ I just chose that to primarily be representative for the pattern matching problem. My working datasets are larger, but I didn't want to post an example on the order of MBs, so I kept it smallish. You are correct, using fixed vs perl did speed things up. Thank you for that suggestion. This is just part of some code maintenance I'm working on, and one of the bigger bottlenecks, apart from the LaTeX. It initially lacked the perl = T, which sped it up, so I was curious if data.table could improve it at all. \$\endgroup\$ Commented Oct 23, 2018 at 16:08

1 Answer 1

2
\$\begingroup\$

Is there a reason that you are using mapply to gsub through the number columns as well? You can just replace in the first column if that is all you need, which gets some speed improvement, about 2x on my machine. I also tried using stringi instead of gsub but it was not faster. This is also a speedup on the order of milliseconds though!

library(data.table)
#> Warning: package 'data.table' was built under R version 3.5.1
library(microbenchmark)
#> Warning: package 'microbenchmark' was built under R version 3.5.1
library(stringi)
df1 <- data.frame(name = rep(LETTERS, 250), number = rep(c(1:26), 250), stringsAsFactors = FALSE)
df1$name[df1$name == "D" | df1$name == "F" | df1$name == "L"] <- "foo*"
df2 <- data.frame(name = rep(LETTERS, 250), number = rep(c(1:26), 250), stringsAsFactors = FALSE)
df2$name[df2$name == "A" | df2$name == "R" | df2$name == "T"] <- "*foo*"
df3 <- data.frame(name = rep(LETTERS, 250), number = rep(c(1:26), 250), stringsAsFactors = FALSE)
df3$name[df3$name == "C" | df3$name == "Q" | df3$name == "W"] <- "f*oo"
df <- list(df1, df2, df3)
dt <- lapply(df, as.data.table)
mapply.remove.asterisk = function(x){
 df2 <- data.frame(mapply(gsub, "\\*", "", x, perl = TRUE))
 colnames(df2) <- colnames(x)
}
dt.remove.asterisk = function (x) {
 x[, lapply(.SD, function(x) gsub("\\*", "", x, perl = TRUE))]
}
stringi.remove.asterisk = function (x) {
 out <- x
 out$name <- stri_replace_all_regex(x$name, "\\*", "")
 out
}
gsub.remove.asterisk = function(x) {
 out <- x
 out$name <- gsub("\\*", "", x$name)
 out
}
mapgsubtest = function(x) {
 df.test <- lapply(x, mapply.remove.asterisk)
}
dtgsubtest = function(x) {
 dt.test <- lapply(x, dt.remove.asterisk)
}
strisubtest = function(x) {
 str.test <- lapply(x, stringi.remove.asterisk)
}
gsubtest = function(x){
 gsub.test <- lapply(x, gsub.remove.asterisk)
}
microbenchmark(mapgsubtest(df), dtgsubtest(dt), strisubtest(df), gsubtest(df))
#> Unit: milliseconds
#> expr min lq mean median uq max
#> mapgsubtest(df) 8.031179 8.789332 9.429985 9.164945 9.740215 12.913776
#> dtgsubtest(dt) 7.276307 7.867076 8.553440 8.217892 8.855339 22.473660
#> strisubtest(df) 8.149333 8.745572 9.391304 9.221469 9.846153 13.492875
#> gsubtest(df) 4.153983 4.667258 5.053101 4.789789 5.231771 9.813332
#> neval
#> 100
#> 100
#> 100
#> 100

Created on 2018年10月20日 by the reprex package (v0.2.0).

answered Oct 20, 2018 at 20:32
\$\endgroup\$
3
  • \$\begingroup\$ Thanks for your answer. I am using mapply because I have multiple columns per data frame where I'm removing the *. \$\endgroup\$ Commented Oct 22, 2018 at 14:08
  • \$\begingroup\$ You might wish to include that in your sample data, because it becomes important if you are trying to profile for speed \$\endgroup\$ Commented Oct 22, 2018 at 17:30
  • \$\begingroup\$ Thank you for the suggestion. I have revised my example data to reflect that. \$\endgroup\$ Commented Oct 22, 2018 at 18:46

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.