I was trying to see if `data.table` could speed up a `gsub` pattern-matching function over a list.

Data for the reprex: it's a list of 3 data frames with some asterisks placed here and there. Each data frame is 6500 rows and 2 columns, and is generally representative of my actual data. My data does have multiple columns per data frame that need to be looped over, which is why I'm using `mapply`.
library(data.table)
library(microbenchmark)
df1 <- data.frame(name = rep(LETTERS, 250), code = rep(letters, 250), stringsAsFactors = FALSE)
df1$name[df1$name == "D" | df1$name == "F" | df1$name == "L"] <- "foo*"
df1$code[df1$code == "d" | df1$code == "f" | df1$code == "l"] <- "*foo"
df2 <- data.frame(name = rep(LETTERS, 250), code = rep(letters, 250), stringsAsFactors = FALSE)
df2$name[df2$name == "A" | df2$name == "R" | df2$name == "T"] <- "foo*"
df2$code[df2$code == "a" | df2$code == "r" | df2$code == "t"] <- "*foo*"
df3 <- data.frame(name = rep(LETTERS, 250), code = rep(letters, 250), stringsAsFactors = FALSE)
df3$name[df3$name == "C" | df3$name == "Q" | df3$name == "W"] <- "foo*"
df3$code[df3$code == "c" | df3$code == "q" | df3$code == "w"] <- "*f*oo"
df <- list(df1, df2, df3)
dt <- lapply(df, as.data.table)
In this example, I am trying to remove any `*` symbols from character strings. The first function just uses `mapply` and `gsub`; it deletes any `*`, looping over elements. The second is an attempt to do the same using the `data.table` package.
mapply.remove.asterisk = function(x) {
  df2 <- data.frame(mapply(gsub, "\\*", "", x, perl = TRUE))
  colnames(df2) <- colnames(x)
  df2  # return the renamed data frame; without this line the function returns colnames(x) invisibly
}
dt.remove.asterisk = function(x) {
  x[, lapply(.SD, function(x) gsub("\\*", "", x, perl = TRUE))]
}
Testing them out doesn't show a big difference, but the `mapply` version is slightly slower.
mapgsubtest = function(x) {
  df.test <- lapply(x, mapply.remove.asterisk)
}
dtgsubtest = function(x) {
  dt.test <- lapply(x, dt.remove.asterisk)
}
microbenchmark(mapgsubtest(df), dtgsubtest(dt), neval = 100)

Unit: nanoseconds
            expr     min      lq       mean  median      uq      max neval
 mapgsubtest(df) 7161991 7388846 7780101.83 7483794 7651907 27860732   100
  dtgsubtest(dt) 6759663 6991926 7181127.95 7109710 7275418 10102686   100
           neval       0       0      12.26       0       1      902   100
Is there something I'm doing within `data.table` that could be improved? I tried a few things to see if they sped everything up: having `*` only at the end of strings (only `foo*`), using an end-of-string regex anchor `$`, and setting an index key. Nothing changed noticeably.
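For reference, those variants looked roughly like the following sketch (illustrative names, not my exact code; `setindex()` is `data.table`'s secondary-index mechanism):

```r
library(data.table)

# Sketch of the variants described above (names are illustrative)
dt.remove.asterisk.anchored <- function(x) {
  # anchored pattern: only strips a trailing "*"
  x[, lapply(.SD, function(col) gsub("\\*$", "", col, perl = TRUE))]
}

dt1 <- data.table(name = c("foo*", "bar"), code = c("baz*", "qux"))
setindex(dt1, name)  # secondary index; gsub still scans every value, so this shouldn't help
dt.remove.asterisk.anchored(dt1)
```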
1 Answer
Is there a reason that you are using `mapply` to `gsub` through the number columns as well? You can just replace in the first column if that is all you need, which gets some speed improvement, about 2x on my machine. I also tried using `stringi` instead of `gsub`, but it was not faster. This is still a speedup on the order of milliseconds, though!
library(data.table)
#> Warning: package 'data.table' was built under R version 3.5.1
library(microbenchmark)
#> Warning: package 'microbenchmark' was built under R version 3.5.1
library(stringi)
df1 <- data.frame(name = rep(LETTERS, 250), number = rep(c(1:26), 250), stringsAsFactors = FALSE)
df1$name[df1$name == "D" | df1$name == "F" | df1$name == "L"] <- "foo*"
df2 <- data.frame(name = rep(LETTERS, 250), number = rep(c(1:26), 250), stringsAsFactors = FALSE)
df2$name[df2$name == "A" | df2$name == "R" | df2$name == "T"] <- "*foo*"
df3 <- data.frame(name = rep(LETTERS, 250), number = rep(c(1:26), 250), stringsAsFactors = FALSE)
df3$name[df3$name == "C" | df3$name == "Q" | df3$name == "W"] <- "f*oo"
df <- list(df1, df2, df3)
dt <- lapply(df, as.data.table)
mapply.remove.asterisk = function(x) {
  df2 <- data.frame(mapply(gsub, "\\*", "", x, perl = TRUE))
  colnames(df2) <- colnames(x)
  df2  # return the renamed data frame; without this line the function returns colnames(x) invisibly
}
dt.remove.asterisk = function(x) {
  x[, lapply(.SD, function(x) gsub("\\*", "", x, perl = TRUE))]
}
stringi.remove.asterisk = function(x) {
  out <- x
  out$name <- stri_replace_all_regex(x$name, "\\*", "")
  out
}
gsub.remove.asterisk = function(x) {
  out <- x
  out$name <- gsub("\\*", "", x$name)
  out
}
mapgsubtest = function(x) {
  df.test <- lapply(x, mapply.remove.asterisk)
}
dtgsubtest = function(x) {
  dt.test <- lapply(x, dt.remove.asterisk)
}
strisubtest = function(x) {
  str.test <- lapply(x, stringi.remove.asterisk)
}
gsubtest = function(x) {
  gsub.test <- lapply(x, gsub.remove.asterisk)
}
microbenchmark(mapgsubtest(df), dtgsubtest(dt), strisubtest(df), gsubtest(df))
#> Unit: milliseconds
#>             expr      min       lq     mean   median       uq       max neval
#>  mapgsubtest(df) 8.031179 8.789332 9.429985 9.164945 9.740215 12.913776   100
#>   dtgsubtest(dt) 7.276307 7.867076 8.553440 8.217892 8.855339 22.473660   100
#>  strisubtest(df) 8.149333 8.745572 9.391304 9.221469 9.846153 13.492875   100
#>     gsubtest(df) 4.153983 4.667258 5.053101 4.789789 5.231771  9.813332   100
Created on 2018-10-20 by the reprex package (v0.2.0).
- Thanks for your answer. I am using `mapply` because I have multiple columns per data frame where I'm removing the `*`. – Anonymous coward, Oct 22, 2018 at 14:08
- You might wish to include that in your sample data, because it becomes important if you are trying to profile for speed. – Calum You, Oct 22, 2018 at 17:30
- Thank you for the suggestion. I have revised my example data to reflect that. – Anonymous coward, Oct 22, 2018 at 18:46
- `gsub` with the `fixed = TRUE` option (for which the first argument should be changed to `"*"`) will give you that x10 improvement you are probably after. Also note that when using `microbenchmark`, you should be using `times = 100` rather than `neval = 100`.
- Using `fixed` instead of `perl` did speed things up. Thank you for that suggestion. This is just part of some code maintenance I'm working on, and one of the bigger bottlenecks, apart from the LaTeX. It initially lacked the `perl = TRUE` option, which when added sped it up, so I was curious if `data.table` could improve it at all.
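To make the `fixed = TRUE` suggestion concrete, here is a minimal sketch (the function name is illustrative) of the literal-match variant, with repetitions controlled by `times` rather than `neval`:

```r
library(data.table)
library(microbenchmark)

# fixed = TRUE does a literal (non-regex) match, so the pattern is a plain "*"
dt.remove.asterisk.fixed <- function(x) {
  x[, lapply(.SD, function(col) gsub("*", "", col, fixed = TRUE))]
}

dt1 <- data.table(name = c("foo*", "*foo"), code = c("*f*oo", "bar"))
dt.remove.asterisk.fixed(dt1)

# `times` (not `neval`) is microbenchmark's argument for the number of repetitions
microbenchmark(dt.remove.asterisk.fixed(dt1), times = 100)
```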