I was trying to see if `data.table` could speed up a `gsub` pattern-matching function over a list.

Data for the reprex: it's a list of 3 data frames with some asterisks placed here and there. Each data frame is 6500 rows and 2 columns, and is generally representative of my actual data. My data does have multiple columns per data frame that need to be looped over, which is why I'm using `mapply`.
library(data.table)
library(microbenchmark)
df1 <- data.frame(name = rep(LETTERS, 250), code = rep(letters, 250), stringsAsFactors = FALSE)
df1$name[df1$name == "D" | df1$name == "F" | df1$name == "L"] <- "foo*"
df1$code[df1$code == "d" | df1$code == "f" | df1$code == "l"] <- "*foo"
df2 <- data.frame(name = rep(LETTERS, 250), code = rep(letters, 250), stringsAsFactors = FALSE)
df2$name[df2$name == "A" | df2$name == "R" | df2$name == "T"] <- "foo*"
df2$code[df2$code == "a" | df2$code == "r" | df2$code == "t"] <- "*foo*"
df3 <- data.frame(name = rep(LETTERS, 250), code = rep(letters, 250), stringsAsFactors = FALSE)
df3$name[df3$name == "C" | df3$name == "Q" | df3$name == "W"] <- "foo*"
df3$code[df3$code == "c" | df3$code == "q" | df3$code == "w"] <- "*f*oo"
df <- list(df1, df2, df3)
dt <- lapply(df, as.data.table)
In this example, I am trying to remove any `*` symbols from character strings. The first function just uses `mapply` and `gsub`; it deletes any `*`, looping over elements. The second is an attempt to do the same using the `data.table` package.
mapply.remove.asterisk = function(x) {
  df2 <- data.frame(mapply(gsub, "\\*", "", x, perl = TRUE))
  colnames(df2) <- colnames(x)
  df2  # return the renamed data frame; without this line the function returns colnames(x) invisibly
}
dt.remove.asterisk = function(x) {
  x[, lapply(.SD, function(x) gsub("\\*", "", x, perl = TRUE))]
}
Testing them out doesn't show a big difference, but the `mapply` version is slightly slower.
mapgsubtest = function(x) {
  df.test <- lapply(x, mapply.remove.asterisk)
}
dtgsubtest = function(x) {
  dt.test <- lapply(x, dt.remove.asterisk)
}
microbenchmark(mapgsubtest(df), dtgsubtest(dt), neval = 100)

Unit: nanoseconds
            expr     min      lq       mean  median      uq      max neval
 mapgsubtest(df) 7161991 7388846 7780101.83 7483794 7651907 27860732   100
  dtgsubtest(dt) 6759663 6991926 7181127.95 7109710 7275418 10102686   100
           neval       0       0      12.26       0       1      902   100
Is there something I'm doing within `data.table` that could be improved? I tried a few things to see if they sped everything up: having `*` only at the end of strings (only `foo*`), using an end-of-string regex anchor `$`, and setting an index key. Nothing changed noticeably.
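For reference, those variants looked roughly like the following sketch (illustrative names, not my exact code; `setindex()` is `data.table`'s secondary-index mechanism):

```r
library(data.table)

# Sketch of the variants described above (names are illustrative)
dt.remove.asterisk.anchored <- function(x) {
  # anchored pattern: only strips a trailing "*"
  x[, lapply(.SD, function(col) gsub("\\*$", "", col, perl = TRUE))]
}

dt1 <- data.table(name = c("foo*", "bar"), code = c("baz*", "qux"))
setindex(dt1, name)  # secondary index; gsub still scans every value, so this shouldn't help
dt.remove.asterisk.anchored(dt1)
```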
1 Answer
Is there a reason that you are using `mapply` to `gsub` through the number columns as well? You can just replace in the first column if that is all you need, which gets some speed improvement, about 2x on my machine. I also tried using `stringi` instead of `gsub`, but it was not faster. This is still a speedup on the order of milliseconds, though!
library(data.table)
#> Warning: package 'data.table' was built under R version 3.5.1
library(microbenchmark)
#> Warning: package 'microbenchmark' was built under R version 3.5.1
library(stringi)
df1 <- data.frame(name = rep(LETTERS, 250), number = rep(c(1:26), 250), stringsAsFactors = FALSE)
df1$name[df1$name == "D" | df1$name == "F" | df1$name == "L"] <- "foo*"
df2 <- data.frame(name = rep(LETTERS, 250), number = rep(c(1:26), 250), stringsAsFactors = FALSE)
df2$name[df2$name == "A" | df2$name == "R" | df2$name == "T"] <- "*foo*"
df3 <- data.frame(name = rep(LETTERS, 250), number = rep(c(1:26), 250), stringsAsFactors = FALSE)
df3$name[df3$name == "C" | df3$name == "Q" | df3$name == "W"] <- "f*oo"
df <- list(df1, df2, df3)
dt <- lapply(df, as.data.table)
mapply.remove.asterisk = function(x) {
  df2 <- data.frame(mapply(gsub, "\\*", "", x, perl = TRUE))
  colnames(df2) <- colnames(x)
  df2  # return the renamed data frame; without this line the function returns colnames(x) invisibly
}
dt.remove.asterisk = function(x) {
  x[, lapply(.SD, function(x) gsub("\\*", "", x, perl = TRUE))]
}
stringi.remove.asterisk = function(x) {
  out <- x
  out$name <- stri_replace_all_regex(x$name, "\\*", "")
  out
}
gsub.remove.asterisk = function(x) {
  out <- x
  out$name <- gsub("\\*", "", x$name)
  out
}
mapgsubtest = function(x) {
  df.test <- lapply(x, mapply.remove.asterisk)
}
dtgsubtest = function(x) {
  dt.test <- lapply(x, dt.remove.asterisk)
}
strisubtest = function(x) {
  str.test <- lapply(x, stringi.remove.asterisk)
}
gsubtest = function(x) {
  gsub.test <- lapply(x, gsub.remove.asterisk)
}
microbenchmark(mapgsubtest(df), dtgsubtest(dt), strisubtest(df), gsubtest(df))
#> Unit: milliseconds
#>             expr      min       lq     mean   median       uq       max neval
#>  mapgsubtest(df) 8.031179 8.789332 9.429985 9.164945 9.740215 12.913776   100
#>   dtgsubtest(dt) 7.276307 7.867076 8.553440 8.217892 8.855339 22.473660   100
#>  strisubtest(df) 8.149333 8.745572 9.391304 9.221469 9.846153 13.492875   100
#>     gsubtest(df) 4.153983 4.667258 5.053101 4.789789 5.231771  9.813332   100
Created on 2018-10-20 by the reprex package (v0.2.0).
- Thanks for your answer. I am using `mapply` because I have multiple columns per data frame where I'm removing the `*`. – Anonymous coward, Oct 22, 2018 at 14:08
- You might wish to include that in your sample data, because it becomes important if you are trying to profile for speed. – Calum You, Oct 22, 2018 at 17:30
- Thank you for the suggestion. I have revised my example data to reflect that. – Anonymous coward, Oct 22, 2018 at 18:46
- `gsub` with the `fixed = TRUE` option (for which the first argument should be changed to `"*"`) will give you that x10 improvement you are probably after. Also note that when using `microbenchmark`, you should be using `times = 100` rather than `neval = 100`.
- Using `fixed` instead of `perl` did speed things up. Thank you for that suggestion. This is just part of some code maintenance I'm working on, and one of the bigger bottlenecks, apart from the LaTeX. It initially lacked the `perl = TRUE` option, which when added sped it up, so I was curious if `data.table` could improve it at all.
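To make the `fixed = TRUE` suggestion concrete, here is a minimal sketch (the function name is illustrative) of the literal-match variant, with repetitions controlled by `times` rather than `neval`:

```r
library(data.table)
library(microbenchmark)

# fixed = TRUE does a literal (non-regex) match, so the pattern is a plain "*"
dt.remove.asterisk.fixed <- function(x) {
  x[, lapply(.SD, function(col) gsub("*", "", col, fixed = TRUE))]
}

dt1 <- data.table(name = c("foo*", "*foo"), code = c("*f*oo", "bar"))
dt.remove.asterisk.fixed(dt1)

# `times` (not `neval`) is microbenchmark's argument for the number of repetitions
microbenchmark(dt.remove.asterisk.fixed(dt1), times = 100)
```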