I want to find a quick way to see whether a matrix M has at least one value that is, say, 2. In R, I would use any(M==2). However, this first computes M==2 for all values in M, then applies any(). any() stops the first time a TRUE value is found, but that still means we evaluated far too many M==2 conditions.
I thought one could find a more efficient way, computing M==2 only as long as it is not satisfied. I tried to write a function to do this (either column-wise, check(), or on each element of M, check_2()), but so far it is much slower. Any idea on how to improve this?
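To make the concern concrete, here is a small sketch (on a toy 5 x 4 matrix chosen purely for illustration) showing that the comparison produces a full logical matrix before any() ever sees it:

```r
M_small <- matrix(1:20, ncol = 4)   # toy matrix, assumed for illustration only
cmp <- M_small == 2                 # one comparison per element of M_small
dim(cmp)                            # same dimensions as M_small: 5 4
sum(cmp)                            # exactly one element equals 2
any(cmp)                            # TRUE
```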
Benchmark results, with the value Val located near the end of the matrix:
|expr |mean time |
|:------------------|---------:|
|any(M == Val) | 14.13623|
|is.element(Val, M) | 17.71230|
|check(M, Val) | 18.20764|
|check_2(M, Val) | 486.65347|
Code:
x <- 1:10^6
M <- matrix(x, ncol = 10, byrow=TRUE)
Val <- 50000
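# Column-wise check: stop at the first column that contains Val
check <- function(x, Val) {
  i <- 1
  cond <- FALSE
  # && is the scalar, short-circuiting operator appropriate for a loop condition
  while (!cond && i <= ncol(x)) {
    cond <- any(x[, i] == Val)  # use the argument x, not the global M
    i <- i + 1
  }
  cond
}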
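# Element-wise check: stop at the first element equal to Val
check_2 <- function(x, Val) {
  x_c <- c(x)  # flatten the matrix into a vector
  i <- 1
  cond <- FALSE
  while (!cond && i <= length(x_c)) {
    cond <- x_c[i] == Val
    i <- i + 1
  }
  cond
}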
check_2(x=M, Val)
check(M, Val)
library(microbenchmark)
comp <- microbenchmark(any(M == Val),
                       is.element(Val, M),
                       check(M, Val),
                       check_2(M, Val),
                       times = 20)
comp
1 Answer
any is a primitive: it doesn't loop in R but in C, which is much, much faster. Loops in R are quite slow, which is why it's important to use vectorized functions if you care about speed (the apply functions are still loops, however).
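As a quick sanity check of the "primitive" claim, base R's is.primitive() reports which functions are implemented internally. This is only a small illustration; is.element is included because it appears in the question's benchmark:

```r
is.primitive(any)          # TRUE: implemented internally, no R-level loop
is.primitive(`==`)         # TRUE: the comparison is vectorized internally too
is.primitive(is.element)   # FALSE: an ordinary R function (a wrapper around match())
```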
One way to speed things up is to use the Rcpp package to write C++ code from within R. When you have a slow R function that uses simple loops, it's the way to go. It's still not as fast as C, but in our case maybe that will be enough, given that we don't need to go through the whole vector?
Let's check:
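library(Rcpp)
# defines anyx_cpp
cppFunction(
'bool anyx_cpp(const NumericVector x, const double y) {
  const R_xlen_t n = x.size();
  // C++ indices start at 0, so start there to avoid skipping the first element
  for (R_xlen_t i = 0; i < n; i++) {
    if (x[i] == y) {
      return true;  // early exit on the first match
    }
  }
  return false;
}')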
anyx_r <- function(x, y) {
  for (x_ in x) if (x_ == y) return(TRUE)
  FALSE
}
vec <- 1:1e7
x <- 5e6
microbenchmark::microbenchmark(
  rloop  = anyx_r(vec, x),
  cpp    = anyx_cpp(vec, x),
  native = any(vec == x)
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# rloop 166.5758 171.34355 203.15277 179.9776 198.8560 990.1650 100
# cpp 39.5462 40.60585 57.84617 41.4594 46.1232 690.1746 100
# native 36.9900 37.86090 51.80317 38.9640 43.6510 888.3059 100
Almost but not quite ;).
So, bottom line: in general you can trust vectorized R functions, even if at first sight they seem to be doing too much work.
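If early exit still matters, one possible middle ground is to keep the vectorized comparison but apply it block by block, stopping at the first block that contains a match. The following is only a sketch under assumptions: any_equal_chunked and chunk_size are hypothetical names that do not appear in the question or answer, and whether it beats any(M == Val) depends on where the value sits and on the block size.

```r
# Hypothetical helper: scan x in blocks of chunk_size elements and
# return TRUE as soon as one block contains val.
any_equal_chunked <- function(x, val, chunk_size = 1e5) {
  n <- length(x)
  start <- 1
  while (start <= n) {
    end <- min(start + chunk_size - 1, n)
    # Vectorized comparison on the current block only
    if (any(x[start:end] == val, na.rm = TRUE)) return(TRUE)
    start <- end + 1
  }
  FALSE
}

M_demo <- matrix(1:10^6, ncol = 10, byrow = TRUE)  # same shape as M in the question
any_equal_chunked(M_demo, 1)    # match in the first block, so only one block is compared
any_equal_chunked(M_demo, 0)    # no match, so every block is compared
```

Note that, because of na.rm = TRUE, this sketch returns FALSE rather than NA when the data contain missing values and no match, which differs from any(M == Val).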
check() if your Val is found in the last column. But it makes a difference, for example, with Val <- 1.