I'm keen to hear ideas for optimising R code to compute the cosine similarity of a vector x (of length l) with n other vectors (stored in any structure, such as a matrix m with n rows and l columns). Values of n will typically be much larger than values of l.
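For reference, the quantity I'm after for each row y of m is sum(x * y) / sqrt(sum(x^2) * sum(y^2)). A naive base R version (just to pin down the definition — the helper name here is purely illustrative, and this is not what I'm timing) would be:

# Naive base R sketch: cosine similarity of x with each row of m
cosine_x_to_m_naive <- function(x, m) {
  apply(m, 1, function(y) sum(x * y) / sqrt(sum(x^2) * sum(y^2)))
}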
I'm currently using this custom Rcpp function to compute the similarity of a vector x to each row of a matrix m:
library(Rcpp)
cppFunction('NumericVector cosine_x_to_m(NumericVector x, NumericMatrix m) {
  int nrows = m.nrow();
  NumericVector out(nrows);
  for (int i = 0; i < nrows; i++) {
    // Cosine similarity of x and row i: dot product over the product of norms
    NumericVector y = m(i, _);
    out[i] = sum(x * y) / sqrt(sum(pow(x, 2.0)) * sum(pow(y, 2.0)));
  }
  return out;
}')
Varying n and l, I'm getting the following sorts of timings:

[plot of mean runtime by n, one line per l]

Reproducible code below.
# Function to simulate data
sim_data <- function(l, n) {
  # Feature vector to be used for computing similarity
  x <- runif(l)
  # Matrix of feature vectors (1 per row) to compare against x
  m <- matrix(runif(n * l), nrow = n)
  list(x = x, m = m)
}
# Rcpp function to compute similarity of x to each row of m
library(Rcpp)
cppFunction('NumericVector cosine_x_to_m(NumericVector x, NumericMatrix m) {
  int nrows = m.nrow();
  NumericVector out(nrows);
  for (int i = 0; i < nrows; i++) {
    NumericVector y = m(i, _);
    out[i] = sum(x * y) / sqrt(sum(pow(x, 2.0)) * sum(pow(y, 2.0)));
  }
  return out;
}')
# Timer function
library(microbenchmark)
timer <- function(l, n) {
  dat <- sim_data(l, n)
  microbenchmark(cosine_x_to_m(dat$x, dat$m))
}
# Results for grid of l and n
library(tidyverse)
results <- cross_d(list(l = seq(200, 1000, by = 200), n = seq(500, 4000, by = 500))) %>%
  mutate(timings = map2(l, n, timer))

# Plot results
results_plot <- results %>%
  unnest(timings) %>%
  mutate(time = time / 1000000) %>%  # Convert nanoseconds to milliseconds
  group_by(l, n) %>%
  summarise(mean = mean(time), ci = 1.96 * sd(time) / sqrt(n()))

pd <- position_dodge(width = 20)
results_plot %>%
  ggplot(aes(n, mean, group = l)) +
  geom_line(aes(color = factor(l)), position = pd, size = 2) +
  geom_errorbar(aes(ymin = mean - ci, ymax = mean + ci), position = pd, width = 100) +
  geom_point(position = pd, size = 2) +
  scale_color_brewer(palette = "Blues") +
  theme_minimal() +
  labs(x = "n", y = "Milliseconds", color = "l") +
  ggtitle("Algorithm Runtime",
          subtitle = "Error bars represent 95% confidence intervals")
3 Answers
I'm using Microsoft R (with Intel MKL), which makes matrix multiplication faster, but for a fair comparison I set it to be single-threaded:
setMKLthreads(1)
In my tests this pure R version of cosine_x_to_m is twice as fast as yours.
cosine_x_to_m2 = function(x, m) {
  # Normalise x once; a single matrix-vector product then gives every dot product
  x = x / sqrt(crossprod(x));
  return( as.vector((m %*% x) / sqrt(rowSums(m^2))) );
}
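As a quick sanity check (a sketch assuming the sim_data helper from the question is loaded), both versions should agree up to floating-point tolerance:

dat <- sim_data(l = 200, n = 500)
# Same values from the Rcpp and pure R implementations
all.equal(cosine_x_to_m(dat$x, dat$m), cosine_x_to_m2(dat$x, dat$m))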
Rewriting rowSums(m^2) in C/C++ makes it even faster: about four times faster than the original.
library(ramwas)
cosine_x_to_m2 = function(x, m) {
  x = x / sqrt(crossprod(x));
  return( as.vector((m %*% x) / sqrt(rowSumsSq(m))) );
}
Initial performance:
[benchmark plot]
Final version performance:
[benchmark plot]
- Thanks, this has helped a huge amount! I tweaked a few things (based on your comments too) and used custom functions instead of ramwas::rowSumsSq (which I had trouble installing). I've added my updated solution in an answer. – Simon Jackson, Apr 2, 2017
- I can get you this function separately from the rest of the package, but it seems you are doing well without it. Glad I could help. – Andrey Shabalin, Apr 3, 2017
Believe it or not, there is still some room for simplification and decent improvement in the function rowSumsSq. The reason is that, due to the column-major order of matrices in memory, looping over rows within columns is more efficient than vice versa: the inner loop then touches consecutive memory locations, which is far friendlier to the cache.
library('Rcpp')
n <- m <- 1000L
dat <- matrix(rnorm(n * m), nrow = n, ncol = m)

cppFunction('NumericVector rowSumsSq_faster(NumericMatrix x) {
  int nrow = x.nrow(), ncol = x.ncol();
  NumericVector out(nrow);
  // Outer loop over columns: the inner loop then walks down a single
  // column, i.e. through contiguous memory
  for (int j = 0; j < ncol; ++j) {
    for (int i = 0; i < nrow; ++i) {
      out[i] += std::pow(x(i, j), 2);
    }
  }
  return out;
}')
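Before timing it, a quick check (sketch) that the new function matches base R on the simulated matrix:

# Should agree with the base R computation up to floating-point tolerance
all.equal(rowSums(dat^2), rowSumsSq_faster(dat))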
A quick benchmark gives
microbenchmark::microbenchmark(times = 1e3L,
  rowSums(dat^2),
  rowSumsSq(dat),
  rowSumsSq_faster(dat)
)

Unit: microseconds
                  expr      min        lq     mean    median        uq
        rowSums(dat^2) 4573.501 4691.8010 5830.381 4764.6240 5522.5845
        rowSumsSq(dat) 1778.245 1855.7020 1886.566 1877.0595 1906.5605
 rowSumsSq_faster(dat)  799.546  852.5835  869.913  865.1385  879.8275
- This is a really great addition! Thanks. I've added it into my updated solution. – Simon Jackson, Apr 19, 2017
Based on answers and comments from @Andrey and @Dominik, I implemented the code below. A solid improvement of roughly 60%!
## C++ functions to offload matrix calculations
library(Rcpp)

# Sum of squares for each row of a matrix
# (column-major loop order, as suggested in the second answer)
cppFunction('NumericVector rowSumsSq(NumericMatrix x) {
  int nrow = x.nrow(), ncol = x.ncol();
  NumericVector out(nrow);
  for (int j = 0; j < ncol; ++j) {
    for (int i = 0; i < nrow; ++i) {
      out[i] += std::pow(x(i, j), 2);
    }
  }
  return out;
}')

# Compute cosine similarity of vector x to each row in matrix m
cosine_x_to_m <- function(x, m) {
  x <- x / sqrt(crossprod(x))
  as.vector(m %*% x / sqrt(rowSumsSq(m)))
}
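A quick sanity check (a sketch assuming the sim_data helper from the question is loaded) against the definition computed directly for one row:

dat <- sim_data(l = 200, n = 500)
res <- cosine_x_to_m(dat$x, dat$m)
# Compare the first element with the cosine similarity computed directly
all.equal(res[1], sum(dat$x * dat$m[1, ]) / sqrt(sum(dat$x^2) * sum(dat$m[1, ]^2)))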
Using the same timing and plotting functions as in the question:

[updated runtime plot]
- Thanks for posting a final solution. I have two further comments. First, crossprod(x) will be faster than sum(x * x). Second, your function mxMult is not necessary; simply replace it with m %*% x and your code will be as fast or even faster. In my experience, it is impossible to beat crossprod, tcrossprod and %*% with anything hand-written in Rcpp. Just try, for the fun of it, to implement full matrix multiplication %*% in Rcpp with three nested for loops; I would bet it will be way slower. – NoBackingDown, Apr 22, 2017
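The first claim is easy to test with a throwaway benchmark; a sketch (absolute timings will vary with the BLAS in use):

x <- runif(1e6)
microbenchmark::microbenchmark(
  crossprod(x),  # single BLAS call; avoids building a large temporary vector
  sum(x * x)     # allocates the intermediate vector x * x first
)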
- And it just keeps getting better! I've updated the answer. Thanks :) – Simon Jackson, Apr 27, 2017
- The pow function for squaring numbers is inefficient. Simple x * x is faster.
- With current compilers there is no difference between std::pow(x, 2), std::pow(x, 2.0) and x * x in terms of speed. It might be different for older compiler versions, e.g., martin-ueding.de/en/articles/efficiency-of-pow-function/…