
I'm keen to hear ideas for optimising R code that computes the cosine similarity of a vector x (of length l) to each of n other vectors (stored in any structure, such as a matrix m with n rows and l columns).

Values for n will typically be much larger than values for l.
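
For comparison, a naive pure-R baseline looks like the sketch below (the name cosine_x_to_m_naive is made up for illustration and is not part of my actual code):

# Naive pure-R baseline: loop over the rows of m with apply()
cosine_x_to_m_naive <- function(x, m) {
  apply(m, 1, function(y) sum(x * y) / sqrt(sum(x^2) * sum(y^2)))
}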

I'm currently using this custom Rcpp function to compute the similarity of a vector x to each row of a matrix m:

library(Rcpp)
cppFunction('NumericVector cosine_x_to_m(NumericVector x, NumericMatrix m) {
  int nrows = m.nrow();
  NumericVector out(nrows);
  for (int i = 0; i < nrows; i++) {
    NumericVector y = m(i, _);
    out[i] = sum(x * y) / sqrt(sum(pow(x, 2.0)) * sum(pow(y, 2.0)));
  }
  return out;
}')

Varying n and l, I'm getting the following sorts of timings:

[Plot: mean runtime of cosine_x_to_m across the grid of l and n; error bars are 95% confidence intervals]

Reproducible code below.


# Function to simulate data
sim_data <- function(l, n) {
  # Feature vector to be used for computing similarity
  x <- runif(l)
  # Matrix of feature vectors (1 per row) to compare against x
  m <- matrix(runif(n * l), nrow = n)
  list(x = x, m = m)
}
# Rcpp function to compute similarity of x to each row of m
library(Rcpp)
cppFunction('NumericVector cosine_x_to_m(NumericVector x, NumericMatrix m) {
  int nrows = m.nrow();
  NumericVector out(nrows);
  for (int i = 0; i < nrows; i++) {
    NumericVector y = m(i, _);
    out[i] = sum(x * y) / sqrt(sum(pow(x, 2.0)) * sum(pow(y, 2.0)));
  }
  return out;
}')
# Timer function
library(microbenchmark)
timer <- function(l, n) {
  dat <- sim_data(l, n)
  microbenchmark(cosine_x_to_m(dat$x, dat$m))
}
# Results for grid of l and n
library(tidyverse)
results <- cross_d(list(l = seq(200, 1000, by = 200), n = seq(500, 4000, by = 500))) %>%
  mutate(timings = map2(l, n, timer))
# Plot results
results_plot <- results %>%
  unnest(timings) %>%
  mutate(time = time / 1000000) %>% # Convert nanoseconds to milliseconds
  group_by(l, n) %>%
  summarise(mean = mean(time), ci = 1.96 * sd(time) / sqrt(n()))
pd <- position_dodge(width = 20)
results_plot %>%
  ggplot(aes(n, mean, group = l)) +
  geom_line(aes(color = factor(l)), position = pd, size = 2) +
  geom_errorbar(aes(ymin = mean - ci, ymax = mean + ci), position = pd, width = 100) +
  geom_point(position = pd, size = 2) +
  scale_color_brewer(palette = "Blues") +
  theme_minimal() +
  labs(x = "n", y = "Milliseconds", color = "l") +
  ggtitle("Algorithm Runtime",
          subtitle = "Error bars represent 95% confidence intervals")
asked Mar 30, 2017 at 20:45
  • On the 'code review' side of things, I have a few comments. Commented Mar 30, 2017 at 21:58
  • The use of the pow function for squaring numbers is inefficient. A simple x*x is faster. (See the sketch after these comments.) Commented Mar 30, 2017 at 22:01
  • @AndreyShabalin I checked it with gcc 6.2.0 and could not find any "significant" difference between std::pow(x, 2), std::pow(x, 2.0) and x * x in terms of speed. It might be different for older compiler versions, e.g., martin-ueding.de/en/articles/efficiency-of-pow-function/… Commented Apr 20, 2017 at 5:48
  • What you may and may not do after receiving answers. I've rolled back Rev 4 → 1. Commented Apr 20, 2017 at 23:53
  • Right, thanks @200_success. I've now posted my solution as a separate answer. Commented Apr 21, 2017 at 1:44
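
The pow-versus-multiplication exchange above is easy to check empirically. The sketch below uses two hypothetical helpers, sumsq_pow and sumsq_mult (names made up for this comparison); on a modern compiler with optimisation enabled, std::pow(x, 2.0) is typically reduced to a plain multiply, consistent with the gcc 6.2.0 observation:

library(Rcpp)
library(microbenchmark)
# Hypothetical helpers for the comparison; not from any post above
cppFunction('double sumsq_pow(NumericVector x) {
  double s = 0;
  for (int i = 0; i < x.size(); ++i) s += std::pow(x[i], 2.0);
  return s;
}')
cppFunction('double sumsq_mult(NumericVector x) {
  double s = 0;
  for (int i = 0; i < x.size(); ++i) s += x[i] * x[i];
  return s;
}')
x <- runif(1e6)
microbenchmark(sumsq_pow(x), sumsq_mult(x))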

3 Answers


I'm using Microsoft R (with Intel MKL), which makes matrix multiplication faster, but for a fair comparison I set it to be single-threaded.

setMKLthreads(1)

In my tests, this pure-R version, cosine_x_to_m2, is twice as fast as yours.

cosine_x_to_m2 = function(x, m) {
  # drop() turns the 1x1 matrix returned by crossprod() into a scalar,
  # so the division over the length-l vector x is well-defined
  x = x / sqrt(drop(crossprod(x)));
  return( as.vector((m %*% x) / sqrt(rowSums(m^2))) );
}
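
A quick equivalence check (a sketch, assuming sim_data and the original cosine_x_to_m from the question are already defined):

dat <- sim_data(l = 500, n = 2000)
all.equal(cosine_x_to_m(dat$x, dat$m), cosine_x_to_m2(dat$x, dat$m))
# [1] TRUE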

Rewriting rowSums(m^2) in C/C++ makes it faster still: about four times faster than the original.

library(ramwas)
cosine_x_to_m2 = function(x, m) {
  x = x / sqrt(drop(crossprod(x)));
  return( as.vector((m %*% x) / sqrt(rowSumsSq(m))) );
}

Initial performance:

[Plot: timings of the original cosine_x_to_m]

Final version performance:

[Plot: timings of the final cosine_x_to_m2]

answered Mar 30, 2017 at 21:21
  • Thanks, this has helped a huge amount! I tweaked a few things (based on your comments too) and used custom functions instead of ramwas::rowSumsSq (which I had trouble installing). I've added my updated solution in an answer. Commented Apr 2, 2017 at 22:13
  • I can get you this function separately from the rest of the package, but it seems you are doing well without it. Glad I could help. Commented Apr 3, 2017 at 0:40

Believe it or not, there is still some room for simplification and a decent improvement in the function rowSumsSq. The reason is that, due to the column-major order of matrices in memory, looping over rows within columns is more efficient than vice versa.

library('Rcpp')
n <- m <- 1000L
dat <- matrix(rnorm(n * m), nrow = n, ncol = m)
cppFunction('NumericVector rowSumsSq_faster(NumericMatrix x) {
  int nrow = x.nrow(), ncol = x.ncol();
  NumericVector out(nrow);
  for (int j = 0; j < ncol; ++j) {
    for (int i = 0; i < nrow; ++i) {
      out[i] += std::pow(x(i, j), 2);
    }
  }
  return out;
}')
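
Before benchmarking, a quick sanity check (a sketch) confirms the column-major loop agrees with base R:

all.equal(rowSums(dat^2), rowSumsSq_faster(dat))
# [1] TRUE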

A quick benchmark gives

microbenchmark::microbenchmark(times = 1e3L,
  rowSums(dat^2),
  rowSumsSq(dat),
  rowSumsSq_faster(dat)
)
Unit: microseconds
                  expr      min        lq     mean   median        uq
        rowSums(dat^2) 4573.501 4691.8010 5830.381 4764.6240 5522.5845
        rowSumsSq(dat) 1778.245 1855.7020 1886.566 1877.0595 1906.5605
 rowSumsSq_faster(dat)  799.546  852.5835  869.913  865.1385  879.8275
answered Apr 19, 2017 at 21:16
  • This is a really great addition! Thanks. I've added it to my updated solution. Commented Apr 19, 2017 at 23:31

Based on answers and comments from @Andrey and @Dominik, I implemented the code below: a solid improvement of roughly 60%!

## C++ functions to offload matrix calculations
library(Rcpp)
# Sum of squares for each row of a matrix
cppFunction('NumericVector rowSumsSq(NumericMatrix x) {
  int nrow = x.nrow(), ncol = x.ncol();
  NumericVector out(nrow);
  for (int j = 0; j < ncol; ++j) {
    for (int i = 0; i < nrow; ++i) {
      out[i] += std::pow(x(i, j), 2);
    }
  }
  return out;
}')
# Compute cosine similarity of vector x to each row in matrix m
cosine_x_to_m <- function(x, m) {
  # drop() collapses the 1x1 matrix from crossprod() to a scalar
  x <- x / sqrt(drop(crossprod(x)))
  as.vector(m %*% x / sqrt(rowSumsSq(m)))
}
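
Example usage (a sketch, assuming the sim_data helper from the question is loaded):

dat <- sim_data(l = 500, n = 2000)
sim <- cosine_x_to_m(dat$x, dat$m)
head(sim)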

Using the same timing and plotting functions:

[Plot: runtime of the updated solution across the same grid of l and n]

answered Apr 21, 2017 at 1:44
  • Thanks for posting a final solution. I have two further comments. First, crossprod(x) will be faster than sum(x * x). Second, your function mxMult is not necessary; simply replace it with m %*% x and your code will be as fast or even faster. In my experience, it is impossible to beat crossprod, tcrossprod, or %*% with anything hand-written in Rcpp. Just try, for the fun of it, to implement full matrix multiplication %*% in Rcpp with three nested for loops, and I would bet it will be way slower. (See the sketch below.) Commented Apr 22, 2017 at 10:02
  • And it just keeps getting better! I've updated the answer. Thanks :) Commented Apr 27, 2017 at 5:06
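
To illustrate the point in the first comment, here is a sketch of a hand-rolled Rcpp matrix-vector product (the name mat_vec_naive is made up for illustration); on a BLAS-backed build of R, and especially with MKL or OpenBLAS, %*% should win comfortably:

library(Rcpp)
# Hypothetical naive matrix-vector product, only for comparison with %*%
cppFunction('NumericVector mat_vec_naive(NumericMatrix m, NumericVector x) {
  int nrow = m.nrow(), ncol = m.ncol();
  NumericVector out(nrow);
  // Iterate column-by-column to respect column-major storage
  for (int j = 0; j < ncol; ++j) {
    for (int i = 0; i < nrow; ++i) {
      out[i] += m(i, j) * x[j];
    }
  }
  return out;
}')
m <- matrix(runif(2000 * 500), nrow = 2000)
x <- runif(500)
all.equal(as.vector(m %*% x), as.vector(mat_vec_naive(m, x)))
microbenchmark::microbenchmark(m %*% x, mat_vec_naive(m, x))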
