I'm keen to hear ideas for optimising R code to compute the cosine similarity of a vector x (of length l) with n other vectors (stored in any structure, such as a matrix m with n rows and l columns). Values of n will typically be much larger than values of l.
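For reference, the quantity I'm after for each row y of m is sum(x * y) / sqrt(sum(x^2) * sum(y^2)). A naive base R version (just to pin down the definition — the helper name here is purely illustrative, and this is not what I'm timing) would be:

# Naive base R sketch: cosine similarity of x with each row of m
cosine_x_to_m_naive <- function(x, m) {
  apply(m, 1, function(y) sum(x * y) / sqrt(sum(x^2) * sum(y^2)))
}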
I'm currently using this custom Rcpp function to compute the similarity of a vector x to each row of a matrix m:
library(Rcpp)
cppFunction('NumericVector cosine_x_to_m(NumericVector x, NumericMatrix m) {
  int nrows = m.nrow();
  NumericVector out(nrows);
  for (int i = 0; i < nrows; i++) {
    // Cosine similarity of x and row i: dot product over the product of norms
    NumericVector y = m(i, _);
    out[i] = sum(x * y) / sqrt(sum(pow(x, 2.0)) * sum(pow(y, 2.0)));
  }
  return out;
}')
Varying n and l, I'm getting the following sorts of timings:

[plot of mean runtime by n, one line per l]

Reproducible code below.
# Function to simulate data
sim_data <- function(l, n) {
  # Feature vector to be used for computing similarity
  x <- runif(l)
  # Matrix of feature vectors (1 per row) to compare against x
  m <- matrix(runif(n * l), nrow = n)
  list(x = x, m = m)
}
# Rcpp function to compute similarity of x to each row of m
library(Rcpp)
cppFunction('NumericVector cosine_x_to_m(NumericVector x, NumericMatrix m) {
  int nrows = m.nrow();
  NumericVector out(nrows);
  for (int i = 0; i < nrows; i++) {
    NumericVector y = m(i, _);
    out[i] = sum(x * y) / sqrt(sum(pow(x, 2.0)) * sum(pow(y, 2.0)));
  }
  return out;
}')
# Timer function
library(microbenchmark)
timer <- function(l, n) {
  dat <- sim_data(l, n)
  microbenchmark(cosine_x_to_m(dat$x, dat$m))
}
# Results for grid of l and n
library(tidyverse)
results <- cross_d(list(l = seq(200, 1000, by = 200), n = seq(500, 4000, by = 500))) %>%
  mutate(timings = map2(l, n, timer))

# Plot results
results_plot <- results %>%
  unnest(timings) %>%
  mutate(time = time / 1000000) %>%  # Convert nanoseconds to milliseconds
  group_by(l, n) %>%
  summarise(mean = mean(time), ci = 1.96 * sd(time) / sqrt(n()))

pd <- position_dodge(width = 20)
results_plot %>%
  ggplot(aes(n, mean, group = l)) +
  geom_line(aes(color = factor(l)), position = pd, size = 2) +
  geom_errorbar(aes(ymin = mean - ci, ymax = mean + ci), position = pd, width = 100) +
  geom_point(position = pd, size = 2) +
  scale_color_brewer(palette = "Blues") +
  theme_minimal() +
  labs(x = "n", y = "Milliseconds", color = "l") +
  ggtitle("Algorithm Runtime",
          subtitle = "Error bars represent 95% confidence intervals")
3 Answers
I'm using Microsoft R (with Intel MKL), which makes matrix multiplication faster, but for a fair comparison I set it to be single-threaded:
setMKLthreads(1)
In my tests this pure R version of cosine_x_to_m is twice as fast as yours.
cosine_x_to_m2 = function(x, m) {
  # Normalise x once; a single matrix-vector product then gives every dot product
  x = x / sqrt(crossprod(x));
  return( as.vector((m %*% x) / sqrt(rowSums(m^2))) );
}
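As a quick sanity check (a sketch assuming the sim_data helper from the question is loaded), both versions should agree up to floating-point tolerance:

dat <- sim_data(l = 200, n = 500)
# Same values from the Rcpp and pure R implementations
all.equal(cosine_x_to_m(dat$x, dat$m), cosine_x_to_m2(dat$x, dat$m))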
Rewriting rowSums(m^2) in C/C++ makes it even faster: about four times faster than the original.
library(ramwas)
cosine_x_to_m2 = function(x, m) {
  x = x / sqrt(crossprod(x));
  return( as.vector((m %*% x) / sqrt(rowSumsSq(m))) );
}
Initial performance:
[benchmark plot]
Final version performance:
[benchmark plot]
- Thanks, this has helped a huge amount! I tweaked a few things (based on your comments too) and used custom functions instead of ramwas::rowSumsSq (which I had trouble installing). I've added my updated solution in an answer. – Simon Jackson, Apr 2, 2017
- I can get you this function separately from the rest of the package, but it seems you are doing well without it. Glad I could help. – Andrey Shabalin, Apr 3, 2017
Believe it or not, there is still some room for simplification and decent improvement in the function rowSumsSq. The reason is that, due to the column-major order of matrices in memory, looping over rows within columns is more efficient than vice versa: the inner loop then touches consecutive memory locations, which is far friendlier to the cache.
library('Rcpp')
n <- m <- 1000L
dat <- matrix(rnorm(n * m), nrow = n, ncol = m)

cppFunction('NumericVector rowSumsSq_faster(NumericMatrix x) {
  int nrow = x.nrow(), ncol = x.ncol();
  NumericVector out(nrow);
  // Outer loop over columns: the inner loop then walks down a single
  // column, i.e. through contiguous memory
  for (int j = 0; j < ncol; ++j) {
    for (int i = 0; i < nrow; ++i) {
      out[i] += std::pow(x(i, j), 2);
    }
  }
  return out;
}')
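Before timing it, a quick check (sketch) that the new function matches base R on the simulated matrix:

# Should agree with the base R computation up to floating-point tolerance
all.equal(rowSums(dat^2), rowSumsSq_faster(dat))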
A quick benchmark gives
microbenchmark::microbenchmark(times = 1e3L,
  rowSums(dat^2),
  rowSumsSq(dat),
  rowSumsSq_faster(dat)
)

Unit: microseconds
                  expr      min        lq     mean    median        uq
        rowSums(dat^2) 4573.501 4691.8010 5830.381 4764.6240 5522.5845
        rowSumsSq(dat) 1778.245 1855.7020 1886.566 1877.0595 1906.5605
 rowSumsSq_faster(dat)  799.546  852.5835  869.913  865.1385  879.8275
- This is a really great addition! Thanks. I've added it into my updated solution. – Simon Jackson, Apr 19, 2017
Based on answers and comments from @Andrey and @Dominik, I implemented the code below. A solid improvement of roughly 60%!
## C++ functions to offload matrix calculations
library(Rcpp)

# Sum of squares for each row of a matrix
# (column-major loop order, as suggested in the second answer)
cppFunction('NumericVector rowSumsSq(NumericMatrix x) {
  int nrow = x.nrow(), ncol = x.ncol();
  NumericVector out(nrow);
  for (int j = 0; j < ncol; ++j) {
    for (int i = 0; i < nrow; ++i) {
      out[i] += std::pow(x(i, j), 2);
    }
  }
  return out;
}')

# Compute cosine similarity of vector x to each row in matrix m
cosine_x_to_m <- function(x, m) {
  x <- x / sqrt(crossprod(x))
  as.vector(m %*% x / sqrt(rowSumsSq(m)))
}
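A quick sanity check (a sketch assuming the sim_data helper from the question is loaded) against the definition computed directly for one row:

dat <- sim_data(l = 200, n = 500)
res <- cosine_x_to_m(dat$x, dat$m)
# Compare the first element with the cosine similarity computed directly
all.equal(res[1], sum(dat$x * dat$m[1, ]) / sqrt(sum(dat$x^2) * sum(dat$m[1, ]^2)))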
Using the same timing and plotting functions as in the question:

[updated runtime plot]
- Thanks for posting a final solution. I have two further comments. First, crossprod(x) will be faster than sum(x * x). Second, your function mxMult is not necessary; simply replace it with m %*% x and your code will be as fast or even faster. In my experience, it is impossible to beat crossprod, tcrossprod and %*% with anything hand-written in Rcpp. Just try, for the fun of it, to implement full matrix multiplication %*% in Rcpp with three nested for loops; I would bet it will be way slower. – NoBackingDown, Apr 22, 2017
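The first claim is easy to test with a throwaway benchmark; a sketch (absolute timings will vary with the BLAS in use):

x <- runif(1e6)
microbenchmark::microbenchmark(
  crossprod(x),  # single BLAS call; avoids building a large temporary vector
  sum(x * x)     # allocates the intermediate vector x * x first
)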
- And it just keeps getting better! I've updated the answer. Thanks :) – Simon Jackson, Apr 27, 2017
- The pow function for squaring numbers is inefficient. Simple x * x is faster.
- With current compilers there is no difference between std::pow(x, 2), std::pow(x, 2.0) and x * x in terms of speed. It might be different for older compiler versions, e.g., martin-ueding.de/en/articles/efficiency-of-pow-function/…