Cosine similarity

Question 1

Is there anything to make the cosine similarity function more optimized in terms of CPU execution time?

double cosine_similarity(double *A, double *B, unsigned int size)
{
 double mul = 0.0, d_a = 0.0, d_b = 0.0 ;
 for(unsigned int i = 0; i < size; ++i) 
 {
 mul += A[i] * B[i] ;
 d_a += A[i] * A[i] ;
 d_b += B[i] * B[i] ;
 }
 return mul / (sqrt(d_a) * sqrt(d_b)) ;
}

Question 2

Depending on the size of A and B, passing by (const) reference could be faster, since there is less memory to be copied onto the stack.

Question 3

Also you could reverse the loop, as 'for(int i = size; i--){...}', so you dont have to compare i to size in every step.

Question 4

@M.Hoffmann Comments are for seeking clarification to the question. Please write all suggestions as answers.

Question 5

Check for Possible Errors

There are a number of possible errors in the code:

The variable size may be zero which will lead to division by zero.
The vectors A and B may not be the same length, if the size variable is the size of the shorter vector this is not a problem, but if the size variable is the size of the larger vector this can lead to unknown behavior.
Either d_a or d_b can add up to zero, which can lead to division by zero as @coderodde pointed out.

C++ provides the exception class std:exception that has multiple sub classes. Two of the classes are std::runtime_error and std::logic_error. Item 1 and 2 above belong in the std::logic_error class, item 3 belongs in the std::runtime_error class.

Code Readability and Maintainability

The code has:

double mul = 0.0, d_a = 0.0, d_b = 0.0 ;

This is less readable and maintainable than putting the initialization on separate lines:

double mul = 0.0;
double d_a = 0.0;
double d_b = 0.0;

Direct Addressing Versus Indirect Addressing

The the most efficient way to write the function is to use direct addressing rather than indirect addressing. Since the code is already passing pointers use the pointers rather than indexing them.

double cosine_similarity(double *A, double *B, unsigned int size)
{
 double mul = 0.0;
 double d_a = 0.0;
 double d_b = 0.0 ;
 for(unsigned int i = 0; i < size; ++i)
 {
 mul += *A * *B;
 d_a += *A * *A;
 d_b += *B * *B;
 A++;
 B++;
 }
 if (d_a == 0.0f || d_b == 0.0f)
 {
 throw std::logic_error(
 "cosine similarity is not defined whenever one or both "
 "input vectors are zero-vectors.");
 }
 return mul / (sqrt(d_a) * sqrt(d_b)) ;
}

Using Container Classes Can be Safer as Well as Efficient

The container classes such as std::vector and std::array provide safer containers than C style arrays. If the number of elements in the array are know, the std::array class is more efficient. If the number of elements in the array are not known std::vector provides a flexible as well as convenient way to store values.

These container classes provide additional information such as the size of the array/vector and iterators for direct addressing. They also allow for indexing through the 'array'.

#include <math.h>
#include <vector>
#include <stdexcept>
double cosine_similarity_vectors(std::vector<double> A, std::vector<double>B)
{
 double mul = 0.0;
 double d_a = 0.0;
 double d_b = 0.0 ;
 if (A.size() != B.size())
 {
 throw std::logic_error("Vector A and Vector B are not the same size");
 }
 // Prevent Division by zero
 if (A.size() < 1)
 {
 throw std::logic_error("Vector A and Vector B are empty");
 }
 std::vector<double>::iterator B_iter = B.begin();
 std::vector<double>::iterator A_iter = A.begin();
 for( ; A_iter != A.end(); A_iter++ , B_iter++ )
 {
 mul += *A_iter * *B_iter;
 d_a += *A_iter * *A_iter;
 d_b += *B_iter * *B_iter;
 }
 if (d_a == 0.0f || d_b == 0.0f)
 {
 throw std::logic_error(
 "cosine similarity is not defined whenever one or both "
 "input vectors are zero-vectors.");
 }
 return mul / (sqrt(d_a) * sqrt(d_b));
}

Question 6

As far as i know you can put B_iter and A_iter into the for loop as they are the same type. for( std::vector<double>::iterator A_iter = A.begin(), B_iter = B.begin(); A_iter != A.end(); A_iter++ , B_iter++ )

Question 7

Since $$\sqrt{D_a} \sqrt{D_b} = \sqrt{D_a D_b}$$ you can get rid of one call to sqrt

double cosine_similarity2(double *A, double *B, unsigned int size)
{
 double mul = 0.0, d_a = 0.0, d_b = 0.0 ;
 for(unsigned int i = 0; i < size; ++i)
 {
 mul += A[i] * B[i] ;
 d_a += A[i] * A[i] ;
 d_b += B[i] * B[i] ;
 }
 if (d_a == 0 || d_b == 0)
 {
 throw runtime_error(
 "cosine similarity is not defined whenever one or both "
 "input vectors are zero-vectors.");
 }
 return mul / (sqrt(d_a * d_b)) ;
}

Minor

Usually people don't put a single space before the semicolon ;. However, usually a single space is put straight after for:

for (int i = 0; ..., ++i) 
{ ^
 ...
}

Divide by zero

Note that sqrt(d_a * d_b) == 0 only when at least one of the input vectors is a zero vector. However, cosine similarity assumes that the two input vectors are not zero-vectors, so it makes sense to me to throw an exception when it happens.

Divide by zero ++

As 200_success advised, you can omit the entire exception-throwing code and just proceed to division by zero: that will return a special NaN value ("Not a Number").

Question 8

How can I get rid of divide by zero as (d_a * d_b) can be zero if all the elements of the two array is all zero?

Question 9

1) I'm not convinced that an exception would be better than a NaN result. 2) sqrt(a * b) is not the same as sqrt(a) * sqrt(b) if a or b is negative.

Question 10

@200_success Take another look at \$a\$ or \$b\$. They can't be negative. However, thank you for your suggestion; I will add it to my answer.

Question 11

@200_success I am aware of existence of NaN, yet I would personally choose to avoid it since it "infects" everything it comes to operate with, yielding (silently) a NaN result.

Question 12

You might also want to benchmark the results of splitting the single loop into three, if this function will be used on large vectors or very frequently.

Question 13

Calculating cosine similarity between \$n\$ things only require calculating your square root (\$d\$) for \$n\$ times instead of \$n \space x \space n\$ times.

If you write a function that calculate cosine similarity between n things instead of just two things, you can save some time on calculating the square root.

Alternatively, write a function that calculate cosine similarity between two things with an array of the square roots as an argument.

pacmaninbw ♦pacmaninbw 26.1k13 gold badges47 silver badges113 bronze badges · Accepted Answer · 2016-09-24 14:40:28Z

Check for Possible Errors

There are a number of possible errors in the code:

The variable size may be zero which will lead to division by zero.
The vectors A and B may not be the same length, if the size variable is the size of the shorter vector this is not a problem, but if the size variable is the size of the larger vector this can lead to unknown behavior.
Either d_a or d_b can add up to zero, which can lead to division by zero as @coderodde pointed out.

C++ provides the exception class std:exception that has multiple sub classes. Two of the classes are std::runtime_error and std::logic_error. Item 1 and 2 above belong in the std::logic_error class, item 3 belongs in the std::runtime_error class.

Code Readability and Maintainability

The code has:

double mul = 0.0, d_a = 0.0, d_b = 0.0 ;

This is less readable and maintainable than putting the initialization on separate lines:

double mul = 0.0;
double d_a = 0.0;
double d_b = 0.0;

Direct Addressing Versus Indirect Addressing

The the most efficient way to write the function is to use direct addressing rather than indirect addressing. Since the code is already passing pointers use the pointers rather than indexing them.

double cosine_similarity(double *A, double *B, unsigned int size)
{
 double mul = 0.0;
 double d_a = 0.0;
 double d_b = 0.0 ;
 for(unsigned int i = 0; i < size; ++i)
 {
 mul += *A * *B;
 d_a += *A * *A;
 d_b += *B * *B;
 A++;
 B++;
 }
 if (d_a == 0.0f || d_b == 0.0f)
 {
 throw std::logic_error(
 "cosine similarity is not defined whenever one or both "
 "input vectors are zero-vectors.");
 }
 return mul / (sqrt(d_a) * sqrt(d_b)) ;
}

Using Container Classes Can be Safer as Well as Efficient

The container classes such as std::vector and std::array provide safer containers than C style arrays. If the number of elements in the array are know, the std::array class is more efficient. If the number of elements in the array are not known std::vector provides a flexible as well as convenient way to store values.

These container classes provide additional information such as the size of the array/vector and iterators for direct addressing. They also allow for indexing through the 'array'.

#include <math.h>
#include <vector>
#include <stdexcept>
double cosine_similarity_vectors(std::vector<double> A, std::vector<double>B)
{
 double mul = 0.0;
 double d_a = 0.0;
 double d_b = 0.0 ;
 if (A.size() != B.size())
 {
 throw std::logic_error("Vector A and Vector B are not the same size");
 }
 // Prevent Division by zero
 if (A.size() < 1)
 {
 throw std::logic_error("Vector A and Vector B are empty");
 }
 std::vector<double>::iterator B_iter = B.begin();
 std::vector<double>::iterator A_iter = A.begin();
 for( ; A_iter != A.end(); A_iter++ , B_iter++ )
 {
 mul += *A_iter * *B_iter;
 d_a += *A_iter * *A_iter;
 d_b += *B_iter * *B_iter;
 }
 if (d_a == 0.0f || d_b == 0.0f)
 {
 throw std::logic_error(
 "cosine similarity is not defined whenever one or both "
 "input vectors are zero-vectors.");
 }
 return mul / (sqrt(d_a) * sqrt(d_b));
}

As far as i know you can put B_iter and A_iter into the for loop as they are the same type. for( std::vector<double>::iterator A_iter = A.begin(), B_iter = B.begin(); A_iter != A.end(); A_iter++ , B_iter++ )

Stack Exchange Network

Cosine similarity

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Cosine similarity

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions