Is there anything to make the cosine similarity function more optimized in terms of CPU execution time?
double cosine_similarity(double *A, double *B, unsigned int size)
{
double mul = 0.0, d_a = 0.0, d_b = 0.0 ;
for(unsigned int i = 0; i < size; ++i)
{
mul += A[i] * B[i] ;
d_a += A[i] * A[i] ;
d_b += B[i] * B[i] ;
}
return mul / (sqrt(d_a) * sqrt(d_b)) ;
}
-
\$\begingroup\$ Depending on the size of A and B, passing by (const) reference could be faster, since there is less memory to be copied onto the stack. \$\endgroup\$M.Hoffmann– M.Hoffmann2016年09月24日 11:41:10 +00:00Commented Sep 24, 2016 at 11:41
-
\$\begingroup\$ Also you could reverse the loop, as 'for(int i = size; i--){...}', so you dont have to compare i to size in every step. \$\endgroup\$M.Hoffmann– M.Hoffmann2016年09月24日 11:56:56 +00:00Commented Sep 24, 2016 at 11:56
-
\$\begingroup\$ @M.Hoffmann Comments are for seeking clarification to the question. Please write all suggestions as answers. \$\endgroup\$200_success– 200_success2016年09月24日 13:49:51 +00:00Commented Sep 24, 2016 at 13:49
3 Answers 3
Check for Possible Errors
There are a number of possible errors in the code:
- The variable
size
may be zero which will lead to division by zero. - The vectors A and B may not be the same length, if the
size
variable is the size of the shorter vector this is not a problem, but if thesize
variable is the size of the larger vector this can lead to unknown behavior. - Either d_a or d_b can add up to zero, which can lead to division by zero as @coderodde pointed out.
C++ provides the exception class std:exception that has multiple sub classes.
Two of the classes are std::runtime_error
and std::logic_error
. Item 1 and 2
above belong in the std::logic_error
class, item 3 belongs in the
std::runtime_error
class.
Code Readability and Maintainability
The code has:
double mul = 0.0, d_a = 0.0, d_b = 0.0 ;
This is less readable and maintainable than putting the initialization on separate lines:
double mul = 0.0;
double d_a = 0.0;
double d_b = 0.0;
Direct Addressing Versus Indirect Addressing
The the most efficient way to write the function is to use direct addressing rather than indirect addressing. Since the code is already passing pointers use the pointers rather than indexing them.
double cosine_similarity(double *A, double *B, unsigned int size)
{
double mul = 0.0;
double d_a = 0.0;
double d_b = 0.0 ;
for(unsigned int i = 0; i < size; ++i)
{
mul += *A * *B;
d_a += *A * *A;
d_b += *B * *B;
A++;
B++;
}
if (d_a == 0.0f || d_b == 0.0f)
{
throw std::logic_error(
"cosine similarity is not defined whenever one or both "
"input vectors are zero-vectors.");
}
return mul / (sqrt(d_a) * sqrt(d_b)) ;
}
Using Container Classes Can be Safer as Well as Efficient
The container classes such as std::vector
and std::array
provide safer containers than C style arrays. If the number of elements in the array are know, the std::array class is more efficient. If the number of elements in the array are not known std::vector
provides a flexible as well as convenient way to store values.
These container classes provide additional information such as the size of the array/vector and iterators for direct addressing. They also allow for indexing through the 'array'.
#include <math.h>
#include <vector>
#include <stdexcept>
double cosine_similarity_vectors(std::vector<double> A, std::vector<double>B)
{
double mul = 0.0;
double d_a = 0.0;
double d_b = 0.0 ;
if (A.size() != B.size())
{
throw std::logic_error("Vector A and Vector B are not the same size");
}
// Prevent Division by zero
if (A.size() < 1)
{
throw std::logic_error("Vector A and Vector B are empty");
}
std::vector<double>::iterator B_iter = B.begin();
std::vector<double>::iterator A_iter = A.begin();
for( ; A_iter != A.end(); A_iter++ , B_iter++ )
{
mul += *A_iter * *B_iter;
d_a += *A_iter * *A_iter;
d_b += *B_iter * *B_iter;
}
if (d_a == 0.0f || d_b == 0.0f)
{
throw std::logic_error(
"cosine similarity is not defined whenever one or both "
"input vectors are zero-vectors.");
}
return mul / (sqrt(d_a) * sqrt(d_b));
}
-
1\$\begingroup\$ As far as i know you can put B_iter and A_iter into the for loop as they are the same type.
for( std::vector<double>::iterator A_iter = A.begin(), B_iter = B.begin(); A_iter != A.end(); A_iter++ , B_iter++ )
\$\endgroup\$miscco– miscco2016年09月25日 20:50:47 +00:00Commented Sep 25, 2016 at 20:50
Since $$\sqrt{D_a} \sqrt{D_b} = \sqrt{D_a D_b}$$
you can get rid of one call to sqrt
double cosine_similarity2(double *A, double *B, unsigned int size)
{
double mul = 0.0, d_a = 0.0, d_b = 0.0 ;
for(unsigned int i = 0; i < size; ++i)
{
mul += A[i] * B[i] ;
d_a += A[i] * A[i] ;
d_b += B[i] * B[i] ;
}
if (d_a == 0 || d_b == 0)
{
throw runtime_error(
"cosine similarity is not defined whenever one or both "
"input vectors are zero-vectors.");
}
return mul / (sqrt(d_a * d_b)) ;
}
Minor
Usually people don't put a single space before the semicolon ;
. However, usually a single space is put straight after for
:
for (int i = 0; ..., ++i)
{ ^
...
}
Divide by zero
Note that sqrt(d_a * d_b) == 0
only when at least one of the input vectors is a zero vector. However, cosine similarity assumes that the two input vectors are not zero-vectors, so it makes sense to me to throw an exception when it happens.
Divide by zero ++
As 200_success advised, you can omit the entire exception-throwing code and just proceed to division by zero: that will return a special NaN value ("Not a Number").
-
\$\begingroup\$ How can I get rid of divide by zero as (d_a * d_b) can be zero if all the elements of the two array is all zero? \$\endgroup\$MD. Nazmul Kibria– MD. Nazmul Kibria2016年09月24日 10:16:57 +00:00Commented Sep 24, 2016 at 10:16
-
1\$\begingroup\$ 1) I'm not convinced that an exception would be better than a NaN result. 2) sqrt(a * b) is not the same as sqrt(a) * sqrt(b) if a or b is negative. \$\endgroup\$200_success– 200_success2016年09月24日 13:39:20 +00:00Commented Sep 24, 2016 at 13:39
-
1\$\begingroup\$ @200_success Take another look at \$a\$ or \$b\$. They can't be negative. However, thank you for your suggestion; I will add it to my answer. \$\endgroup\$coderodde– coderodde2016年09月24日 13:41:48 +00:00Commented Sep 24, 2016 at 13:41
-
\$\begingroup\$ @200_success I am aware of existence of NaN, yet I would personally choose to avoid it since it "infects" everything it comes to operate with, yielding (silently) a NaN result. \$\endgroup\$coderodde– coderodde2016年09月24日 13:46:33 +00:00Commented Sep 24, 2016 at 13:46
-
\$\begingroup\$ You might also want to benchmark the results of splitting the single loop into three, if this function will be used on large vectors or very frequently. \$\endgroup\$Alex Reinking– Alex Reinking2016年09月24日 19:08:27 +00:00Commented Sep 24, 2016 at 19:08
Calculating cosine similarity between \$n\$ things only require calculating your square root (\$d\$) for \$n\$ times instead of \$n \space x \space n\$ times.
If you write a function that calculate cosine similarity between n things instead of just two things, you can save some time on calculating the square root.
Alternatively, write a function that calculate cosine similarity between two things with an array of the square roots as an argument.