Long time coder, but very very new to C++
. I'm trying to put together my first real-sized project and figuring out what I need to change about my script-kiddie style of C++
to get there. Below is my working code.
#include <math.h>
#include <vector>
#include <iostream>
#include <stdexcept>
enum class Metric { Cosine, Euclidean};
double cosine(std::vector<double> A, std::vector<double> B, bool similarity){
double mul = 0.0;
double d_a = 0.0;
double d_b = 0.0;
std::vector<double>::iterator A_iter = A.begin();
std::vector<double>::iterator B_iter = B.begin();
for ( ; A_iter != A.end(); A_iter++ , B_iter++ ) {
mul += *A_iter * *B_iter;
d_a += *A_iter * *A_iter;
d_b += *B_iter * *B_iter;
}
if (d_a == 0.0f || d_b == 0.0f) {
throw std::logic_error("Vectors must have a length greater than zero");
}
double cos_similarity = mul / sqrt(d_a * d_b);
if (similarity == false) {
return acos(cos_similarity) / M_PI;
} else {
return cos_similarity;
}
}
double euclidean(std::vector<double> A, std::vector<double>B, bool similarity) {
double sq_dist = 0.0;
std::vector<double>::iterator A_iter = A.begin();
std::vector<double>::iterator B_iter = B.begin();
for ( ; A_iter != A.end(); A_iter++, B_iter++) {
sq_dist += pow(*A_iter - *B_iter, 2);
}
if (similarity == false) {
return sqrt(sq_dist);
} else {
return 1.0/sqrt(sq_dist);
}
}
double distance(std::vector<double> A, std::vector<double> B, Metric metric, bool similarity = false) {
if (A.size() != B.size()) {
throw std::logic_error("A and B must be the same size!");
}
if (A.size() < 1) {
throw std::logic_error("Empty vectors belong to no metric space");
}
switch (metric) {
case Metric::Cosine:
return cosine(A, B, similarity);
case Metric::Euclidean:
return euclidean(A, B, similarity);
}
}
double similarity(std::vector<double> A, std::vector<double> B, Metric metric) {
return distance(A, B, metric, true);
}
My first question is around exactly how I should construct the header file. Specifically curious about the right way to handle the enum
. Secondly, any general feedback, things that I could clean up, cut out, etc... would be greatly appreciated.
3 Answers 3
Prefer C++ headers
If you #include <cmath>
instead of <math.h>
, the mathematical functions will be safely and unambiguously in the std
namespace.
Pass read-only parameters by const reference
Instead of copying the input vectors, it's usually more efficient to pass const references to them:
double cosine(const std::vector<double>& A, const std::vector<double>& B, bool similarity);
double euclidean(const std::vector<double>& A, const std::vector<double>& B, bool similarity);
double distance(const std::vector<double>& A, const std::vector<double>& B,
Metric metric, bool similarity = false);
Use auto
to deduce types
Instead of writing
std::vector<double>::iterator A_iter = A.begin();
std::vector<double>::iterator B_iter = B.begin();
We can ask the compiler to deduce the correct type:
auto A_iter = A.begin();
auto B_iter = B.begin();
This also makes it easier if you ever want to change the type of A
and B
(perhaps even to a template argument type). In fact, by changing to const references, we already changed the type (to std::vector<double>::const_iterator
), but auto
isolates us from that.
Use the right kind of exception
You could be more specific, and use std::invalid_argument
for the error type here:
if (d_a == 0.0 || d_b == 0.0) {
throw std::invalid_argument("Vectors must have a magnitude greater than zero");
}
I've changed the wording above, because length
usually means the number of elements when we're using standard collections.
I also changed the zero constant to be a double
, the same as d_a
and d_b
, since it will get promoted for the comparison anyway. Only use float
when you really need the reduced precision.
Don't compare booleans against false
This is hard to read:
if (similarity == false) {
// negative case
} else {
// positive case
}
You can re-write it as
if (similarity) {
// positive case
} else {
// negative case
}
This will normally produce identical code, but the cognitive overhead is reduced.
Avoid non-standard extensions
Although many implementations define a constant called M_PI
, this isn't specified by any standard. Thankfully it's easy to make your own value, using the knowledge that tan(1⁄4π) is 1:
static const auto pi = 4*std::atan(1);
Avoid std::pow(x, 2)
(削除) To square a number, it's simpler and more efficient to multiply it by itself. pow()
is very general (handling non-integer powers), and usually works by multiplication of logarithms. (削除ここまで)
Even though std::pow()
has overloads for integer powers, I find that for squaring, it's easier to read a simple multiplication rather than a function call. It's worth introducing a temporary to avoid that:
//sq_dist += pow(*A_iter - *B_iter, 2);
auto dist = *A_iter - *B_iter;
sq_dist += dist * dist;
Don't fall off the end of non-void functions
Here, we're assuming that we're always passed a valid enum constant:
switch (metric) {
case Metric::Cosine:
return cosine(A, B, similarity);
case Metric::Euclidean:
return euclidean(A, B, similarity);
}
// implicit return
Robust code checks that we don't flow out of the switch
:
switch (metric) {
case Metric::Cosine:
return cosine(A, B, similarity);
case Metric::Euclidean:
return euclidean(A, B, similarity);
}
throw std::invalid_argument("Invalid metric");
Invalid enum values shouldn't happen, but it costs little to be defensive against a mistaken cast.
Design issues
It's not clear whether cosine()
and euclidean()
are intended to be part of the public interface. If they are not, then give them internal linkage (e.g. by use of the anonymous namespace, or with the static
keyword), and consider moving the similarity
parameter out of them.
I prefer to avoid booleans that change the meaning of functions like that; see if you agree when you see my full example.
Here's a full version with the above issues all addressed (and assuming that cosine()
and euclidean()
are intended to be internal detail).
Declarations
This is what I'd put in the header file (let's call it similarity.h
):
#ifndef SIMILARITY_H
#define SIMILARITY_H
#include <vector>
enum class Metric { Cosine, Euclidean };
double distance(const std::vector<double>&, const std::vector<double>&, Metric);
double similarity(const std::vector<double>&, const std::vector<double>&, Metric);
#endif
Implementation
Having included the header, here's the C++ file:
#include "similarity.h"
#include <cmath>
#include <stdexcept>
namespace {
double cosine_distance(const std::vector<double>& a, const std::vector<double>& b)
{
double mul = 0.0;
double d_a = 0.0;
double d_b = 0.0;
auto ia = a.begin();
auto ib = b.begin();
for (; ia != a.end(); ia++, ib++) {
mul += *ia * *ib;
d_a += *ia * *ia;
d_b += *ib * *ib;
}
if (d_a == 0.0 || d_b == 0.0) {
throw std::invalid_argument("Vectors must have a magnitude greater than zero");
}
return mul / std::sqrt(d_a * d_b);
}
double euclidean_distance(const std::vector<double>& a, const std::vector<double>& b)
{
double sq_dist = 0.0;
auto ia = a.begin();
auto ib = b.begin();
for (; ia != a.end(); ia++, ib++) {
auto dist = *ia - *ib;
sq_dist += dist * dist;
}
return std::sqrt(sq_dist);
}
}
double distance(const std::vector<double>& a, const std::vector<double>& b, Metric metric)
{
if (a.size() != b.size())
throw std::invalid_argument("a and b must be the same size!");
if (a.size() < 1)
throw std::invalid_argument("Empty vectors belong to no metric space");
switch (metric) {
case Metric::Cosine:
return cosine_distance(a, b);
case Metric::Euclidean:
return euclidean_distance(a, b);
}
throw std::invalid_argument("Invalid metric");
}
double similarity(const std::vector<double>& a, const std::vector<double>& b, Metric metric)
{
static const auto pi = 4*std::atan(1);
switch (metric) {
case Metric::Cosine:
return std::acos(distance(a, b, metric)) / pi;
case Metric::Euclidean:
return 1.0 / distance(a, b, metric);
}
throw std::invalid_argument("Invalid metric");
}
Note that I include our own header first. That's a helpful approach, to improve our chances of discovering if it accidentally depends on something it doesn't explicitly include.
-
1\$\begingroup\$ There is nothing wrong with
std::pow(x, 2)
, it has overloads for integer exponents. As you can see here, the compiler will happily optimize that to a single multiplication, even at just-O1
. \$\endgroup\$G. Sliepen– G. Sliepen2021年10月24日 08:39:04 +00:00Commented Oct 24, 2021 at 8:39 -
\$\begingroup\$ Thanks for the heads-up there @G.Sliepen. I'd complete overlooked that, as a former C programmer. \$\endgroup\$Toby Speight– Toby Speight2021年10月24日 09:10:08 +00:00Commented Oct 24, 2021 at 9:10
You should not copy all functions into your header file. Pick the functions you need to link with other classes in your application.
For example, if you don't need the cosine
function elsewhere, leave it in your source file. However, I believe you will need to define similarity
in your header file, otherwise the rest of your application can't compute the similarity.
A possibility would look like:
#ifndef MY_HEADER_FILE
#define MY_HEADER_FILE
enum class Metric { Cosine, Euclidean};
// Add your comments
double euclidean(std::vector<double> A, std::vector<double>B, bool similarity);
// Add your comments
double similarity(std::vector<double> A, std::vector<double> B, Metric metric);
#endif
There is nothing wrong how you define the enum
. You may also want to wrap everything in a struct
, that's how OO should work.
-
\$\begingroup\$ Functions that are used internally but not mentioned in the header file ought to be declared
static
so they really are private to that translation unit. \$\endgroup\$Toby Speight– Toby Speight2019年02月04日 09:35:29 +00:00Commented Feb 4, 2019 at 9:35
Here, we're using an enum solely to select between different implementations. If we're not using it for anything else, then it may make more sense to pass a class object that provides those implementations.
We start with an abstract class that defines the interface:
#include <vector>
class metric_space
{
public:
virtual double distance(const std::vector<double>&,
const std::vector<double>&) const = 0;
virtual double similarity(const std::vector<double>&,
const std::vector<double>&) const = 0;
virtual ~metric_space() = default;
};
The virtual ... = 0
are what make this class an abstract one. We can't create objects of this class, but only derive from it.
We add a virtual destructor, which is a good practice for all classes intended to be used as base classes. It ensures that the correct sub-class destructor is executed when a base-class reference is used. I'll not go into more details, as you can read more about this in any good C++ book or web-site.
Now we define a pair of classes that derive from this interface. This is very straightforward; we'll use the override
keyword to ensure we get error messages if we fail to exactly match the base-class method signatures.
class cosine_metric_space : public metric_space
{
public:
double distance(const std::vector<double>&,
const std::vector<double>&) const override;
double similarity(const std::vector<double>&,
const std::vector<double>&) const override;
};
class euclidean_metric_space : public metric_space
{
public:
double distance(const std::vector<double>&,
const std::vector<double>&) const override;
double similarity(const std::vector<double>&,
const std::vector<double>&) const override;
};
That's the public header complete. It looks a bit more complicated than the API that we previously had, but it's still quite usable. Now, we need an instance of one or the other metric space classes instead of an enum value, and we use the method call syntax (e.g. space.distance(a, b)
) rather than passing another value.
The implementation is where we have a real win, even here with just two methods and two different metric spaces (as soon as we increase that number, the benefits increase dramatically).
We start with the required headers, and a private utility function for some shared functionality (I'm basing this code from my already-improved version in my other answer:
#include "similarity.h"
#include <cmath>
#include <stdexcept>
namespace {
void verify_metric_vectors(const std::vector<double>& a,
const std::vector<double>& b) {
if (a.size() != b.size())
throw std::invalid_argument("a and b must be the same size!");
if (a.size() < 1)
throw std::invalid_argument("Empty vectors belong to no metric space");
}
}
Then the implementations of the first class's methods:
// Cosine metric space
double cosine_metric_space::distance(const std::vector<double>& a,
const std::vector<double>& b) const
{
verify_metric_vectors(a, b);
double mul = 0.0;
double d_a = 0.0;
double d_b = 0.0;
auto ia = a.begin();
auto ib = b.begin();
for (; ia != a.end(); ia++, ib++) {
mul += *ia * *ib;
d_a += *ia * *ia;
d_b += *ib * *ib;
}
if (d_a == 0.0 || d_b == 0.0) {
throw std::invalid_argument("Vectors must have a magnitude greater than zero");
// or perhaps we should remove this test, and just let the
// division yield the correct ±∞ here?
}
return mul / std::sqrt(d_a * d_b);
}
double cosine_metric_space::similarity(const std::vector<double>& a,
const std::vector<double>& b) const
{
static const auto pi = 4*std::atan(1);
return std::acos(distance(a, b)) / pi;
}
Followed by the second class's methods:
// Euclidean metric space
double euclidean_metric_space::distance(const std::vector<double>& a,
const std::vector<double>& b) const
{
verify_metric_vectors(a, b);
double sq_dist = 0.0;
auto ia = a.begin();
auto ib = b.begin();
for (; ia != a.end(); ia++, ib++) {
auto dist = *ia - *ib;
sq_dist += dist * dist;
}
return std::sqrt(sq_dist);
}
double euclidean_metric_space::similarity(const std::vector<double>& a,
const std::vector<double>& b) const
{
return 1.0 / distance(a, b);
}
See how the two implementations are now no longer entwined with each other, and we can reason about each one individually? And if we added a third, we could do that without any risk of breaking the existing ones.
main()
, there's very little to go on. See How to Ask for some advice to give your question a better title. \$\endgroup\$main
here would orient feedback toward an aspect of this code that I'm not particularly curious about. There's an interesting meta question here, but as it stand I really believe I've asked an appropriate question. \$\endgroup\$main()
would be nice (for example, if it included test cases), but it's not required. \$\endgroup\$