I need to do a few hundred million euclidean distance calculations every day in a Python project.
Here is what I started out with:
#!/usr/bin/python
import numpy as np
def euclidean_dist_square(x, y):
    diff = np.array(x) - np.array(y)
    return np.dot(diff, diff)
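Dropping the sqrt is safe for pure ranking because sqrt is monotonic, so both distance variants sort candidates identically. A quick self-contained check (random data, purely illustrative):

```python
import numpy as np

def euclidean_dist_square(x, y):
    diff = np.array(x) - np.array(y)
    return np.dot(diff, diff)

# Illustrative random data: 200 candidates, 128 dimensions each.
rng = np.random.default_rng(0)
candidates = rng.standard_normal((200, 128))
query = rng.standard_normal(128)

sq = np.array([euclidean_dist_square(c, query) for c in candidates])
full = np.sqrt(sq)  # true Euclidean distances

# sqrt is monotonic, so both orderings agree.
assert np.array_equal(np.argsort(sq), np.argsort(full))
```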
This is quite fast and I already dropped the sqrt calculation since I need to rank items only (nearest-neighbor search). It is still the bottleneck of the script though. Therefore I have written a C extension, which calculates the distance. The calculation is always done with 128-dimensional vectors.
#include "euclidean.h"
#include <math.h>
double euclidean(double x[128], double y[128])
{
    double Sum = 0.0;  /* must be initialized, otherwise the result is undefined */
    for (int i = 0; i < 128; i++)
    {
        Sum = Sum + pow(x[i] - y[i], 2.0);
    }
return Sum;
}
This is the relevant part of the extension; the complete code for the extension is here. This already gives a nice speedup over the numpy version.
But is there any way to speed this up further (this is my first C extension ever so I assume there is)? With the number of times this function is used every day, every microsecond would actually provide a benefit.
Some of you might suggest porting this completely from Python to another language, unfortunately this is a larger project and not an option.
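One Python-side option short of porting: when many candidate vectors are compared against a single query, NumPy can compute all squared distances in one vectorized call, which amortizes the per-call overhead. A sketch; `X` is a hypothetical 2D array holding one candidate vector per row:

```python
import numpy as np

def euclidean_dist_square_batch(X, y):
    """Squared Euclidean distance from each row of X to y, in one call."""
    diff = X - y                              # broadcasts y across the rows of X
    return np.einsum('ij,ij->i', diff, diff)  # row-wise dot products

# Illustrative data: 1000 candidates, 128 dimensions each.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 128))
y = rng.standard_normal(128)

d = euclidean_dist_square_batch(X, y)  # shape: (1000,)
nearest = int(np.argmin(d))            # index of the nearest neighbor
```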
- Welcome to Code Review! This will likely get you a performance review - if you want your complete code peer reviewed, feel free to embed it into your question as well. – Mathieu Guindon, Jun 1, 2014
- You should also check if you can save calls to this function. Then you could see if there are assumptions you can make - will several x/y appear multiple times, is the data sorted, and things like that. What I am trying to say is: question the algorithm. You didn't really say what the purpose of the calculations was - clustering, nearest-neighbor calculations? – kutschkem, Jun 2, 2014
2 Answers
You can decrease Python's overhead by parsing the args tuple and creating the return value manually.
Try changing these portions of your code:
/* Parse the input tuple */
if (!PyArg_ParseTuple(args, "OO", &x_obj, &y_obj))
    return NULL;
<< code snipped >>
/* Build the output tuple */
PyObject *ret = Py_BuildValue("d", value);
return ret;
to this (untested) code:
/* Parse the input tuple manually */
if (PyTuple_GET_SIZE(args) != 2) {
    PyErr_SetString(PyExc_TypeError, "requires 2 arguments");
    return NULL;
}
x_obj = PyTuple_GET_ITEM(args, 0);
y_obj = PyTuple_GET_ITEM(args, 1);
<< code snipped >>
/* Return double as Python float */
return PyFloat_FromDouble(value);
Updates
Without any changes, your code takes ~340 ns on my computer. Changing the argument processing and how the return value is created decreases the running time to ~295 ns. Replacing pow() with multiplication didn't change the performance with my compiler (GCC 4.8.2). Changing the exponent to a non-integral value significantly increased the running time, so I assume constant integer exponents are optimized by the compiler.
Next, I simplified the conversion of the numpy arrays by checking that the objects passed are numpy arrays. This avoids creating new objects. I also assumed that the data type is a "float64". Extra checks could be added if needed. With these changes, the running time is ~220ns. Code is below:
static PyObject *euclidean_euclidean(PyObject *self, PyObject *args)
{
    PyObject *x_obj, *y_obj;

    if (PyTuple_GET_SIZE(args) != 2) {
        PyErr_SetString(PyExc_TypeError, "requires 2 arguments");
        return NULL;
    }
    x_obj = PyTuple_GET_ITEM(args, 0);
    y_obj = PyTuple_GET_ITEM(args, 1);

    if (!PyArray_Check(x_obj) || !PyArray_Check(y_obj)) {
        PyErr_SetString(PyExc_TypeError, "requires array arguments");
        return NULL;
    }

    /* Get pointers to the data as C-types. */
    double *x = (double*)PyArray_DATA(x_obj);
    double *y = (double*)PyArray_DATA(y_obj);

    double value = euclidean(x, y);

    if (value < 0.0) {
        PyErr_SetString(PyExc_RuntimeError,
                        "Euclidean returned an impossible value.");
        return NULL;
    }

    /* Return the double as a Python float */
    return PyFloat_FromDouble(value);
}
If I remove the call to euclidean(), the running time is ~75 ns. If I remove all the argument parsing and just return the value 0.0, the running time is ~72 ns. I did a few more tests to confirm the running times: Python's overhead is consistently ~75 ns and the euclidean() function has a running time of ~150 ns.
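For reproducing timings in this range, timeit is the usual tool. A sketch measuring the pure-NumPy baseline the same way (the C extension itself is not included here, so the NumPy function from the question stands in):

```python
import timeit
import numpy as np

def euclidean_dist_square(x, y):
    diff = np.array(x) - np.array(y)
    return np.dot(diff, diff)

x = np.random.rand(128)
y = np.random.rand(128)

# Average over many calls so per-call overhead dominates measurement noise.
n = 100_000
per_call = timeit.timeit(lambda: euclidean_dist_square(x, y), number=n) / n
print(f"{per_call * 1e9:.0f} ns per call")
```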
- My C is super rusty, but I really hope I didn't forget the precedence order between ! and ==. The test if (!PyTuple_GET_SIZE(args) == 2) can never pass. – David Harkness, Jun 2, 2014
-
Fantastic! This did indeed speed up the whole thing a good notch. – herrherr, Jun 2, 2014
-
@DavidHarkness Thanks for spotting that. It should be correct now. – casevh, Jun 2, 2014
pow is really slow, and its use is only warranted for a non-integer exponent. A simple squaring is better done manually:
delta = x[i] - y[i];
sum += delta * delta;
It is also worth mentioning that loops like yours are a textbook example of GPGPU parallelism.
- GCC gets rid of the pow call. – herrherr, Jun 2, 2014