I need to do a few hundred million euclidean distance calculations every day in a Python project.
Here is what I started out with:
#!/usr/bin/python
import numpy as np
def euclidean_dist_square(x, y):
    diff = np.array(x) - np.array(y)
    return np.dot(diff, diff)
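Dropping the sqrt is safe for pure ranking because sqrt is monotonic, so both distance variants sort candidates identically. A quick self-contained check (random data, purely illustrative):

```python
import numpy as np

def euclidean_dist_square(x, y):
    diff = np.array(x) - np.array(y)
    return np.dot(diff, diff)

# Illustrative random data: 200 candidates, 128 dimensions each.
rng = np.random.default_rng(0)
candidates = rng.standard_normal((200, 128))
query = rng.standard_normal(128)

sq = np.array([euclidean_dist_square(c, query) for c in candidates])
full = np.sqrt(sq)  # true Euclidean distances

# sqrt is monotonic, so both orderings agree.
assert np.array_equal(np.argsort(sq), np.argsort(full))
```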
This is quite fast and I already dropped the sqrt calculation since I need to rank items only (nearest-neighbor search). It is still the bottleneck of the script though. Therefore I have written a C extension, which calculates the distance. The calculation is always done with 128-dimensional vectors.
#include "euclidean.h"
#include <math.h>
double euclidean(double x[128], double y[128])
{
    double Sum = 0.0;  /* must be initialized, otherwise the result is undefined */
    for (int i = 0; i < 128; i++)
    {
        Sum = Sum + pow(x[i] - y[i], 2.0);
    }
return Sum;
}
This is the relevant part of the extension; the complete code for the extension is here. This already gives a nice speedup over the numpy version.
But is there any way to speed this up further (this is my first C extension ever so I assume there is)? With the number of times this function is used every day, every microsecond would actually provide a benefit.
Some of you might suggest porting this completely from Python to another language, unfortunately this is a larger project and not an option.
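One Python-side option short of porting: when many candidate vectors are compared against a single query, NumPy can compute all squared distances in one vectorized call, which amortizes the per-call overhead. A sketch; `X` is a hypothetical 2D array holding one candidate vector per row:

```python
import numpy as np

def euclidean_dist_square_batch(X, y):
    """Squared Euclidean distance from each row of X to y, in one call."""
    diff = X - y                              # broadcasts y across the rows of X
    return np.einsum('ij,ij->i', diff, diff)  # row-wise dot products

# Illustrative data: 1000 candidates, 128 dimensions each.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 128))
y = rng.standard_normal(128)

d = euclidean_dist_square_batch(X, y)  # shape: (1000,)
nearest = int(np.argmin(d))            # index of the nearest neighbor
```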
- Welcome to Code Review! This will likely get you a performance review - if you want your complete code peer reviewed, feel free to embed it into your question as well. – Mathieu Guindon, Jun 1, 2014
- You should also check if you can save calls to this function. Then you could see if there are assumptions you can make - will several x/y appear multiple times, is the data sorted, and things like that. What I am trying to say is: question the algorithm. You didn't really say what the purpose of the calculations was - clustering, nearest-neighbor calculations? – kutschkem, Jun 2, 2014
2 Answers
You can decrease Python's overhead by parsing the args tuple and creating the return value manually.
Try changing these portions of your code:
/* Parse the input tuple */
if (!PyArg_ParseTuple(args, "OO", &x_obj, &y_obj))
    return NULL;
<< code snipped >>
/* Build the output tuple */
PyObject *ret = Py_BuildValue("d", value);
return ret;
to this (untested) code:
/* Parse the input tuple manually */
if (PyTuple_GET_SIZE(args) != 2) {
    PyErr_SetString(PyExc_TypeError, "requires 2 arguments");
    return NULL;
}
x_obj = PyTuple_GET_ITEM(args, 0);
y_obj = PyTuple_GET_ITEM(args, 1);
<< code snipped >>
/* Return double as Python float */
return PyFloat_FromDouble(value);
Updates
Without any changes, your code takes ~340 ns on my computer. Changing the argument processing and how the return value is created decreases the running time to ~295 ns. Replacing pow() with multiplication didn't change the performance with my compiler (GCC 4.8.2). Changing the exponent to a non-integral value significantly increased the running time, so I assume constant integer exponents are optimized by the compiler.
Next, I simplified the conversion of the numpy arrays by checking that the objects passed are numpy arrays. This avoids creating new objects. I also assumed that the data type is a "float64". Extra checks could be added if needed. With these changes, the running time is ~220ns. Code is below:
static PyObject *euclidean_euclidean(PyObject *self, PyObject *args)
{
    PyObject *x_obj, *y_obj;

    if (PyTuple_GET_SIZE(args) != 2) {
        PyErr_SetString(PyExc_TypeError, "requires 2 arguments");
        return NULL;
    }
    x_obj = PyTuple_GET_ITEM(args, 0);
    y_obj = PyTuple_GET_ITEM(args, 1);

    if (!PyArray_Check(x_obj) || !PyArray_Check(y_obj)) {
        PyErr_SetString(PyExc_TypeError, "requires array arguments");
        return NULL;
    }

    /* Get pointers to the data as C-types. */
    double *x = (double*)PyArray_DATA(x_obj);
    double *y = (double*)PyArray_DATA(y_obj);

    double value = euclidean(x, y);

    if (value < 0.0) {
        PyErr_SetString(PyExc_RuntimeError,
                        "Euclidean returned an impossible value.");
        return NULL;
    }

    /* Return the double as a Python float */
    return PyFloat_FromDouble(value);
}
If I remove the call to euclidean(), the running time is ~75 ns. If I remove all the argument parsing and just return the value 0.0, the running time is ~72 ns. I did a few more tests to confirm the running times: Python's overhead is consistently ~75 ns and the euclidean() function has a running time of ~150 ns.
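For reproducing timings in this range, timeit is the usual tool. A sketch measuring the pure-NumPy baseline the same way (the C extension itself is not included here, so the NumPy function from the question stands in):

```python
import timeit
import numpy as np

def euclidean_dist_square(x, y):
    diff = np.array(x) - np.array(y)
    return np.dot(diff, diff)

x = np.random.rand(128)
y = np.random.rand(128)

# Average over many calls so per-call overhead dominates measurement noise.
n = 100_000
per_call = timeit.timeit(lambda: euclidean_dist_square(x, y), number=n) / n
print(f"{per_call * 1e9:.0f} ns per call")
```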
- My C is super rusty, but I really hope I didn't forget the precedence order between ! and ==. The test if (!PyTuple_GET_SIZE(args) == 2) can never pass. – David Harkness, Jun 2, 2014
-
Fantastic! This did indeed speed up the whole thing a good notch. – herrherr, Jun 2, 2014
-
@DavidHarkness Thanks for spotting that. It should be correct now. – casevh, Jun 2, 2014
pow is really slow, and its use is only warranted for a non-integer exponent. A simple squaring is better done manually:
delta = x[i] - y[i];
sum += delta * delta;
It is also worth mentioning that loops like yours are a textbook example of GPGPU parallelism.
- GCC gets rid of the pow call. – herrherr, Jun 2, 2014