Three-way comparing strings in Python 3

Question 1

Say you want to optimize a (byte) string compare intensive algorithm implemented in Python. Since a central code path contains this sequence of statements

if s < t:
 # less than ...
elif t < s:
 # greater than ...
else:
 # equal ...

it would be great to optimize it to something like

r = bytes_compare(s, t)
if r < 0:
 # less than ...
elif r > 0:
 # greater than ...
else:
 # equal ...

where (the hypothetical) bytes_compare() ideally would just call the three-way comparison C function memcmp() which is usually quite well optimized. This would reduce the number of string comparisons by half. A very feasible optimization unless the strings are ultra short.

But how to get there with Python 3?

PS:

Python 3 has removed the three way comparison global function cmp() and the magic method __cmp__(). And even with Python 2, the bytes class doesn't had a __cmp__() member.

With the ctypes package it's straight forward to call memcmp() but the foreign function call overhead with ctypes is prohibitively high.

Question 2

It sounds like you're doing string comparisons in a loop in Python. That will always be slow. You'd be better off engineering a proper fast solution using related tools like Cython, Pandas, etc. But doing that will require a holistic look at the surrounding code, not microbenchmarking a single string comparison.

Question 3

How is s < t comparison implemented atm, given that s and t are bytes?

Question 4

@JohnZwinck, well, I did some profiling with pyflame and this verified that a significant amount of time is spend in these paired comparisons (s with t followed by t with s). Note that this pattern isn't unusual - for example, you also have this when you merge/join 2 sorted sequences of strings. Also, Python string comparison aren't slow in general - if you just need one comparison outcome for a pair s and t then they are fast, as is. Thus, this really isn't about isolated microbenchmarking.

Question 5

@urban, with Python 3.6.5, there is Objects/bytesobject.c and it calls memcmp() in bytes_compare_eq() and bytes_richcompare() (for < or >).

Question 6

From codegolf.stackexchange.com/a/49779 where they discuss cmp() in python 3, the suggested way is to do: ((a > b) - (b > a)). This will give you the r but ... it is still 2 comparisons. I have tried a python-based implementation of bytes_compare... very bad idea :) at least 3x slower

Question 7

Python 3 (including 3.6) simply doesn't include any three-way comparison support for strings. Although the internal implementation of the rich comparison operator __lt__(), __eq__() etc. do call memcmp() (in the C implementation of bytes - cf. Objects/bytesobject.c) there is no internal three-way comparison function that could be leveraged.

Thus, writing a C extension that provides a three-way comparison function by calling memcmp() is the next best thing:

#include <Python.h>
static PyObject* cmp(PyObject* self, PyObject* args) {
 PyObject *a = 0, *b = 0;
 if (!PyArg_UnpackTuple(args, "cmp", 2, 2, &a, &b))
 return 0;
 if (!PyBytes_Check(a) || !PyBytes_Check(b)) {
 PyErr_SetString(PyExc_TypeError, "only bytes() strings supported");
 return 0;
 }
 Py_ssize_t n = PyBytes_GET_SIZE(a), m = PyBytes_GET_SIZE(b);
 char *s = PyBytes_AsString(a), *t = PyBytes_AsString(b);
 int r = 0;
 if (n == m) {
 r = memcmp(s, t, n);
 } else if (n < m) {
 r = memcmp(s, t, n);
 if (!r)
 r = -1;
 } else {
 r = memcmp(s, t, m);
 if (!r)
 r = 1;
 }
 return PyLong_FromLong(r);
}
static PyMethodDef bytes_util_methods[] = {
 { "cmp", cmp, METH_VARARGS, "Three way compare 2 bytes() objects." },
 {0,0,0,0} };
static struct PyModuleDef bytes_util_def = {
 PyModuleDef_HEAD_INIT, "bytes_util", "Three way comparison for strings.",
 -1, bytes_util_methods };
PyMODINIT_FUNC PyInit_bytes_util(void) {
 Py_Initialize();
 return PyModule_Create(&bytes_util_def);
}

Compile with:

gcc -Wall -O3 -fPIC -shared bytes_util.c -o bytes_util.so -I/usr/include/python3.6m

Test:

>>> import bytes_util
>>> bytes_util.cmp(b'foo', b'barx')
265725

In contrast to calling memcmp via the ctypes package, this foreign call has the same overhead as the builtin bytes comparison operators (as they also are implemented as C extension with the standard Python version).

Question 8

Note: To make this significantly more reusable with near zero overhead, replacing explicit use of the PyBytes APIs with the more general buffer interface (PyObject_GetBuffer with flags PyBUF_SIMPLE) would allow this to work seamlessly with bytearray, mmap, array.array('B'), and all manner of other bytes-like objects. You'd just declare a pair of Py_buffer structures on the stack, populate them w/PyObject_GetBuffer and your inputs, then use buf.buf and buf.len where you currently use s/t and m/n.

maxschlepzig maxschlepzig 40k16 gold badges166 silver badges216 bronze badges · Accepted Answer · 2018-06-15 19:37:44Z

Python 3 (including 3.6) simply doesn't include any three-way comparison support for strings. Although the internal implementation of the rich comparison operator __lt__(), __eq__() etc. do call memcmp() (in the C implementation of bytes - cf. Objects/bytesobject.c) there is no internal three-way comparison function that could be leveraged.

Thus, writing a C extension that provides a three-way comparison function by calling memcmp() is the next best thing:

#include <Python.h>
static PyObject* cmp(PyObject* self, PyObject* args) {
 PyObject *a = 0, *b = 0;
 if (!PyArg_UnpackTuple(args, "cmp", 2, 2, &a, &b))
 return 0;
 if (!PyBytes_Check(a) || !PyBytes_Check(b)) {
 PyErr_SetString(PyExc_TypeError, "only bytes() strings supported");
 return 0;
 }
 Py_ssize_t n = PyBytes_GET_SIZE(a), m = PyBytes_GET_SIZE(b);
 char *s = PyBytes_AsString(a), *t = PyBytes_AsString(b);
 int r = 0;
 if (n == m) {
 r = memcmp(s, t, n);
 } else if (n < m) {
 r = memcmp(s, t, n);
 if (!r)
 r = -1;
 } else {
 r = memcmp(s, t, m);
 if (!r)
 r = 1;
 }
 return PyLong_FromLong(r);
}
static PyMethodDef bytes_util_methods[] = {
 { "cmp", cmp, METH_VARARGS, "Three way compare 2 bytes() objects." },
 {0,0,0,0} };
static struct PyModuleDef bytes_util_def = {
 PyModuleDef_HEAD_INIT, "bytes_util", "Three way comparison for strings.",
 -1, bytes_util_methods };
PyMODINIT_FUNC PyInit_bytes_util(void) {
 Py_Initialize();
 return PyModule_Create(&bytes_util_def);
}

Compile with:

gcc -Wall -O3 -fPIC -shared bytes_util.c -o bytes_util.so -I/usr/include/python3.6m

Test:

>>> import bytes_util
>>> bytes_util.cmp(b'foo', b'barx')
265725

In contrast to calling memcmp via the ctypes package, this foreign call has the same overhead as the builtin bytes comparison operators (as they also are implemented as C extension with the standard Python version).

Note: To make this significantly more reusable with near zero overhead, replacing explicit use of the PyBytes APIs with the more general buffer interface (PyObject_GetBuffer with flags PyBUF_SIMPLE) would allow this to work seamlessly with bytearray, mmap, array.array('B'), and all manner of other bytes-like objects. You'd just declare a pair of Py_buffer structures on the stack, populate them w/PyObject_GetBuffer and your inputs, then use buf.buf and buf.len where you currently use s/t and m/n.

CollectivesTM on Stack Overflow

Three-way comparing strings in Python 3

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

CollectivesTM on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related