This is a k-means implementation using Python (numpy). I believe there is room for improvement when it comes to computing the distances (since I'm using a list comprehension, maybe this could be packed into a numpy operation) and to computing the centroids from label-wise means (which I think could also be packed into a numpy operation).
import numpy as np


def k_means(data, k=2, max_iter=100):
    """Assigns data points into clusters using the k-means algorithm.

    Parameters
    ----------
    data : ndarray
        A 2D array containing data points to be clustered.
    k : int, optional
        Number of clusters (default = 2).
    max_iter : int, optional
        Maximum number of iterations (default = 100).

    Returns
    -------
    labels : ndarray
        A 1D array of labels for their respective input data points.
    """
    # data_max/data_min : arrays containing column-wise maximum/minimum values
    data_max = np.max(data, axis=0)
    data_min = np.min(data, axis=0)

    n_samples = data.shape[0]
    n_features = data.shape[1]

    # labels : array containing labels for data points, randomly initialized
    labels = np.random.randint(low=0, high=k, size=n_samples)

    # centroids : 2D array containing centroids for the k-means algorithm,
    # randomly initialized s.t. data_min <= centroid < data_max
    centroids = np.random.uniform(low=0., high=1., size=(k, n_features))
    centroids = centroids * (data_max - data_min) + data_min

    # k-means algorithm
    for i in range(max_iter):
        # distances : between data points and centroids
        distances = np.array(
            [np.linalg.norm(data - c, axis=1) for c in centroids])

        # new_labels : computed by finding the centroid with minimal distance
        new_labels = np.argmin(distances, axis=0)

        if (labels == new_labels).all():
            # labels unchanged
            labels = new_labels
            print('Labels unchanged! Terminating k-means.')
            break
        else:
            # labels changed
            # difference : fraction of labels that changed (printed as a percentage)
            difference = np.mean(labels != new_labels)
            print('%.4f%% labels changed' % (difference * 100))
            labels = new_labels
            for c in range(k):
                # compute each centroid as the mean over its associated data points
                centroids[c] = np.mean(data[labels == c], axis=0)

    return labels
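For reference, here is a minimal usage sketch, continuing from the code above (the two synthetic 2D blobs are just illustrative test data):

# Illustrative test data: two well-separated 2D blobs.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(100, 2))
data = np.vstack([blob_a, blob_b])

labels = k_means(data, k=2)
print(labels.shape)  # (200,)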
1 Answer
The code looks good; I wouldn't say anything needs to be changed. In line with your question about possible ways to use Numpy better, I'll offer a few things you could try:
You can pass per-feature low/high bounding arrays directly into np.random.uniform, like

centroids = np.random.uniform(low=data_min, high=data_max, size=(k, n_features))

rather than adjust the dimension-specific bounds after the fact. This is only done once, so it doesn't matter w.r.t. performance, but it cleans the code up a little.

You can pack the distance computations into purely numpy code, but I don't know if I'd recommend it. Generally, K is very small compared to the size of your data, so your list comprehension is extremely cheap and readable. The alternative numpy code would abuse broadcasting to trick numpy into doing the looping for you, and would look something like this:

distances = np.linalg.norm(
    np.expand_dims(data, 2) - np.expand_dims(centroids.T, 0), axis=1)
new_labels = distances.argmin(axis=1)
I wouldn't suggest using that code, though, since its performance gains are likely minimal (you can check for your own use case) and it's much less readable than what you currently have.
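If you do want to compare the two, a quick sanity check could look like this (a sketch, assuming data and centroids are defined as in the question):

# Illustrative equivalence check between the two distance computations.
distances_loop = np.array(
    [np.linalg.norm(data - c, axis=1) for c in centroids])  # shape (k, n_samples)
distances_bcast = np.linalg.norm(
    np.expand_dims(data, 2) - np.expand_dims(centroids.T, 0), axis=1)  # shape (n_samples, k)

# Same distances, up to transposition and floating-point noise...
assert np.allclose(distances_loop.T, distances_bcast)

# ...and therefore the same label assignments.
assert np.array_equal(distances_loop.argmin(axis=0), distances_bcast.argmin(axis=1))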
You're right to be concerned about the centroid computation, since your current code is effectively O(K * N), whereas there's clearly a solution that's just O(N). The challenge, however, is to do so without losing the simplicity/readability of your code combined with numpy's inherent speed. Here's an interesting discussion about techniques for a similar problem. Of those approaches, my quick tests say that the np.bincount method is fastest, but it doesn't work with multi-dimensional "weights", which is essentially what we need. My only thought that might be faster than your current approach is to write the code simply, then use Numba to JIT-compile it to get C-level performance. That code would look like:

import numba

@numba.jit
def mean_vecs_by_group(mat, groups, num_groups=None):
    # Accumulate per-group sums and counts in a single pass over the data.
    sum_vecs = np.zeros(shape=(num_groups, mat.shape[1]))
    counts = np.zeros(shape=(num_groups,))
    for g, vec in zip(groups, mat):
        sum_vecs[g] += vec
        counts[g] += 1
    # Divide each group's sum by its count to get the group means.
    return sum_vecs / np.expand_dims(counts, -1)
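If you went that route, it would replace the per-cluster loop in the question roughly like this (a sketch, assuming every cluster keeps at least one member, since an empty cluster would divide by zero):

# Replaces the `for c in range(k)` centroid update.
centroids = mean_vecs_by_group(data, labels, num_groups=k)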
However, in my test runs, this performs 1.5-10x slower than your current numpy code. I don't know of any other way in Numpy to do this all in one iteration through the array (if someone else knows, please comment). I suspect that in most cases, your current code is both the most maintainable solution and very competitive in performance.