This is a k-means implementation using Python (numpy). I believe there is room for improvement when it comes to computing the distances (since I'm using a list comprehension, maybe this could be packed into a numpy operation) and to computing the centroids from label-wise means (which I think could also be packed into a numpy operation).
import numpy as np


def k_means(data, k=2, max_iter=100):
    """Assigns data points into clusters using the k-means algorithm.

    Parameters
    ----------
    data : ndarray
        A 2D array containing data points to be clustered.
    k : int, optional
        Number of clusters (default = 2).
    max_iter : int, optional
        Maximum number of iterations (default = 100).

    Returns
    -------
    labels : ndarray
        A 1D array of labels for their respective input data points.
    """
    # data_max/data_min : arrays containing column-wise maximum/minimum values
    data_max = np.max(data, axis=0)
    data_min = np.min(data, axis=0)

    n_samples = data.shape[0]
    n_features = data.shape[1]

    # labels : array containing labels for data points, randomly initialized
    labels = np.random.randint(low=0, high=k, size=n_samples)

    # centroids : 2D array containing centroids for the k-means algorithm,
    # randomly initialized s.t. data_min <= centroid < data_max
    centroids = np.random.uniform(low=0., high=1., size=(k, n_features))
    centroids = centroids * (data_max - data_min) + data_min

    # k-means algorithm
    for i in range(max_iter):
        # distances : between data points and centroids
        distances = np.array(
            [np.linalg.norm(data - c, axis=1) for c in centroids])

        # new_labels : computed by finding the centroid with minimal distance
        new_labels = np.argmin(distances, axis=0)

        if (labels == new_labels).all():
            # labels unchanged
            labels = new_labels
            print('Labels unchanged! Terminating k-means.')
            break
        else:
            # labels changed
            # difference : fraction of labels that changed (printed as a percentage)
            difference = np.mean(labels != new_labels)
            print('%.4f%% labels changed' % (difference * 100))
            labels = new_labels
            for c in range(k):
                # compute each centroid as the mean over its associated data points
                centroids[c] = np.mean(data[labels == c], axis=0)

    return labels
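For reference, here is a minimal usage sketch, continuing from the code above (the two synthetic 2D blobs are just illustrative test data):

# Illustrative test data: two well-separated 2D blobs.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(100, 2))
data = np.vstack([blob_a, blob_b])

labels = k_means(data, k=2)
print(labels.shape)  # (200,)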
1 Answer
The code looks good; I wouldn't say anything needs to be changed. In line with your question about possible ways to use Numpy better, I'll offer a few things you could try:
You can pass per-feature low/high bounding arrays directly into np.random.uniform, like

centroids = np.random.uniform(low=data_min, high=data_max, size=(k, n_features))

rather than adjust the dimension-specific bounds after the fact. This is only done once, so it doesn't matter w.r.t. performance, but it cleans the code up a little.

You can pack the distance computations into purely numpy code, but I don't know if I'd recommend it. Generally, K is very small compared to the size of your data, so your list comprehension is extremely cheap and readable. The alternative numpy code would abuse broadcasting to trick numpy into doing the looping for you, and would look something like this:

distances = np.linalg.norm(
    np.expand_dims(data, 2) - np.expand_dims(centroids.T, 0), axis=1)
new_labels = distances.argmin(axis=1)
I wouldn't suggest using that code, though, since its performance gains are likely minimal (you can check for your own use case) and it's much less readable than what you currently have.
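If you do want to compare the two, a quick sanity check could look like this (a sketch, assuming data and centroids are defined as in the question):

# Illustrative equivalence check between the two distance computations.
distances_loop = np.array(
    [np.linalg.norm(data - c, axis=1) for c in centroids])  # shape (k, n_samples)
distances_bcast = np.linalg.norm(
    np.expand_dims(data, 2) - np.expand_dims(centroids.T, 0), axis=1)  # shape (n_samples, k)

# Same distances, up to transposition and floating-point noise...
assert np.allclose(distances_loop.T, distances_bcast)

# ...and therefore the same label assignments.
assert np.array_equal(distances_loop.argmin(axis=0), distances_bcast.argmin(axis=1))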
You're right to be concerned about the centroid computation, since your current code is effectively O(K * N), whereas there's clearly a solution that's just O(N). The challenge, however, is to do so without losing the simplicity/readability of your code combined with numpy's inherent speed. Here's an interesting discussion about techniques for a similar problem. Of those approaches, my quick tests say that the np.bincount method is fastest, but it doesn't work with multi-dimensional "weights", which is essentially what we need. My only thought that might be faster than your current approach is to write the code simply, then use Numba to JIT-compile it to get C-level performance. That code would look like:

import numba

@numba.jit
def mean_vecs_by_group(mat, groups, num_groups=None):
    # Accumulate per-group sums and counts in a single pass over the data.
    sum_vecs = np.zeros(shape=(num_groups, mat.shape[1]))
    counts = np.zeros(shape=(num_groups,))
    for g, vec in zip(groups, mat):
        sum_vecs[g] += vec
        counts[g] += 1
    # Divide each group's sum by its count to get the group means.
    return sum_vecs / np.expand_dims(counts, -1)
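If you went that route, it would replace the per-cluster loop in the question roughly like this (a sketch, assuming every cluster keeps at least one member, since an empty cluster would divide by zero):

# Replaces the `for c in range(k)` centroid update.
centroids = mean_vecs_by_group(data, labels, num_groups=k)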
However, in my test runs, this performs 1.5-10x slower than your current numpy code. I don't know of any other way in Numpy to do this all in one iteration through the array (if someone else knows, please comment). I suspect that in most cases, your current code is both the most maintainable solution and very competitive in performance.