
This is a k-means implementation using Python (numpy). I believe there is room for improvement when it comes to computing the distances (I'm using a list comprehension, which could perhaps be packed into a numpy operation) and to computing the centroids as label-wise means (which I think could also be packed into a numpy operation).

import numpy as np


def k_means(data, k=2, max_iter=100):
    """Assigns data points into clusters using the k-means algorithm.

    Parameters
    ----------
    data : ndarray
        A 2D array containing the data points to be clustered.
    k : int, optional
        Number of clusters (default = 2).
    max_iter : int, optional
        Maximum number of iterations (default = 100).

    Returns
    -------
    labels : ndarray
        A 1D array of cluster labels, one per input data point.
    """
    # data_max/data_min : arrays of column-wise maximum/minimum values
    data_max = np.max(data, axis=0)
    data_min = np.min(data, axis=0)

    n_samples = data.shape[0]
    n_features = data.shape[1]

    # labels : array of labels for the data points, randomly initialized
    labels = np.random.randint(low=0, high=k, size=n_samples)

    # centroids : 2D array of centroids for the k-means algorithm,
    # randomly initialized s.t. data_min <= centroid < data_max
    centroids = np.random.uniform(low=0., high=1., size=(k, n_features))
    centroids = centroids * (data_max - data_min) + data_min

    # k-means algorithm
    for i in range(max_iter):
        # distances : between data points and centroids
        distances = np.array(
            [np.linalg.norm(data - c, axis=1) for c in centroids])
        # new_labels : computed by finding the centroid with minimal distance
        new_labels = np.argmin(distances, axis=0)

        if (labels == new_labels).all():
            # labels unchanged
            labels = new_labels
            print('Labels unchanged! Terminating k-means.')
            break
        else:
            # labels changed
            # difference : fraction of labels that changed
            difference = np.mean(labels != new_labels)
            print('%4f%% labels changed' % (difference * 100))
            labels = new_labels
            # recompute each centroid as the mean of its associated data points
            for c in range(k):
                centroids[c] = np.mean(data[labels == c], axis=0)

    return labels
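
For context, the function can be exercised on synthetic data along the following lines. This is just an illustrative sketch (the two Gaussian blobs and the fixed seed are arbitrary, not part of the reviewed code):

if __name__ == '__main__':
    np.random.seed(0)
    # two well-separated 2-D blobs of 100 points each
    blob_a = np.random.randn(100, 2) + 5.0
    blob_b = np.random.randn(100, 2) - 5.0
    points = np.vstack([blob_a, blob_b])
    cluster_labels = k_means(points, k=2)
    print(cluster_labels[:5], cluster_labels[-5:])
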
asked Oct 7, 2018 at 10:05

1 Answer

The code looks good; I wouldn't point to anything as needing to be changed. In line with your question about possible ways to use numpy better, I'll offer a few things you could try:

  • You can pass the per-feature bounding arrays directly as low and high to np.random.uniform, rather than rescaling a unit-interval draw after the fact. This is only done once, so it doesn't matter w.r.t. performance, but it cleans the code up a little; for example:
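
    A sketch of that one-step initialization, assuming data_min, data_max, k and n_features as defined in the question (np.random.uniform broadcasts array-valued bounds against the requested size):

        # draw each centroid coordinate directly in [data_min[j], data_max[j])
        centroids = np.random.uniform(low=data_min, high=data_max,
                                      size=(k, n_features))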

  • You can pack the distance computations into purely numpy code, but I don't know if I'd recommend it. Generally, K is very small compared to the size of your data, so your list comprehension is extremely cheap and readable. The alternative numpy code would abuse broadcasting to trick numpy into doing the looping for you, and would look something like this:

        # (n, f, 1) - (1, f, k) -> (n, f, k); norm over axis 1 -> (n, k) distances
        distances = np.linalg.norm(
            np.expand_dims(data, 2) - np.expand_dims(centroids.T, 0), axis=1)
        new_labels = distances.argmin(axis=1)
    

    I wouldn't suggest using that code, though, since its performance gains are likely minimal (you can check for your own use case) and it's much less readable than what you currently have.

  • You're right to be concerned about the centroid computation, since your current code is effectively O(K * N), whereas there's clearly a solution that's just O(N). The challenge, however, is to get there without losing the simplicity/readability of your code combined with numpy's inherent speed. Here's an interesting discussion about techniques for a similar problem. Of those approaches, my quick tests say that the np.bincount method is fastest, but it doesn't work with multi-dimensional "weights", which is essentially what we need (a per-feature workaround is sketched at the end of this answer). My only thought that might be faster than your current approach is to write the code simply, then use Numba to JIT-compile it to get C-level performance. That code would look like:

        import numba

        @numba.jit
        def mean_vecs_by_group(mat, groups, num_groups=None):
            # accumulate per-group sums and counts in a single pass over the rows
            sum_vecs = np.zeros(shape=(num_groups, mat.shape[1]))
            counts = np.zeros(shape=(num_groups,))
            for g, vec in zip(groups, mat):
                sum_vecs[g] += vec
                counts[g] += 1
            return sum_vecs / np.expand_dims(counts, -1)
    

    However, in my test runs, this performs 1.5-10x slower than your current numpy code. I don't know of any other way in Numpy to do this all in one iteration through the array (if someone else knows, please comment). I suspect that in most cases, your current code is both the most maintainable solution and has very competitive performance.
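
    That said, if you do want to stay in pure numpy, np.bincount can still be applied one feature column at a time. This is only a sketch (it loops over the columns in Python, so it is not a single pass, and empty clusters yield NaN rows just as np.mean over an empty slice would); labels, data and k are as in the question:

        # one bincount per feature column: label-wise weighted sums of that feature
        sums = np.stack(
            [np.bincount(labels, weights=data[:, j], minlength=k)
             for j in range(data.shape[1])], axis=1)
        counts = np.bincount(labels, minlength=k)  # points per cluster
        centroids = sums / counts[:, None]         # (k, n_features) per-cluster means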

answered Oct 8, 2018 at 15:14
