I have implemented the K-Mean clustering algorithm in NumPy:
from __future__ import division
import numpy as np

def kmean_step(centroids, datapoints):
    ds = centroids[:, np.newaxis] - datapoints
    e_dists = np.sqrt(np.sum(np.square(ds), axis=-1))
    cluster_allocs = np.argmin(e_dists, axis=0)
    clusters = [datapoints[cluster_allocs == ci] for ci in range(len(centroids))]
    new_centroids = np.asarray([(1 / len(cl)) * np.sum(cl, axis=0)
                                for cl in clusters if len(cl) > 0])
    return new_centroids, clusters

def kmean(centroids, datapoints, n_gen, do_print=True):
    clusters = None
    for ii in range(1, n_gen):
        new_centroids, clusters = kmean_step(centroids, datapoints)
        if np.array_equal(centroids, new_centroids):
            print "After %i generations stabilised" % ii
            break
        else:
            centroids = new_centroids
        if do_print:
            print_clusters(clusters, centroids)
    return clusters
I have left out the print_clusters method. As expected, it displays the current clusters using coloured plots in matplotlib.
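For context, here is roughly how I call it. The data and the choice of initial centroids below are made up purely for illustration:

import numpy as np

# Two made-up 2-D blobs, one around (0, 0) and one around (10, 10).
datapoints = np.vstack([np.random.randn(50, 2),
                        np.random.randn(50, 2) + 10])

# Pick two of the datapoints at random as the initial centroids.
initial = datapoints[np.random.choice(len(datapoints), 2, replace=False)]

clusters = kmean(initial, datapoints, n_gen=20, do_print=False)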
Issues

- kmean_step seems really ugly for an algorithm that can be described by:
  - Recluster the dataset around the centroids;
  - Recenter the centroids with their clusters.

- I am a bit unsure about the termination conditions in the kmean function. I have read that they are:
  - Centroids don't move,
  - or cluster membership does not change,
  - or the number of iterations exceeds a preset maximum.

  I have the first and last of those, but the second seems like it will happen automatically given the first: if the cluster membership doesn't change, then the centroids don't move. Do I really need to include it? (It is surprisingly hard to implement; see the sketch after this list.) Or is it only included in the algorithm description for clarity?

- Is this an optimal way to implement it, to fully take advantage of vectorized operations? I am a bit iffy about having two list comprehensions, particularly since one immediately casts to an ndarray.
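For concreteness, here is the sort of thing I have in mind for the membership-based test: it compares the cluster_allocs array between successive steps rather than comparing the clusters themselves. This is an untested sketch, not code I am actually using:

def kmean_membership(centroids, datapoints, n_gen):
    # Sketch only: stops when no datapoint changes cluster between steps.
    # Assumes n_gen >= 1 and that no cluster ever becomes empty.
    prev_allocs = None
    for ii in range(n_gen):
        ds = centroids[:, np.newaxis] - datapoints
        e_dists = np.sqrt(np.sum(np.square(ds), axis=-1))
        cluster_allocs = np.argmin(e_dists, axis=0)
        if prev_allocs is not None and np.array_equal(cluster_allocs, prev_allocs):
            break
        prev_allocs = cluster_allocs
        centroids = np.asarray([datapoints[cluster_allocs == ci].mean(axis=0)
                                for ci in range(len(centroids))])
    return cluster_allocs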
1 Answer
1. Code review
- There's no documentation. What do your functions do and how am I supposed to call them?

- There are no test cases. What gives you confidence that the algorithm is correct?

- The algorithm is called "$k$-means" (not "$k$-mean") so I would name the functions accordingly.

- The do_print mechanism looks as if it was added to help with debugging and then you forgot to remove it. Surely no user of this code is going to be interested in this? It would be better to use Python's built-in debugging facilities so that you don't need to add this kind of debugging code (see the logging sketch after this list). Similarly for the "After %i generations stabilised" message.

- The steps in the $k$-means algorithm are not normally known as generations, so you might want to choose a clearer name, for example steps or iterations.

- The iterator range is exclusive at the top end, so range(1, n_gen) iterates from 1 up to n_gen - 1 inclusive, which is probably not what you meant.

- If n_gen is less than 2, kmean returns None instead of a list of clusters. It would be better to raise an exception in this case, so that the caller doesn't have to handle the exceptional return value. Alternatively, you could ensure that at least one step is taken. (Which is what I've done in my revised code below.)

- Your chosen format for the return value of kmean (a list of clusters, each cluster being an array of datapoints) is likely to be inconvenient for further processing by NumPy. Generally in NumPy we want everything to be returned as uniform arrays. So if you do want to return the clusters (rather than the centroids), it would be better to return them in the form of an array giving the cluster number for each datapoint (that is, in the format of your cluster_allocs array).

- Your implementation omits the initialization step (that is, the selection of initial candidate values for the centroids). It would make sense to include at least one initialization method (for example, Forgy) so that the burden is not imposed on the caller.

- The kmean_step operations are hard to follow because the variables are not clearly named (what is ds? what is the e in e_dists?) and because you have expressed all your loops in the form of list comprehensions. When a comprehension becomes so long and unwieldy that you feel the need to use obscure variable names like ci and cl, it's probably time to rewrite it as a loop.

- The function kmean_step is not likely to be useful to the caller, so there's no good reason to put this part of the code in its own function. The component of the algorithm that's likely to be generally useful is the computation of the centroids of the clusters.

- Pairwise distances can be computed using scipy.spatial.distance.cdist.

- You don't need to take square roots: squared distances are sufficient to determine the closest centroid to each data point. So if you are using scipy.spatial.distance.cdist, then you'd use the sqeuclidean metric.
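For example, here is a sketch (with illustrative names, not a fixed API) of how the do_print flag could be replaced by the standard logging module, so that diagnostics can be switched on without changing the function's interface:

import logging
import numpy as np

logger = logging.getLogger(__name__)

def kmean(centroids, datapoints, n_gen):
    # kmean_step is the function from the question above.
    clusters = None
    for ii in range(1, n_gen):
        new_centroids, clusters = kmean_step(centroids, datapoints)
        if np.array_equal(centroids, new_centroids):
            # Emitted only when the caller enables DEBUG-level logging,
            # e.g. with logging.basicConfig(level=logging.DEBUG).
            logger.debug("stabilised after %d steps", ii)
            break
        centroids = new_centroids
    return clusters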
2. Revised code
Note the docstrings, the example code (which you can run as test cases using the doctest module), and the simple and straightforward implementation. (It is a shame that there has to be a loop in cluster_centroids, but I don't know how to avoid it.)
import numpy as np

def cluster_centroids(data, clusters, k=None):
    """Return centroids of clusters in data.

    data is an array of observations with shape (A, B, ...).

    clusters is an array of integers of shape (A,) giving the index
    (from 0 to k-1) of the cluster to which each observation belongs.
    The clusters must all be non-empty.

    k is the number of clusters. If omitted, it is deduced from the
    values in the clusters array.

    The result is an array of shape (k, B, ...) containing the
    centroid of each cluster.

    >>> data = np.array([[12, 10, 87],
    ...                  [ 2, 12, 33],
    ...                  [68, 31, 32],
    ...                  [88, 13, 66],
    ...                  [79, 40, 89],
    ...                  [ 1, 77, 12]])
    >>> cluster_centroids(data, np.array([1, 1, 2, 2, 0, 1]))
    array([[ 79.,  40.,  89.],
           [  5.,  33.,  44.],
           [ 78.,  22.,  49.]])

    """
    if k is None:
        k = np.max(clusters) + 1
    result = np.empty(shape=(k,) + data.shape[1:])
    for i in range(k):
        np.mean(data[clusters == i], axis=0, out=result[i])
    return result
import scipy.spatial

def kmeans(data, k=None, centroids=None, steps=20):
    """Divide the observations in data into clusters using the k-means
    algorithm, and return an array of integers assigning each data
    point to one of the clusters.

    centroids, if supplied, must be an array giving the initial
    position of the centroids of each cluster.

    If centroids is omitted, the number k gives the number of clusters
    and the initial positions of the centroids are selected randomly
    from the data.

    The k-means algorithm adjusts the centroids iteratively for the
    given number of steps, or until no further progress can be made.

    >>> data = np.array([[12, 10, 87],
    ...                  [ 2, 12, 33],
    ...                  [68, 31, 32],
    ...                  [88, 13, 66],
    ...                  [79, 40, 89],
    ...                  [ 1, 77, 12]])
    >>> np.random.seed(73)
    >>> kmeans(data, k=3)
    array([1, 1, 2, 2, 0, 1])

    """
    if centroids is not None and k is not None:
        assert(k == len(centroids))
    elif centroids is not None:
        k = len(centroids)
    elif k is not None:
        # Forgy initialization method: choose k data points randomly.
        centroids = data[np.random.choice(np.arange(len(data)), k, False)]
    else:
        raise RuntimeError("Need a value for k or centroids.")

    for _ in range(max(steps, 1)):
        # Squared distances between each point and each centroid.
        sqdists = scipy.spatial.distance.cdist(centroids, data, 'sqeuclidean')

        # Index of the closest centroid to each data point.
        clusters = np.argmin(sqdists, axis=0)

        new_centroids = cluster_centroids(data, clusters, k)
        if np.array_equal(new_centroids, centroids):
            break

        centroids = new_centroids

    return clusters
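If you save the module as, say, kmeans.py (the file name is my choice, not anything fixed), the doctests can be run with python -m doctest kmeans.py, or by adding the usual hook at the bottom of the module:

if __name__ == '__main__':
    import doctest
    doctest.testmod()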