I have implemented the K-Mean clustering algorithm in NumPy:
from __future__ import division
import numpy as np

def kmean_step(centroids, datapoints):
    ds = centroids[:, np.newaxis] - datapoints
    e_dists = np.sqrt(np.sum(np.square(ds), axis=-1))
    cluster_allocs = np.argmin(e_dists, axis=0)
    clusters = [datapoints[cluster_allocs == ci] for ci in range(len(centroids))]
    new_centroids = np.asarray([(1 / len(cl)) * np.sum(cl, axis=0)
                                for cl in clusters if len(cl) > 0])
    return new_centroids, clusters

def kmean(centroids, datapoints, n_gen, do_print=True):
    clusters = None
    for ii in range(1, n_gen):
        new_centroids, clusters = kmean_step(centroids, datapoints)
        if np.array_equal(centroids, new_centroids):
            print "After %i generations stabilised" % ii
            break
        else:
            centroids = new_centroids
        if do_print:
            print_clusters(clusters, centroids)
    return clusters
I have left out the print_clusters method. As expected, it displays the current clusters using coloured plots in matplotlib.
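For context, here is roughly how I call it. The data and the choice of initial centroids below are made up purely for illustration:

import numpy as np

# Two made-up 2-D blobs, one around (0, 0) and one around (10, 10).
datapoints = np.vstack([np.random.randn(50, 2),
                        np.random.randn(50, 2) + 10])

# Pick two of the datapoints at random as the initial centroids.
initial = datapoints[np.random.choice(len(datapoints), 2, replace=False)]

clusters = kmean(initial, datapoints, n_gen=20, do_print=False)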
Issues

- kmean_step seems really ugly for an algorithm that can be described by:
  - Recluster the dataset around the centroids;
  - Recenter the centroids with their clusters.

- I am a bit unsure about the termination conditions in the kmean function. I have read that they are:
  - Centroids don't move,
  - or cluster membership does not change,
  - or the number of iterations exceeds a preset maximum.

  I have the first and last of those, but the second seems like it will happen automatically given the first: if the cluster membership doesn't change, then the centroids don't move. Do I really need to include it? (It is surprisingly hard to implement; see the sketch after this list.) Or is it only included in the algorithm description for clarity?

- Is this an optimal way to implement it, to fully take advantage of vectorized operations? I am a bit iffy about having two list comprehensions, particularly since one immediately casts to an ndarray.
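For concreteness, here is the sort of thing I have in mind for the membership-based test: it compares the cluster_allocs array between successive steps rather than comparing the clusters themselves. This is an untested sketch, not code I am actually using:

def kmean_membership(centroids, datapoints, n_gen):
    # Sketch only: stops when no datapoint changes cluster between steps.
    # Assumes n_gen >= 1 and that no cluster ever becomes empty.
    prev_allocs = None
    for ii in range(n_gen):
        ds = centroids[:, np.newaxis] - datapoints
        e_dists = np.sqrt(np.sum(np.square(ds), axis=-1))
        cluster_allocs = np.argmin(e_dists, axis=0)
        if prev_allocs is not None and np.array_equal(cluster_allocs, prev_allocs):
            break
        prev_allocs = cluster_allocs
        centroids = np.asarray([datapoints[cluster_allocs == ci].mean(axis=0)
                                for ci in range(len(centroids))])
    return cluster_allocs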
1 Answer
1. Code review
- There's no documentation. What do your functions do and how am I supposed to call them?

- There are no test cases. What gives you confidence that the algorithm is correct?

- The algorithm is called "$k$-means" (not "$k$-mean") so I would name the functions accordingly.

- The do_print mechanism looks as if it was added to help with debugging and then you forgot to remove it. Surely no user of this code is going to be interested in this? It would be better to use Python's built-in debugging facilities so that you don't need to add this kind of debugging code (see the logging sketch after this list). Similarly for the "After %i generations stabilised" message.

- The steps in the $k$-means algorithm are not normally known as generations, so you might want to choose a clearer name, for example steps or iterations.

- The iterator range is exclusive at the top end, so range(1, n_gen) iterates from 1 up to n_gen - 1 inclusive, which is probably not what you meant.

- If n_gen is less than 2, kmean returns None instead of a list of clusters. It would be better to raise an exception in this case, so that the caller doesn't have to handle the exceptional return value. Alternatively, you could ensure that at least one step is taken. (Which is what I've done in my revised code below.)

- Your chosen format for the return value of kmean (a list of clusters, each cluster being an array of datapoints) is likely to be inconvenient for further processing by NumPy. Generally in NumPy we want everything to be returned as uniform arrays. So if you do want to return the clusters (rather than the centroids), it would be better to return them in the form of an array giving the cluster number for each datapoint (that is, in the format of your cluster_allocs array).

- Your implementation omits the initialization step (that is, the selection of initial candidate values for the centroids). It would make sense to include at least one initialization method (for example, Forgy) so that the burden is not imposed on the caller.

- The kmean_step operations are hard to follow because the variables are not clearly named (what is ds? what is the e in e_dists?) and because you have expressed all your loops in the form of list comprehensions. When a comprehension becomes so long and unwieldy that you feel the need to use obscure variable names like ci and cl, it's probably time to rewrite it as a loop.

- The function kmean_step is not likely to be useful to the caller, so there's no good reason to put this part of the code in its own function. The component of the algorithm that's likely to be generally useful is the computation of the centroids of the clusters.

- Pairwise distances can be computed using scipy.spatial.distance.cdist.

- You don't need to take square roots: squared distances are sufficient to determine the closest centroid to each data point. So if you are using scipy.spatial.distance.cdist, then you'd use the sqeuclidean metric.
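For example, here is a sketch (with illustrative names, not a fixed API) of how the do_print flag could be replaced by the standard logging module, so that diagnostics can be switched on without changing the function's interface:

import logging
import numpy as np

logger = logging.getLogger(__name__)

def kmean(centroids, datapoints, n_gen):
    # kmean_step is the function from the question above.
    clusters = None
    for ii in range(1, n_gen):
        new_centroids, clusters = kmean_step(centroids, datapoints)
        if np.array_equal(centroids, new_centroids):
            # Emitted only when the caller enables DEBUG-level logging,
            # e.g. with logging.basicConfig(level=logging.DEBUG).
            logger.debug("stabilised after %d steps", ii)
            break
        centroids = new_centroids
    return clusters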
2. Revised code
Note the docstrings, the example code (which you can run as test cases using the doctest module), and the simple and straightforward implementation. (It is a shame that there has to be a loop in cluster_centroids, but I don't know how to avoid it.)
import numpy as np

def cluster_centroids(data, clusters, k=None):
    """Return centroids of clusters in data.

    data is an array of observations with shape (A, B, ...).

    clusters is an array of integers of shape (A,) giving the index
    (from 0 to k-1) of the cluster to which each observation belongs.
    The clusters must all be non-empty.

    k is the number of clusters. If omitted, it is deduced from the
    values in the clusters array.

    The result is an array of shape (k, B, ...) containing the
    centroid of each cluster.

    >>> data = np.array([[12, 10, 87],
    ...                  [ 2, 12, 33],
    ...                  [68, 31, 32],
    ...                  [88, 13, 66],
    ...                  [79, 40, 89],
    ...                  [ 1, 77, 12]])
    >>> cluster_centroids(data, np.array([1, 1, 2, 2, 0, 1]))
    array([[ 79.,  40.,  89.],
           [  5.,  33.,  44.],
           [ 78.,  22.,  49.]])

    """
    if k is None:
        k = np.max(clusters) + 1
    result = np.empty(shape=(k,) + data.shape[1:])
    for i in range(k):
        np.mean(data[clusters == i], axis=0, out=result[i])
    return result
import scipy.spatial

def kmeans(data, k=None, centroids=None, steps=20):
    """Divide the observations in data into clusters using the k-means
    algorithm, and return an array of integers assigning each data
    point to one of the clusters.

    centroids, if supplied, must be an array giving the initial
    position of the centroids of each cluster.

    If centroids is omitted, the number k gives the number of clusters
    and the initial positions of the centroids are selected randomly
    from the data.

    The k-means algorithm adjusts the centroids iteratively for the
    given number of steps, or until no further progress can be made.

    >>> data = np.array([[12, 10, 87],
    ...                  [ 2, 12, 33],
    ...                  [68, 31, 32],
    ...                  [88, 13, 66],
    ...                  [79, 40, 89],
    ...                  [ 1, 77, 12]])
    >>> np.random.seed(73)
    >>> kmeans(data, k=3)
    array([1, 1, 2, 2, 0, 1])

    """
    if centroids is not None and k is not None:
        assert(k == len(centroids))
    elif centroids is not None:
        k = len(centroids)
    elif k is not None:
        # Forgy initialization method: choose k data points randomly.
        centroids = data[np.random.choice(np.arange(len(data)), k, False)]
    else:
        raise RuntimeError("Need a value for k or centroids.")

    for _ in range(max(steps, 1)):
        # Squared distances between each point and each centroid.
        sqdists = scipy.spatial.distance.cdist(centroids, data, 'sqeuclidean')

        # Index of the closest centroid to each data point.
        clusters = np.argmin(sqdists, axis=0)

        new_centroids = cluster_centroids(data, clusters, k)
        if np.array_equal(new_centroids, centroids):
            break

        centroids = new_centroids

    return clusters
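If you save the module as, say, kmeans.py (the file name is my choice, not anything fixed), the doctests can be run with python -m doctest kmeans.py, or by adding the usual hook at the bottom of the module:

if __name__ == '__main__':
    import doctest
    doctest.testmod()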