Here is my personal implementation of the k-means clustering algorithm.
from scipy.spatial import distance
import numpy as np
import random

# (x, y) coordinates of a point
X = 0
Y = 1


def get_first(k, points):
    return points[0:k]


def cost(cetroids, clusters):
    cost = 0
    for i in range(len(centroids)):
        centroid = centroids[i]
        cluster = clusters[i]
        for point in cluster:
            cost += (distance.euclidean(centroid, point))**2
    return cost


def compute_centroids(clusters):
    centroids = []
    for cluster in clusters:
        centroids.append(np.mean(cluster, axis=0))
    return centroids


def kmeans(k, centroids, points, method, iter):
    clusters = [[] for i in range(k)]
    for i in range(len(points)):
        point = points[i]
        belongs_to_cluster = closest_centroid(point, centroids)
        clusters[belongs_to_cluster].append(point)
    new_centroids = compute_centroids(clusters)
    if not equals(centroids, new_centroids):
        print "Iteration " + str(iter) + ". Cost [k=" + str(k) + ", " + method + "] = " + str(cost(new_centroids, clusters))
        clusters = kmeans(k, new_centroids, points, method, iter+1)
    return clusters


def closest_centroid(point, centroids):
    min_distance = float('inf')
    belongs_to_cluster = None
    for j in range(len(centroids)):
        centroid = centroids[j]
        dist = distance.euclidean(point, centroid)
        if dist < min_distance:
            min_distance = dist
            belongs_to_cluster = j
    return belongs_to_cluster


def contains(point1, points):
    for point2 in points:
        if point1[X] == point2[X] and point1[Y] == point2[Y]:
            return True
    return False


def equals(points1, points2):
    if len(points1) != len(points2):
        return False
    for i in range(len(points1)):
        point1 = points1[i]
        point2 = points2[i]
        if point1[X] != point2[X] or point1[Y] != point2[Y]:
            return False
    return True


if __name__ == "__main__":
    data = [[-19.0748, -8.536],
            [22.0108, -10.9737],
            [12.6597, 19.2601],
            [11.26884087, 19.90132146],
            [15.44640731, 21.13121676],
            [-20.03865146, -8.820872829],
            [-19.65417726, -8.211477352],
            [-15.97295894, -9.648002534],
            [-18.74359696, -5.383551586],
            [-19.453215, -8.146120006],
            [-16.43074088, -7.524968005],
            [-19.75512437, -8.533215751],
            [-19.56237082, -8.798668569],
            [-19.47135573, -8.057217004],
            [-18.60946986, -4.475888949],
            [-21.59368337, -10.38712463],
            [-15.39158057, -3.8336522],
            [-40.0, 40.0]]
    k = 4
    # k-means picking the first k points as centroids
    centroids = get_first(k, data)
    clusters = kmeans(k, centroids, data, "first", 1)
I understood and followed the theory of the algorithm, but as you can see when running the code, the cost increases on each iteration. I am new to Python and I think the problem lies in some misunderstanding on my side regarding list manipulation. Any ideas?
- Phrancis (May 13, 2016 at 20:38): Welcome to Code Review! I hope you get some good answers.
- Daniyal Shahrokhian (May 13, 2016 at 21:47): Also, I am open to suggestions regarding the quality of my code.
1 Answer
N.b. I'm pretty sure you can find optimised replacements of many of these functions in either the NumPy or SciPy library.
Edit: I don't immediately see a reason for a slowdown except the addition of new clusters. A bigger sample set that shows the behaviour would be nice, together with some timings and probably running a profiler on it.
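As a concrete example of such a replacement (a sketch, not a drop-in for the question's exact semantics): SciPy ships a full k-means implementation in `scipy.cluster.vq`, and `kmeans2` with `minit='matrix'` accepts explicit starting centroids, which mirrors the "pick the first k points" initialisation used here. The sample data below is made up for illustration.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Hypothetical sample data, two obvious groups.
data = np.array([[-19.0748, -8.536],
                 [22.0108, -10.9737],
                 [12.6597, 19.2601],
                 [-40.0, 40.0]])

# minit='matrix' interprets the second argument as the initial centroids,
# so passing data[:k] mimics the "first" method from the question.
k = 2
centroids, labels = kmeans2(data, data[:k], minit='matrix')
# centroids has shape (k, 2); labels maps each point to its centroid index.
```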
In general you should likely use a NumPy array or matrix instead of lists of lists, as they are optimised for storing numeric data. That would likely also eliminate the need for some of these functions or considerably reduce their length. Also store all data in the same containers - that way you don't have to create new functions like `equals` and `contains` yourself to compare between different representations.
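For instance, once everything is a 2-D NumPy array, both helper functions collapse into library calls. A minimal sketch with made-up sample data:

```python
import numpy as np

# Hypothetical example data: every point lives in one 2-D array.
points = np.array([[1.0, 2.0], [3.0, 4.0]])
other = np.array([[1.0, 2.0], [3.0, 4.0]])

# equals() becomes a single call:
same = np.array_equal(points, other)

# contains() becomes a vectorised row comparison: broadcast the
# candidate point against every row, then check for a full-row match.
point = np.array([3.0, 4.0])
found = (points == point).all(axis=1).any()
```

Both `same` and `found` are `True` here, with no hand-written loops over coordinates.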
- The `X`/`Y` definitions at the start are more of a WTF for me. I'd get rid of them by writing code that's either independent of the number of dimensions, or just use `0`/`1` - IMO it's not magic enough to warrant separate constants.
- `get_first` isn't necessary - the pattern is obvious enough to not require encapsulation in a function. The zero is also optional.
- Typo in the `cost` signature, which should be `centroids`.
- The pattern `for i in range(len(...)):` appears a couple of times and should be replaced with proper iteration over some helper generator. E.g. in `cost` the parallel iteration over both `centroids` and `clusters` should be done by `zip` (`itertools.izip` in Python 2.7 if used in the same way). It can also be simplified by using another SciPy function, `cdist`.
- If you still need the index, use `enumerate`.
- You can also often get away with using the squared Euclidean distance if the exact value isn't relevant, just the relation between different distances.
- The pattern to initialise a list of empty lists should probably use `_` instead of `i` to make it obvious that the variable serves no further purpose.
- `print` should be used as a function to make it more uniform. Also, use one of the various formatting options (`%` or `format`) to get rid of the additional `str` calls.
- `compute_centroids` can be simplified with a list comprehension.
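Concretely, the `cdist` suggestion replaces the nested loop in `cost` with one vectorised call. This sketch (with made-up sample data) checks that both give the same result:

```python
import numpy as np
from scipy.spatial import distance

centroid = [0.0, 0.0]
cluster = [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]]

# Nested-loop version, as in the original cost():
looped = sum(distance.euclidean(centroid, p) ** 2 for p in cluster)

# cdist computes all squared distances from the centroid in one call:
vectorised = distance.cdist([centroid], cluster, 'sqeuclidean').sum()

assert np.isclose(looped, vectorised)  # both equal 1 + 4 + 25 = 30
```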
Looks like this now:
from scipy.spatial import distance
import numpy as np
import random
from itertools import izip


def cost(centroids, clusters):
    return sum(distance.cdist([centroid], cluster, 'sqeuclidean').sum()
               for centroid, cluster in izip(centroids, clusters))


def compute_centroids(clusters):
    return [np.mean(cluster, axis=0) for cluster in clusters]


def kmeans(k, centroids, points, method):
    clusters = [[] for _ in range(k)]
    for point in points:
        clusters[closest_centroid(point, centroids)].append(point)
    new_centroids = compute_centroids(clusters)
    if not equals(centroids, new_centroids):
        print("cost [k={}, {}] = {}".format(k, method, cost(new_centroids, clusters)))
        clusters = kmeans(k, new_centroids, points, method)
    return clusters


def closest_centroid(point, centroids):
    min_distance = float('inf')
    belongs_to_cluster = None
    for j, centroid in enumerate(centroids):
        dist = distance.sqeuclidean(point, centroid)
        if dist < min_distance:
            min_distance = dist
            belongs_to_cluster = j
    return belongs_to_cluster


def contains(point1, points):
    for point2 in points:
        # alternatively: if all(x == y for x, y in izip(point1, point2)):
        if point1[0] == point2[0] and point1[1] == point2[1]:
            return True
    return False


def equals(points1, points2):
    if len(points1) != len(points2):
        return False
    for point1, point2 in izip(points1, points2):
        # alternatively: if any(x != y for x, y in izip(point1, point2)):
        if point1[0] != point2[0] or point1[1] != point2[1]:
            return False
    return True


if __name__ == "__main__":
    data = [[-19.0748, -8.536],
            [22.0108, -10.9737],
            [12.6597, 19.2601],
            [11.26884087, 19.90132146],
            [15.44640731, 21.13121676],
            [-20.03865146, -8.820872829],
            [-19.65417726, -8.211477352],
            [-15.97295894, -9.648002534],
            [-18.74359696, -5.383551586],
            [-19.453215, -8.146120006],
            [-16.43074088, -7.524968005],
            [-19.75512437, -8.533215751],
            [-19.56237082, -8.798668569],
            [-19.47135573, -8.057217004],
            [-18.60946986, -4.475888949],
            [-21.59368337, -10.38712463],
            [-15.39158057, -3.8336522],
            [-40.0, 40.0]]
    k = 4
    # k-means picking the first k points as centroids
    centroids = data[:k]
    clusters = kmeans(k, centroids, data, "first")
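As a quick sanity check on the squared-Euclidean point above: the nearest centroid is the same whether you compare plain or squared distances, because squaring is monotonic on non-negative values. A small sketch with made-up data:

```python
import numpy as np
from scipy.spatial import distance

point = [1.0, 1.0]
centroids = [[0.0, 0.0], [5.0, 5.0], [2.0, 1.0]]

# argmin is identical for euclidean and sqeuclidean, so closest_centroid
# can skip the square root entirely.
plain = [distance.euclidean(point, c) for c in centroids]
squared = [distance.sqeuclidean(point, c) for c in centroids]

assert int(np.argmin(plain)) == int(np.argmin(squared)) == 2
```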
- Daniyal Shahrokhian (May 14, 2016 at 3:21): This is... just prodigious. Sharp, neat and complete answer. I will revise the whole thing tomorrow morning. Thanks a lot @ferada.