This is the first mini-project that I'm working on in Python, where I implement k-means. I'm planning to parallelize it as soon as I've written a good serial version.
Code description:

Below you will find the working code. First, I describe it; you can skip this part if you want:

`generate_clusters`: generates `k` random clusters with `n` points in `d` dimensions. Each point is assigned to one cluster at a distance of at most `deviation`.
`CLUMPY.__init__`: the user has to specify the number of clusters (i.e., `k`) and can optionally give a `file` containing the dataset to cluster (elements are separated by `delimiter`). Otherwise, they must specify the number of points `n` and their dimension `d`: in this case, `generate_clusters` is called. `iterations` is the limit on the number of iterations of the algorithm. The constructor initializes the most important arrays, i.e. `points`, `centroids`, `class`, `clusters_size` and `clusters_sum` (read the comments to see what they mean).
Design choices:

- I could have used other data structures, but I decided to use `numpy.array`s only, to obtain the best efficiency.
- Seeds are generated by picking randomly among `points`. I'm planning to implement the k-means++ initialization, which has been shown to give better results.
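For reference, here is a minimal sketch of what the planned k-means++ seeding could look like as a standalone function (it does not use the class internals; `points`, `k` and the optional `rng` seed are assumed inputs):

```python
import numpy as np

def kmeans_pp_seeds(points, k, rng=None):
    # k-means++: pick the first seed uniformly at random, then pick each
    # subsequent seed with probability proportional to the squared distance
    # to the nearest seed chosen so far.
    rng = np.random.default_rng(rng)
    n = points.shape[0]
    seeds = [points[rng.integers(n)]]
    for _ in range(k - 1):
        # squared distance from every point to its nearest current seed
        d2 = np.min(
            ((points[:, None, :] - np.array(seeds)[None, :, :]) ** 2).sum(axis=2),
            axis=1,
        )
        probs = d2 / d2.sum()
        seeds.append(points[rng.choice(n, p=probs)])
    return np.array(seeds)
```

This spreads the initial centroids out, which is what gives k-means++ its better expected results compared to uniform seeding.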
Stopping conditions are:

- A number of iterations equal to `iterations` is performed.
- The centroids do not change between two iterations.
- `energy` goes below a given threshold (todo).
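The energy-threshold condition (the todo) could be sketched like this, assuming `energy` is the per-cluster array of summed squared distances and `threshold` is a user-supplied value:

```python
import numpy as np

def should_stop(energy, threshold):
    # stop when the total within-cluster energy (the sum over all clusters
    # of squared distances to the assigned centroids) falls below threshold
    return float(np.sum(energy)) < threshold
```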
Code:
    import numpy as np
    import random
    import math
    import matplotlib.pyplot as plt
    import matplotlib.cm as cm
    from sklearn.decomposition import PCA as sklearnPCA


    def generate_clusters(k, n, d, max_value=30000, deviation=10000):
        # generate n random points in d dimensions with elements in [-deviation, deviation]
        points = np.random.uniform(low=-deviation, high=deviation, size=(n, d))
        # generate k points in d dimensions with elements in [0, max_value]
        centers = np.random.random((k, d)) * max_value
        # generate clusters: for each point, randomly select a center and add it
        print(points, centers)
        for i, point in enumerate(points):
            points[i] += random.choice(centers)
        return points


    class CLUMPY:

        def __init__(self, k, file=None, n=None, d=None, delimiter=None, iterations=100):
            self.__file = file              # input file
            self.__k = k                    # number of clusters
            self.__iterations = iterations  # number of iterations
            self.__colors = \
                cm.rainbow(np.linspace(0, 1, self.__k))  # colors[i] = color of the i-th cluster
            if file:
                # if file is specified, read points from file
                print("Reading {}...".format(file))
                self.__points = np.loadtxt(file, delimiter=delimiter)  # data points
            else:
                # otherwise generate n clusterized points in d dimensions
                if not n or not d:
                    raise ValueError("missing n={} or d={}".format(n, d))
                self.__n = n
                self.__d = d
                print("Generating {} random points in {} dimensions...".format(n, d))
                self.__points = generate_clusters(k, n, d)
            self.__d = self.__points.shape[1]  # points dimensions
            self.__n = self.__points.shape[0]  # number of data points
            self.__centroids = \
                np.empty((k, self.__d))        # centroids[i] = i-th centroid vector
            # class[i] = j : the i-th data point is assigned to the j-th cluster
            self.__class = np.full(self.__n, -1, dtype=np.int16)
            # energy[i] : energy of the i-th cluster
            self.__energy = np.zeros(k)
            self.__distances = np.zeros(self.__n)
            self.__clusters_size = np.zeros(self.__k, dtype=np.int32)  # number of points assigned to each cluster
            self.__clusters_sum = np.zeros((k, self.__d))  # sum of all vectors assigned to each cluster
            # sanity checks
            if self.__n < k:
                raise ValueError("Number of clusters k={} is larger than number of data points n={}".format(k, self.__n))
            if self.__d < 2:
                raise ValueError("data points must have at least two dimensions")
            print("{} points in {} dimensions.".format(self.__n, self.__d))
            print("Generating seeds...")
            # generate k random indexes
            random_indexes = list(range(self.__n))
            random.shuffle(random_indexes)
            # we decide centroids by randomly picking up data points
            for i in range(k):
                self.__centroids[i] = self.__points[random_indexes[i]]
            self.plot()

        def assign_datapoints(self):
            # for each datapoint
            for i, point in enumerate(self.__points):
                min_distance_index = float('nan')
                min_distance = math.inf
                # for each centroid
                for j, centroid in enumerate(self.__centroids):
                    # compute the euclidean distance between the i-th point and the j-th centroid
                    d = np.linalg.norm(point - centroid)
                    if d < min_distance:
                        min_distance_index = j
                        min_distance = d
                # update cluster assignment and distances
                if not math.isnan(min_distance_index):
                    if self.__class[i] != -1:
                        self.__clusters_size[self.__class[i]] -= 1
                        self.__clusters_sum[self.__class[i]] -= point
                        self.__energy[self.__class[i]] -= self.__distances[i]
                    self.__class[i] = min_distance_index
                    self.__clusters_size[min_distance_index] += 1
                    self.__clusters_sum[min_distance_index] += point
                    self.__distances[i] = min_distance ** 2
                    self.__energy[min_distance_index] += self.__distances[i]

        def plot(self):
            print("Plotting...")
            points = np.concatenate((self.__centroids, self.__points), axis=0)
            if self.__d > 2:
                point_norm = (points - points.min()) / (points.max() - points.min())
                pca = sklearnPCA(n_components=2)  # 2-dimensional PCA
                points = np.array(pca.fit_transform(point_norm))
            for i, (X, Y) in enumerate(points):
                if i < self.__k:
                    plt.scatter(X, Y, c=self.__colors[i], s=100, marker="^")
                else:
                    plt.scatter(X, Y, c=self.__colors[self.__class[i - self.__k]])
            plt.show()

        def cluster(self):
            for iteration in range(self.__iterations):
                print("iteration", iteration)
                # update each assignment: if no point changed its cluster, then we have reached the optimum
                self.assign_datapoints()
                # update centroids
                centroids_unchanged = True
                for i in range(self.__k):
                    new_centroid = self.__clusters_sum[i] / self.__clusters_size[i]
                    centroids_unchanged = centroids_unchanged and np.array_equal(new_centroid, self.__centroids[i])
                    self.__centroids[i] = new_centroid
                if centroids_unchanged:
                    print("Centroids unchanged, terminating...")
                    break
                # self.plot()
            else:
                print("All iterations are finished")
            self.plot()


    if __name__ == "__main__":
        clumpy = CLUMPY(k=5, d=64, n=30000, delimiter=",")
        clumpy.cluster()
Example result:

I used the `main` version above.
1 Answer
First, while `CLUMPY` is a nice name, it doesn't really say what the class does. In fact, looking at the code, the class tries to solve three different problems which, in my opinion, shouldn't be grouped in one class. Creating points, plotting and clustering are three very different responsibilities; that's why you usually use three distinct libraries to do them (e.g. numpy, matplotlib and sklearn). These should be three separate functions in your code, which will permit much more flexibility.
In the `generate_clusters` function, I think you could use a little more documentation. The names `deviation` and `max_value` are confusing and should be renamed. Since you're creating clusters, I think `deviation` should be `max_distance` or `radius` or something like that. Regarding `max_value`, it's not clear that it relates to the maximal coordinate of the centers, but I have a hard time finding a better name right now. Otherwise, I like the creativity used to create the clusters, but I hope you've also tested your clustering code with points that aren't so easily clustered.
Using the `__` prefix clutters your code a lot. I'd settle for a single underscore, personally, or none at all.
You use way too many comments. A comment should be used when the code isn't clear enough by itself, and only if you didn't find a way (or didn't have time, for business reasons) to make it clearer. All the comments next to variable names should be deleted, and, I think, all the others too. Your code is pretty clear; the comments are only in the way.
You allocate a lot of arrays before using them, which would probably cause problems for large `n` values, which are pretty common in the field of data analysis. You should consider reducing your memory footprint by allocating large arrays only when you need them instead of all at construction time.
I've said it in the review, but the main thing that needs work is separating the logic into functions. There are no good reasons to have the file reading, the plotting and the clustering in one object! I'm sure once you've done the work to split the code, you'll find it much clearer.
Comment by Dair (Aug 6, 2019 at 3:47): I've seen at least one library give clustering, file reading, and plotting to one object. It does allow for some very quick-to-start "beginner friendly" prototyping (maybe that is the intended audience?). I would say the really big concern here is no distinction between plotting and data, IMHO.