Return to Question

Commonmark migration

edited Jun 10, 2020 at 13:24

I have implemented the K-Mean clustering Algorithm in Numpy:

from __future__ import division
import numpy as np
def kmean_step(centroids, datapoints):
 ds = centroids[:,np.newaxis]-datapoints
 e_dists = np.sqrt(np.sum(np.square(ds),axis=-1))
 cluster_allocs = np.argmin(e_dists, axis=0)
 clusters = [datapoints[cluster_allocs==ci] for ci in range(len(centroids))]
 new_centroids = np.asarray([(1/len(cl))*np.sum(cl, axis=0) for cl in clusters if len(cl)>0])
 return new_centroids, clusters
def kmean(centroids, datapoints, n_gen, do_print=True):
 clusters=None
 for ii in range(1,n_gen):
 new_centroids, clusters = kmean_step(centroids, datapoints)
 if np.array_equal(centroids,new_centroids):
 print "After %i generations stabalised" % ii
 break
 else:
 centroids = new_centroids
 if do_print:
 print_clusters(clusters,centroids)
 return clusters

I have left out the print_clusters method. It as expected, displays the current clusters using coloured plots in matplotlib.

###Issues

Issues

kmean_step seems really ugly for an algorithm that can be described by:
- Recluster the dataset around the centroids;
- Recenter the centroids with their clusters.
I am a bit unsure about the termination conditions in the kmean function. I have read that they are:
- Centroids don't move,
- or Cluster membership does not change,
- or number of iterations exceeds preset max.
I have the first and last of those, but the second seems like it will happen automatically by the first. If the cluster membership doesn't change, then the center doesn't move. Do I really need to include it? (it is surprisingly hard to implement) Or is it only included in the algorithm description for clarity?
Is this an optimal way to implement it? To fully take advantage of vectorized operations? I am a bit iff about having 2 list comprehensions, particularly since one immediately cases to a ndarrray.

I have implemented the K-Mean clustering Algorithm in Numpy:

from __future__ import division
import numpy as np
def kmean_step(centroids, datapoints):
 ds = centroids[:,np.newaxis]-datapoints
 e_dists = np.sqrt(np.sum(np.square(ds),axis=-1))
 cluster_allocs = np.argmin(e_dists, axis=0)
 clusters = [datapoints[cluster_allocs==ci] for ci in range(len(centroids))]
 new_centroids = np.asarray([(1/len(cl))*np.sum(cl, axis=0) for cl in clusters if len(cl)>0])
 return new_centroids, clusters
def kmean(centroids, datapoints, n_gen, do_print=True):
 clusters=None
 for ii in range(1,n_gen):
 new_centroids, clusters = kmean_step(centroids, datapoints)
 if np.array_equal(centroids,new_centroids):
 print "After %i generations stabalised" % ii
 break
 else:
 centroids = new_centroids
 if do_print:
 print_clusters(clusters,centroids)
 return clusters

I have left out the print_clusters method. It as expected, displays the current clusters using coloured plots in matplotlib.

###Issues

kmean_step seems really ugly for an algorithm that can be described by:
- Recluster the dataset around the centroids;
- Recenter the centroids with their clusters.
I am a bit unsure about the termination conditions in the kmean function. I have read that they are:
- Centroids don't move,
- or Cluster membership does not change,
- or number of iterations exceeds preset max.
I have the first and last of those, but the second seems like it will happen automatically by the first. If the cluster membership doesn't change, then the center doesn't move. Do I really need to include it? (it is surprisingly hard to implement) Or is it only included in the algorithm description for clarity?
Is this an optimal way to implement it? To fully take advantage of vectorized operations? I am a bit iff about having 2 list comprehensions, particularly since one immediately cases to a ndarrray.

I have implemented the K-Mean clustering Algorithm in Numpy:

from __future__ import division
import numpy as np
def kmean_step(centroids, datapoints):
 ds = centroids[:,np.newaxis]-datapoints
 e_dists = np.sqrt(np.sum(np.square(ds),axis=-1))
 cluster_allocs = np.argmin(e_dists, axis=0)
 clusters = [datapoints[cluster_allocs==ci] for ci in range(len(centroids))]
 new_centroids = np.asarray([(1/len(cl))*np.sum(cl, axis=0) for cl in clusters if len(cl)>0])
 return new_centroids, clusters
def kmean(centroids, datapoints, n_gen, do_print=True):
 clusters=None
 for ii in range(1,n_gen):
 new_centroids, clusters = kmean_step(centroids, datapoints)
 if np.array_equal(centroids,new_centroids):
 print "After %i generations stabalised" % ii
 break
 else:
 centroids = new_centroids
 if do_print:
 print_clusters(clusters,centroids)
 return clusters

I have left out the print_clusters method. It as expected, displays the current clusters using coloured plots in matplotlib.

Issues

kmean_step seems really ugly for an algorithm that can be described by:
- Recluster the dataset around the centroids;
- Recenter the centroids with their clusters.
I am a bit unsure about the termination conditions in the kmean function. I have read that they are:
- Centroids don't move,
- or Cluster membership does not change,
- or number of iterations exceeds preset max.
I have the first and last of those, but the second seems like it will happen automatically by the first. If the cluster membership doesn't change, then the center doesn't move. Do I really need to include it? (it is surprisingly hard to implement) Or is it only included in the algorithm description for clarity?
Is this an optimal way to implement it? To fully take advantage of vectorized operations? I am a bit iff about having 2 list comprehensions, particularly since one immediately cases to a ndarrray.

Source Link

edited Aug 31, 2014 at 7:07

Frames Catherine White

edited Aug 31, 2014 at 7:07

Frames Catherine White

1.3k
3
15
27

I have implemented the K-Mean clustering Algorithm in Numpy:

from __future__ import division
import numpy as np
def kmean_step(centroids, datapoints):
 ds = centroids[:,np.newaxis]-datapoints
 e_dists = np.sqrt(np.sum(np.square(ds),axis=-1))
 cluster_allocs = np.argmin(e_dists, axis=0)
 clusters = [datapoints[cluster_allocs==ci] for ci in range(len(centroids))]
 new_centroids = np.asarray([(1/len(cl))*np.sum(cl, axis=0) for cl in clusters if len(cl)>0])
 return new_centroids, clusters
def kmean(centroids, datapoints, n_gen, do_print=True):
 clusters=None
 for ii in range(1,n_gen):
 new_centroids, clusters = kmean_step(centroids, datapoints)
 if np.array_equal(centroids,new_centroids):
 print "After %i generations stabalised" % ii
 break
 else:
 centroids = new_centroids
 if do_print:
 print_clusters(clusters,centroids)
 return clusters

I have left out the print_clusters method. It as expected, displays the current clusters using coloured plots in matplotlib.

###Issues

'kmean_step'kmean_step seems really ugly for an algorithm that can be described by:
- Recluster the dataset around the centroids;
- Recenter the centroids with their clusters.
I am a bit unsure about the termination conditions in the kmean function. I have read that they are:
- Centroids don't move,
- or Cluster membership does not change,
- or number of iterations exceeds preset max.
I have the first and last of those, but the second seems like it will happen automatically by the first. If the cluster membership doesn't change, then the center doesn't move. Do I really need to include it? (it is surprisingly hard to implement) Or is it only included in the algorithm description for clarity?
Is this an optimal way to implement it? To fully take advantage of vectorized operations? I am a bit iff about having 2 list comprehensions, particularly since one immediately cases to a ndarrray.

I have implemented the K-Mean clustering Algorithm in Numpy:

from __future__ import division
import numpy as np
def kmean_step(centroids, datapoints):
 ds = centroids[:,np.newaxis]-datapoints
 e_dists = np.sqrt(np.sum(np.square(ds),axis=-1))
 cluster_allocs = np.argmin(e_dists, axis=0)
 clusters = [datapoints[cluster_allocs==ci] for ci in range(len(centroids))]
 new_centroids = np.asarray([(1/len(cl))*np.sum(cl, axis=0) for cl in clusters if len(cl)>0])
 return new_centroids, clusters
def kmean(centroids, datapoints, n_gen, do_print=True):
 clusters=None
 for ii in range(1,n_gen):
 new_centroids, clusters = kmean_step(centroids, datapoints)
 if np.array_equal(centroids,new_centroids):
 print "After %i generations stabalised" % ii
 break
 else:
 centroids = new_centroids
 if do_print:
 print_clusters(clusters,centroids)
 return clusters

I have left out the print_clusters method. It as expected, displays the current clusters using coloured plots in matplotlib.

###Issues

'kmean_step' seems really ugly for an algorithm that can be described by:
- Recluster the dataset around the centroids;
- Recenter the centroids with their clusters.
I am a bit unsure about the termination conditions in the kmean function. I have read that they are:
- Centroids don't move,
- or Cluster membership does not change,
- or number of iterations exceeds preset max.
I have the first and last of those, but the second seems like it will happen automatically by the first. If the cluster membership doesn't change, then the center doesn't move. Do I really need to include it? (it is surprisingly hard to implement) Or is it only included in the algorithm description for clarity?
Is this an optimal way to implement it? To fully take advantage of vectorized operations? I am a bit iff about having 2 list comprehensions, particularly since one immediately cases to a ndarrray.

I have implemented the K-Mean clustering Algorithm in Numpy:

from __future__ import division
import numpy as np
def kmean_step(centroids, datapoints):
 ds = centroids[:,np.newaxis]-datapoints
 e_dists = np.sqrt(np.sum(np.square(ds),axis=-1))
 cluster_allocs = np.argmin(e_dists, axis=0)
 clusters = [datapoints[cluster_allocs==ci] for ci in range(len(centroids))]
 new_centroids = np.asarray([(1/len(cl))*np.sum(cl, axis=0) for cl in clusters if len(cl)>0])
 return new_centroids, clusters
def kmean(centroids, datapoints, n_gen, do_print=True):
 clusters=None
 for ii in range(1,n_gen):
 new_centroids, clusters = kmean_step(centroids, datapoints)
 if np.array_equal(centroids,new_centroids):
 print "After %i generations stabalised" % ii
 break
 else:
 centroids = new_centroids
 if do_print:
 print_clusters(clusters,centroids)
 return clusters

I have left out the print_clusters method. It as expected, displays the current clusters using coloured plots in matplotlib.

###Issues

kmean_step seems really ugly for an algorithm that can be described by:
- Recluster the dataset around the centroids;
- Recenter the centroids with their clusters.
I am a bit unsure about the termination conditions in the kmean function. I have read that they are:
- Centroids don't move,
- or Cluster membership does not change,
- or number of iterations exceeds preset max.
I have the first and last of those, but the second seems like it will happen automatically by the first. If the cluster membership doesn't change, then the center doesn't move. Do I really need to include it? (it is surprisingly hard to implement) Or is it only included in the algorithm description for clarity?
Is this an optimal way to implement it? To fully take advantage of vectorized operations? I am a bit iff about having 2 list comprehensions, particularly since one immediately cases to a ndarrray.