Skip to main content
Code Review

Return to Question

Commonmark migration
Source Link

I have implemented the K-Mean clustering Algorithm in Numpy:

from __future__ import division
import numpy as np
def kmean_step(centroids, datapoints):
 ds = centroids[:,np.newaxis]-datapoints
 e_dists = np.sqrt(np.sum(np.square(ds),axis=-1))
 cluster_allocs = np.argmin(e_dists, axis=0)
 clusters = [datapoints[cluster_allocs==ci] for ci in range(len(centroids))]
 new_centroids = np.asarray([(1/len(cl))*np.sum(cl, axis=0) for cl in clusters if len(cl)>0])
 return new_centroids, clusters
def kmean(centroids, datapoints, n_gen, do_print=True):
 clusters=None
 for ii in range(1,n_gen):
 new_centroids, clusters = kmean_step(centroids, datapoints)
 if np.array_equal(centroids,new_centroids):
 print "After %i generations stabalised" % ii
 break
 else:
 centroids = new_centroids
 if do_print:
 print_clusters(clusters,centroids)
 return clusters

I have left out the print_clusters method. It as expected, displays the current clusters using coloured plots in matplotlib.

###Issues

Issues

  • kmean_step seems really ugly for an algorithm that can be described by:
    • Recluster the dataset around the centroids;
    • Recenter the centroids with their clusters.
  • I am a bit unsure about the termination conditions in the kmean function. I have read that they are:
    • Centroids don't move,
    • or Cluster membership does not change,
    • or number of iterations exceeds preset max.
  • I have the first and last of those, but the second seems like it will happen automatically by the first. If the cluster membership doesn't change, then the center doesn't move. Do I really need to include it? (it is surprisingly hard to implement) Or is it only included in the algorithm description for clarity?
  • Is this an optimal way to implement it? To fully take advantage of vectorized operations? I am a bit iff about having 2 list comprehensions, particularly since one immediately cases to a ndarrray.

I have implemented the K-Mean clustering Algorithm in Numpy:

from __future__ import division
import numpy as np
def kmean_step(centroids, datapoints):
 ds = centroids[:,np.newaxis]-datapoints
 e_dists = np.sqrt(np.sum(np.square(ds),axis=-1))
 cluster_allocs = np.argmin(e_dists, axis=0)
 clusters = [datapoints[cluster_allocs==ci] for ci in range(len(centroids))]
 new_centroids = np.asarray([(1/len(cl))*np.sum(cl, axis=0) for cl in clusters if len(cl)>0])
 return new_centroids, clusters
def kmean(centroids, datapoints, n_gen, do_print=True):
 clusters=None
 for ii in range(1,n_gen):
 new_centroids, clusters = kmean_step(centroids, datapoints)
 if np.array_equal(centroids,new_centroids):
 print "After %i generations stabalised" % ii
 break
 else:
 centroids = new_centroids
 if do_print:
 print_clusters(clusters,centroids)
 return clusters

I have left out the print_clusters method. It as expected, displays the current clusters using coloured plots in matplotlib.

###Issues

  • kmean_step seems really ugly for an algorithm that can be described by:
    • Recluster the dataset around the centroids;
    • Recenter the centroids with their clusters.
  • I am a bit unsure about the termination conditions in the kmean function. I have read that they are:
    • Centroids don't move,
    • or Cluster membership does not change,
    • or number of iterations exceeds preset max.
  • I have the first and last of those, but the second seems like it will happen automatically by the first. If the cluster membership doesn't change, then the center doesn't move. Do I really need to include it? (it is surprisingly hard to implement) Or is it only included in the algorithm description for clarity?
  • Is this an optimal way to implement it? To fully take advantage of vectorized operations? I am a bit iff about having 2 list comprehensions, particularly since one immediately cases to a ndarrray.

I have implemented the K-Mean clustering Algorithm in Numpy:

from __future__ import division
import numpy as np
def kmean_step(centroids, datapoints):
 ds = centroids[:,np.newaxis]-datapoints
 e_dists = np.sqrt(np.sum(np.square(ds),axis=-1))
 cluster_allocs = np.argmin(e_dists, axis=0)
 clusters = [datapoints[cluster_allocs==ci] for ci in range(len(centroids))]
 new_centroids = np.asarray([(1/len(cl))*np.sum(cl, axis=0) for cl in clusters if len(cl)>0])
 return new_centroids, clusters
def kmean(centroids, datapoints, n_gen, do_print=True):
 clusters=None
 for ii in range(1,n_gen):
 new_centroids, clusters = kmean_step(centroids, datapoints)
 if np.array_equal(centroids,new_centroids):
 print "After %i generations stabalised" % ii
 break
 else:
 centroids = new_centroids
 if do_print:
 print_clusters(clusters,centroids)
 return clusters

I have left out the print_clusters method. It as expected, displays the current clusters using coloured plots in matplotlib.

Issues

  • kmean_step seems really ugly for an algorithm that can be described by:
    • Recluster the dataset around the centroids;
    • Recenter the centroids with their clusters.
  • I am a bit unsure about the termination conditions in the kmean function. I have read that they are:
    • Centroids don't move,
    • or Cluster membership does not change,
    • or number of iterations exceeds preset max.
  • I have the first and last of those, but the second seems like it will happen automatically by the first. If the cluster membership doesn't change, then the center doesn't move. Do I really need to include it? (it is surprisingly hard to implement) Or is it only included in the algorithm description for clarity?
  • Is this an optimal way to implement it? To fully take advantage of vectorized operations? I am a bit iff about having 2 list comprehensions, particularly since one immediately cases to a ndarrray.
.
Source Link

I have implemented the K-Mean clustering Algorithm in Numpy:

from __future__ import division
import numpy as np
def kmean_step(centroids, datapoints):
 ds = centroids[:,np.newaxis]-datapoints
 e_dists = np.sqrt(np.sum(np.square(ds),axis=-1))
 cluster_allocs = np.argmin(e_dists, axis=0)
 clusters = [datapoints[cluster_allocs==ci] for ci in range(len(centroids))]
 new_centroids = np.asarray([(1/len(cl))*np.sum(cl, axis=0) for cl in clusters if len(cl)>0])
 return new_centroids, clusters
def kmean(centroids, datapoints, n_gen, do_print=True):
 clusters=None
 for ii in range(1,n_gen):
 new_centroids, clusters = kmean_step(centroids, datapoints)
 if np.array_equal(centroids,new_centroids):
 print "After %i generations stabalised" % ii
 break
 else:
 centroids = new_centroids
 if do_print:
 print_clusters(clusters,centroids)
 return clusters

I have left out the print_clusters method. It as expected, displays the current clusters using coloured plots in matplotlib.

###Issues

  • 'kmean_step'kmean_step seems really ugly for an algorithm that can be described by:
    • Recluster the dataset around the centroids;
    • Recenter the centroids with their clusters.
  • I am a bit unsure about the termination conditions in the kmean function. I have read that they are:
    • Centroids don't move,
    • or Cluster membership does not change,
    • or number of iterations exceeds preset max.
  • I have the first and last of those, but the second seems like it will happen automatically by the first. If the cluster membership doesn't change, then the center doesn't move. Do I really need to include it? (it is surprisingly hard to implement) Or is it only included in the algorithm description for clarity?
  • Is this an optimal way to implement it? To fully take advantage of vectorized operations? I am a bit iff about having 2 list comprehensions, particularly since one immediately cases to a ndarrray.

I have implemented the K-Mean clustering Algorithm in Numpy:

from __future__ import division
import numpy as np
def kmean_step(centroids, datapoints):
 ds = centroids[:,np.newaxis]-datapoints
 e_dists = np.sqrt(np.sum(np.square(ds),axis=-1))
 cluster_allocs = np.argmin(e_dists, axis=0)
 clusters = [datapoints[cluster_allocs==ci] for ci in range(len(centroids))]
 new_centroids = np.asarray([(1/len(cl))*np.sum(cl, axis=0) for cl in clusters if len(cl)>0])
 return new_centroids, clusters
def kmean(centroids, datapoints, n_gen, do_print=True):
 clusters=None
 for ii in range(1,n_gen):
 new_centroids, clusters = kmean_step(centroids, datapoints)
 if np.array_equal(centroids,new_centroids):
 print "After %i generations stabalised" % ii
 break
 else:
 centroids = new_centroids
 if do_print:
 print_clusters(clusters,centroids)
 return clusters

I have left out the print_clusters method. It as expected, displays the current clusters using coloured plots in matplotlib.

###Issues

  • 'kmean_step' seems really ugly for an algorithm that can be described by:
    • Recluster the dataset around the centroids;
    • Recenter the centroids with their clusters.
  • I am a bit unsure about the termination conditions in the kmean function. I have read that they are:
    • Centroids don't move,
    • or Cluster membership does not change,
    • or number of iterations exceeds preset max.
  • I have the first and last of those, but the second seems like it will happen automatically by the first. If the cluster membership doesn't change, then the center doesn't move. Do I really need to include it? (it is surprisingly hard to implement) Or is it only included in the algorithm description for clarity?
  • Is this an optimal way to implement it? To fully take advantage of vectorized operations? I am a bit iff about having 2 list comprehensions, particularly since one immediately cases to a ndarrray.

I have implemented the K-Mean clustering Algorithm in Numpy:

from __future__ import division
import numpy as np
def kmean_step(centroids, datapoints):
 ds = centroids[:,np.newaxis]-datapoints
 e_dists = np.sqrt(np.sum(np.square(ds),axis=-1))
 cluster_allocs = np.argmin(e_dists, axis=0)
 clusters = [datapoints[cluster_allocs==ci] for ci in range(len(centroids))]
 new_centroids = np.asarray([(1/len(cl))*np.sum(cl, axis=0) for cl in clusters if len(cl)>0])
 return new_centroids, clusters
def kmean(centroids, datapoints, n_gen, do_print=True):
 clusters=None
 for ii in range(1,n_gen):
 new_centroids, clusters = kmean_step(centroids, datapoints)
 if np.array_equal(centroids,new_centroids):
 print "After %i generations stabalised" % ii
 break
 else:
 centroids = new_centroids
 if do_print:
 print_clusters(clusters,centroids)
 return clusters

I have left out the print_clusters method. It as expected, displays the current clusters using coloured plots in matplotlib.

###Issues

  • kmean_step seems really ugly for an algorithm that can be described by:
    • Recluster the dataset around the centroids;
    • Recenter the centroids with their clusters.
  • I am a bit unsure about the termination conditions in the kmean function. I have read that they are:
    • Centroids don't move,
    • or Cluster membership does not change,
    • or number of iterations exceeds preset max.
  • I have the first and last of those, but the second seems like it will happen automatically by the first. If the cluster membership doesn't change, then the center doesn't move. Do I really need to include it? (it is surprisingly hard to implement) Or is it only included in the algorithm description for clarity?
  • Is this an optimal way to implement it? To fully take advantage of vectorized operations? I am a bit iff about having 2 list comprehensions, particularly since one immediately cases to a ndarrray.
edited tags
Link
200_success
  • 145.5k
  • 22
  • 190
  • 478
Fixed mistake in finding number of generations
Source Link
Loading
.
Source Link
Loading
edited tags
Link
Jamal
  • 35.2k
  • 13
  • 134
  • 238
Loading
fix grammatial errors
Source Link
Phrancis
  • 20.5k
  • 6
  • 69
  • 155
Loading
Source Link
Loading
lang-py

AltStyle によって変換されたページ (->オリジナル) /