I've implemented the K-Means clustering algorithm in Python 2, and I wanted to know what remarks you guys could make regarding my code. I've included a small test set with 2D vectors and 2 classes, but it works with higher dimensions and more classes. Here, it should sort all the elements starting with the same letter into the same class (except ak, which is quite in between).
What I'm interested in:
- Did I do something unpythonic?
- Are there unclear or unnecessary parts?
- What changes could drastically improve the code's performance?
```python
from __future__ import print_function
from operator import add, sub
from collections import defaultdict
import random as rd
import math


class Element:
    def __init__(self, idf, data, cls='ukn'):
        self.idf = idf
        self.data = data
        self.cls = cls

    def __str__(self):
        return '%s: %s -> %s' % (str(self.idf), str(self.data), str(self.cls))

    def eucl_dist_to(self, other):
        diff = map(sub, self.data, other.data)
        norm2 = math.sqrt(sum([math.pow(coord, 2) for coord in diff]))
        return norm2


def main():
    test_data_tuple = [('aa', 1, 7), ('ab', 1, 8), ('ac', 1, 9), ('ba', 2, 1), ('bb', 2, 3), ('ad', 2, 8), ('ae', 2, 9),
                       ('bc', 3, 2), ('bd', 3, 4), ('af', 3, 7), ('be', 4, 2), ('bf', 4, 3), ('ag', 4, 8),
                       ('bg', 5, 1), ('bh', 5, 3), ('ah', 5, 6), ('bi', 6, 2), ('bj', 6, 4), ('ai', 6, 6), ('aj', 6, 7),
                       ('bk', 7, 0), ('bl', 7, 2), ('bm', 7, 3), ('ak', 7, 5)]
    test_data_elt = list()
    for t in test_data_tuple:
        test_data_elt.append(Element(t[0], tuple(t[1:])))
    # test_data_elt.sort(key=lambda e: e.idf)

    results = k_means(test_data_elt, 2, 5)


def k_means(elts, nb_classes, nb_steps):
    k = nb_classes

    # init
    centroids = list()
    cls_content = defaultdict(list)
    for cls_nb in range(k):
        cls_name = 'class_' + str(cls_nb)
        centroids.append(Element(cls_name, rd.choice(elts).data, cls='centroid'))
        cls_content[cls_name] = list()

    for itr in range(nb_steps):
        # assign
        cls_content.clear()
        for elt in elts:
            min_dist = float('inf')
            for c in centroids:
                dist = c.eucl_dist_to(elt)
                if dist < min_dist:
                    min_dist = dist
                    best_class = c.idf
            cls_content[best_class].append(elt)
            elt.cls = best_class

        # adjust
        for c in centroids:
            elts_in = list(cls_content[c.idf])
            if len(elts_in) == 0:
                sum_elts_data = 0
            else:
                sum_elts_data = list(elts_in[0].data)
                for elt in elts_in[1:]:
                    sum_elts_data = map(add, sum_elts_data, elt.data)
                sum_elts_data[:] = [e / float(len(elts_in)) for e in sum_elts_data]
                c.data = tuple(sum_elts_data)

        print('\nIteration %d' % itr)
        for c in centroids:
            cls_name = c.idf
            print('\n' + str(c))
            for e in cls_content[cls_name]:
                print(e)

    return cls_content


if __name__ == '__main__':
    main()
```
1 Answer
> Did I do something unpythonic?
Consider using plural identifiers: `test_data_tuples`, `test_data_elts`. Later, for example, you use a perfectly lovely `for elt in elts:` idiom.
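As a side note, plural names read especially well next to a list comprehension. A minimal sketch, reusing the question's `Element` class, with the test data trimmed for brevity:

```python
# Assumes the Element class from the question is in scope.
test_data_tuples = [('aa', 1, 7), ('ab', 1, 8), ('ba', 2, 1)]  # trimmed for brevity
test_data_elts = [Element(t[0], tuple(t[1:])) for t in test_data_tuples]
```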
In `Element.__init__`, I wouldn't mind seeing `cls='unknown'` spelled out, but I am really missing a docstring that explains what the heck `idf` stands for. Also, as near as I can tell, `data` really is a coord, e.g. `(x, y)`.
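A minimal sketch of what such a docstring might look like; the wording is my guess at the intended meanings, inferred from the test set:

```python
class Element:
    def __init__(self, idf, data, cls='unknown'):
        """One data point to be clustered.

        idf  -- identifier/label for this point, e.g. 'aa' (assumed meaning)
        data -- coordinate tuple, e.g. (x, y)
        cls  -- name of the class the point is currently assigned to
        """
        self.idf = idf
        self.data = data
        self.cls = cls
```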
Kudos on `k = nb_classes`. I feel you did a good job of making the public API very clear, and then within the body `k` is conventional.
I can't say I'm a fan of `import random as rd`. It needlessly obscures the fairly obvious meaning of `rd.choice(elts).data`.
The `for itr ...` loop exhibits admirable clarity.
Testing `if len(elts_in) == 0:`, plus the `map`, seems like it's maybe a bit clunky. I'm just not understanding why empty is a special case; perhaps we could use `sum` and a `lambda`, or a helper function? It seems like we could always find something sensible to assign to `c.data`, even in the empty case. Or maybe what I'm really looking for is an `if` that rapidly moves on to a `continue` (see the sketch below).
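A minimal sketch of that last idea, assuming an empty cluster should simply keep its previous centroid; the `mean_of` helper and that policy are my additions, not part of the original code:

```python
def mean_of(elts):
    """Component-wise mean of the elements' coordinate tuples (assumes elts is non-empty)."""
    n = float(len(elts))
    return tuple(sum(coords) / n for coords in zip(*(e.data for e in elts)))

# adjust
for c in centroids:
    elts_in = cls_content[c.idf]
    if not elts_in:
        continue  # empty cluster: leave the centroid where it was
    c.data = mean_of(elts_in)
```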
The `print('\n'+str(c))` is not very pythonic; I'd prefer `print('\n%s' % c)`. OK, that's the last of the tiny nits I will pick.
Overall, it looks like fairly solid code.
> What changes could drastically improve the code's performance?
Sorry, don't see anything obvious that would help.
Comment: why not `scipy.cluster.vq.kmeans`?
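For reference, a minimal sketch of what that comment points at; this illustration is mine, assuming NumPy and SciPy are installed, and is not part of the original post:

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

# Coordinates only; k-means itself does not need the string labels.
points = np.array([(1, 7), (1, 8), (1, 9), (2, 1), (2, 3)], dtype=float)

centroids, distortion = kmeans(points, 2)  # fit 2 cluster centroids
labels, _ = vq(points, centroids)          # nearest-centroid index for each point
```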