Simple k-means implementation using Python3 and Pandas

Question 1

Is there anything I can improve? The distance function is Pearson correlation.

import os
import pandas as pd
import numpy as np
from pandas import Series, DataFrame
def corrpairs(df1, df2):
    """
    Pairwise correlation for columns of two data frames
    :param df1:
    :type df1:
    :param df2:
    :type df2:
    :return:
    :rtype: pandas.core.frame.DataFrame
    """
    return df1.apply(lambda x: df2.corrwith(x))
import pdb
def kcluster(cols, k=4):
    """
    K Means clustering algorithm, applied to columns of a data frame.
    Using Pearson correlation as the distance function.
    :param rows:
    :type rows: pandas.core.frame.DataFrame
    :param k:
    :type k: int
    :return:
    :rtype: list[int]
    """
    cols = cols.astype(float)
    nrow, ncol = cols.shape
    nuclear0 = cols.iloc[:, :k]
    nuclear0.columns = range(k)
    nuclear0 += np.random.randn(np.prod(nuclear0.shape)).reshape(nuclear0.shape)
    correlations = corrpairs(cols, nuclear0)
    groups = correlations.idxmax(axis=0)
    nuclear1 = []
    for i in range(k):
        sub_cols = cols.loc[:, groups == i]
        sub_mean = sub_cols.mean(axis=1)
        nuclear1.append(sub_mean)
    nuclear1 = pd.concat(nuclear1, axis=1)
    while ((nuclear0 - nuclear1).abs() > 0.00001).any().any():
        print(nuclear0)
        print(nuclear1)
        print((nuclear0 - nuclear1).abs())
        nuclear0 = nuclear1
        correlations = corrpairs(cols, nuclear0)
        groups = correlations.idxmax(axis=0)
        nuclear1 = []
        for i in range(k):
            sub_cols = cols.loc[:, groups == i]
            sub_mean = sub_cols.mean(axis=1)
            nuclear1.append(sub_mean)
        nuclear1 = pd.concat(nuclear1, axis=1)
    return groups

Question 2

Are you sure that using Pearson correlation with K-means is a good idea? See here
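
The linked discussion is not reproduced here, so what follows is only a sketch of one relationship that often comes up with this question, not a summary of the link. K-means updates each centre with the arithmetic mean, which minimises squared Euclidean distance. For vectors that have been z-scored (population standard deviation), squared Euclidean distance and Pearson correlation are tied together by ||zx - zy||^2 = 2n(1 - r). The helper name zscore and the sample data are made up for illustration.

```python
import numpy as np


def zscore(values):
    """Centre to mean 0 and scale to population standard deviation 1."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()  # std() uses ddof=0


# Made-up sample data, only for illustration.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 0.6 * x + rng.normal(size=50)

r = np.corrcoef(x, y)[0, 1]                      # Pearson correlation
sq_dist = np.sum((zscore(x) - zscore(y)) ** 2)   # squared Euclidean distance
expected = 2 * len(x) * (1 - r)                  # 2n(1 - r)

print(sq_dist, expected)
```

If that relationship is what you are after, z-scoring each column first and then running an ordinary Euclidean k-means may be worth comparing against the correlation-based assignment.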

Question 3

You have a few problems with your documentation.

First off, your documentation is incomplete in a few places.

For example, in your corrpairs function, you didn't fill in any of your documentation, except for the rtype part.

And, in your kcluster function, you only filled in type rows, type k, and rtype.

Finally, also in kcluster, you called the parameter "rows" in the documentation and called it "cols" in the function signature. Choose one and stick with it.

Documentation is a very important part of every function.
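
A filled-in docstring for corrpairs might look like the sketch below. The descriptions are inferred from what the code in the question does, so treat them as assumptions rather than the author's intent. The small left and right frames are made up to show the shape of the result.

```python
import pandas as pd


def corrpairs(df1, df2):
    """
    Pairwise Pearson correlation between the columns of two data frames.

    :param df1: frame whose columns become the columns of the result
    :type df1: pandas.core.frame.DataFrame
    :param df2: frame whose columns become the rows of the result
    :type df2: pandas.core.frame.DataFrame
    :return: correlation of every column of df2 with every column of df1
    :rtype: pandas.core.frame.DataFrame
    """
    return df1.apply(lambda col: df2.corrwith(col))


# Made-up input: "a" rises with "x", "b" falls with "x".
left = pd.DataFrame({"a": [1, 2, 3, 4], "b": [4, 3, 2, 1]})
right = pd.DataFrame({"x": [2, 4, 6, 8]})
result = corrpairs(left, right)
print(result)
```

With every field filled in, a reader can tell which frame ends up on which axis without running the code.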

import pdb
def kcluster(cols, k=4):

You should not have an import in the middle of your code; all the importing should be done at the very top of your code like you were doing before.
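
As a rough sketch of how the module could be laid out with every import at the top: the unused pdb, os, Series and DataFrame imports are dropped, and the docstring uses the name cols throughout. The helper centroids_for and the tol and max_iter parameters are hypothetical additions that go beyond this review. They only remove the duplicated update loop and guard against an endless loop.

```python
import numpy as np
import pandas as pd


def corrpairs(df1, df2):
    """Correlate every column of df2 with every column of df1."""
    return df1.apply(lambda col: df2.corrwith(col))


def centroids_for(cols, groups, k):
    """Mean column of each group (hypothetical helper, not in the question)."""
    means = [cols.loc[:, groups == i].mean(axis=1) for i in range(k)]
    return pd.concat(means, axis=1)


def kcluster(cols, k=4, tol=1e-5, max_iter=100):
    """
    K-means style clustering of the columns of a data frame,
    assigning each column to the centre it correlates with most.

    :param cols: data whose columns are to be clustered
    :type cols: pandas.core.frame.DataFrame
    :param k: number of clusters
    :type k: int
    :param tol: stop once no centre value moves by more than this (assumption)
    :type tol: float
    :param max_iter: upper bound on the number of passes (assumption)
    :type max_iter: int
    :return: cluster label for each column of cols
    :rtype: pandas.core.series.Series
    """
    cols = cols.astype(float)
    # Copy so the noise below does not touch the caller's data.
    centroids = cols.iloc[:, :k].copy()
    centroids.columns = range(k)
    centroids += np.random.randn(*centroids.shape)
    for _ in range(max_iter):
        groups = corrpairs(cols, centroids).idxmax(axis=0)
        updated = centroids_for(cols, groups, k)
        if not ((centroids - updated).abs() > tol).any().any():
            break
        centroids = updated
    return groups


# Made-up data, only to show the call.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(20, 6)), columns=list("abcdef"))
groups = kcluster(data, k=2)
print(groups)
```

Note that idxmax returns a pandas Series, so the rtype in the docstring is written as a Series here rather than list[int].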

This review was just meant to point out practices. I had trouble understanding the content of the code (better documentation probably would have helped).

SirPython · Answer 1 · 2015-07-01 23:49:40Z

