\$\begingroup\$

Is there anything I can improve? The distance function is Pearson correlation.

import os
import pandas as pd
import numpy as np
from pandas import Series, DataFrame
def corrpairs(df1, df2):
    """
    Pairwise correlation for columns of two data frames
    :param df1:
    :type df1:
    :param df2:
    :type df2:
    :return:
    :rtype: pandas.core.frame.DataFrame
    """
    return df1.apply(lambda x: df2.corrwith(x))

import pdb
def kcluster(cols, k=4):
    """
    K Means clustering algorithm, applied to columns of a data frame.
    Using Pearson correlation as the distance function.
    :param rows:
    :type rows: pandas.core.frame.DataFrame
    :param k:
    :type k: int
    :return:
    :rtype: list[int]
    """
    cols = cols.astype(float)
    nrow, ncol = cols.shape
    nuclear0 = cols.iloc[:, :k]
    nuclear0.columns = range(k)
    nuclear0 += np.random.randn(np.prod(nuclear0.shape)).reshape(nuclear0.shape)
    correlations = corrpairs(cols, nuclear0)
    groups = correlations.idxmax(axis=0)
    nuclear1 = []
    for i in range(k):
        sub_cols = cols.loc[:, groups == i]
        sub_mean = sub_cols.mean(axis=1)
        nuclear1.append(sub_mean)
    nuclear1 = pd.concat(nuclear1, axis=1)
    while ((nuclear0 - nuclear1).abs() > 0.00001).any().any():
        print(nuclear0)
        print(nuclear1)
        print((nuclear0 - nuclear1).abs())
        nuclear0 = nuclear1
        correlations = corrpairs(cols, nuclear0)
        groups = correlations.idxmax(axis=0)
        nuclear1 = []
        for i in range(k):
            sub_cols = cols.loc[:, groups == i]
            sub_mean = sub_cols.mean(axis=1)
            nuclear1.append(sub_mean)
        nuclear1 = pd.concat(nuclear1, axis=1)
    return groups
Jamal
asked Jan 6, 2015 at 23:54
\$\endgroup\$
  • \$\begingroup\$ Are you sure that using Pearson correlation with K-means is a good idea? See here \$\endgroup\$ Commented Jan 8, 2015 at 6:33
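One way to make this concern concrete: k-means averages points and minimises squared Euclidean distance, whereas the posted code assigns columns by correlation. The two can be reconciled if each column is standardised first, because for two z-scored vectors of length n the squared Euclidean distance equals 2n(1 - r), where r is their Pearson correlation. This is an assumption about what the comment is getting at, not a summary of the linked post. A minimal numerical check, with made-up data and a hypothetical helper `zscore`:

```python
import numpy as np

# Made-up data purely for illustration.
rng = np.random.default_rng(0)
x = rng.standard_normal(50)
y = 0.5 * x + rng.standard_normal(50)


def zscore(v):
    """Standardise to zero mean and unit population variance (ddof=0)."""
    return (v - v.mean()) / v.std()


r = np.corrcoef(x, y)[0, 1]                      # Pearson correlation
sq_dist = np.sum((zscore(x) - zscore(y)) ** 2)   # squared Euclidean distance
identity_holds = np.isclose(sq_dist, 2 * len(x) * (1 - r))
print(identity_holds)
```

If the identity holds, running ordinary k-means on z-scored columns ranks neighbours the same way as ranking by highest correlation.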

1 Answer

\$\begingroup\$

You have a few problems with your documentation.


First off, your documentation is incomplete in a few places.

For example, in corrpairs, every docstring field is empty except :rtype:.

In kcluster, only :type rows:, :type k:, and :rtype: are filled in; the parameter descriptions and :return: are blank.

Also in kcluster, the docstring calls the parameter "rows" while the function signature calls it "cols". Choose one and stick with it.

Documentation is a very important part of every function.
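As a rough sketch of what a filled-in docstring for corrpairs could look like: the parameter descriptions below are my reading of what the code does, not something stated in the question, and the small data frames are made up for the usage example.

```python
import pandas as pd


def corrpairs(df1, df2):
    """
    Pairwise Pearson correlation between the columns of two data frames.

    :param df1: data frame whose columns become the columns of the result
    :type df1: pandas.DataFrame
    :param df2: data frame whose columns become the rows of the result
    :type df2: pandas.DataFrame
    :return: correlation of every column of df2 with every column of df1
    :rtype: pandas.DataFrame
    """
    return df1.apply(lambda x: df2.corrwith(x))


# Made-up example data: "a" rises with "x", "b" falls as "x" rises.
df1 = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [4.0, 3.0, 2.0, 1.0]})
df2 = pd.DataFrame({"x": [1.0, 2.0, 3.0, 5.0]})
result = corrpairs(df1, df2)  # one row ("x"), two columns ("a", "b")
print(result)
```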


import pdb
def kcluster(cols, k=4):

You should not have an import in the middle of your code; all the importing should be done at the very top of your code like you were doing before.
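A sketch of how the top of the module might look with the imports grouped together. Here `pdb` is left out on the assumption that it was a debugging leftover, since nothing in the posted code calls it; the same goes for `os`, `Series` and `DataFrame`. The helper `assign_and_average` is a hypothetical name of mine that wraps the assign-then-average step, which appears twice in kcluster; it is one possible arrangement, not the only one.

```python
import numpy as np
import pandas as pd


def corrpairs(df1, df2):
    """Pairwise correlation for columns of two data frames."""
    return df1.apply(lambda x: df2.corrwith(x))


def assign_and_average(cols, centers):
    """Hypothetical helper: one assignment step followed by one averaging step.

    Each column of `cols` is assigned to the column of `centers` it
    correlates with most strongly; the new centers are the row-wise means
    of the columns assigned to each group.
    """
    groups = corrpairs(cols, centers).idxmax(axis=0)
    means = [cols.loc[:, groups == i].mean(axis=1) for i in centers.columns]
    return groups, pd.concat(means, axis=1)


# Made-up data: "a" and "b" rise together, "c" and "d" fall together.
cols = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 11.0],
    "c": [5.0, 4.0, 3.0, 2.0, 1.0],
    "d": [10.0, 8.0, 6.0, 4.0, 1.0],
})
centers = pd.DataFrame({0: [1.0, 2.0, 3.0, 4.0, 5.0],
                        1: [5.0, 4.0, 3.0, 2.0, 1.0]})
groups, new_centers = assign_and_average(cols, centers)
print(groups)
```

With a helper like this, the body of kcluster would reduce to the random initialisation plus a loop that calls the helper until the centers stop moving.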


This review was just meant to point out general practices. I had trouble understanding what the code does (better documentation probably would have helped).

answered Jul 1, 2015 at 23:49
\$\endgroup\$
