3
\$\begingroup\$

I am new to Python/Pandas. Consider the following code:

import pandas as pd
import numpy as np
df = pd.DataFrame({'Time': [0.0, 1.0, 2.0, 0.0, 1.0, 2.0, 0.0, 2.0, 0.0, 1.0, 2.0],
                   'Id': [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4],
                   'A': [10, 15, np.nan, 11, 16, 25, 10, 15, 9, 14, 19]})
print(df)

Output:

       A  Id  Time
0   10.0   1   0.0
1   15.0   1   1.0
2    NaN   1   2.0
3   11.0   2   0.0
4   16.0   2   1.0
5   25.0   2   2.0
6   10.0   3   0.0
7   15.0   3   2.0
8    9.0   4   0.0
9   14.0   4   1.0
10  19.0   4   2.0

I want to add a column Feature_1 which, for each row of the dataframe, holds the median of column A over ALL the rows that have the same Time value. This can be done as follows:

df['Feature_1'] = df.groupby('Time')['A'].transform('median')
print(df)

Output:

       A  Id  Time  Feature_1
0   10.0   1   0.0       10.0
1   15.0   1   1.0       15.0
2    NaN   1   2.0       19.0
3   11.0   2   0.0       10.0
4   16.0   2   1.0       15.0
5   25.0   2   2.0       19.0
6   10.0   3   0.0       10.0
7   15.0   3   2.0       19.0
8    9.0   4   0.0       10.0
9   14.0   4   1.0       15.0
10  19.0   4   2.0       19.0

My problem is now to compute another feature, Feature_2, which, for each row of the dataframe, holds the median of column A over the OTHER rows that have the same Time value (that is, the same Time but a different Id). I was not able to vectorize this, so my solution uses a for loop:

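df['Feature_2'] = np.nan
for i in range(len(df)):
    current_Id = df.Id[i]
    current_time = df.Time[i]
    # Rows with the same Time value that belong to a different Id
    idx = (df.Time == current_time) & (df.Id != current_Id)
    if idx.any():
        # .loc avoids chained assignment, which may not write back to df
        df.loc[i, 'Feature_2'] = df.A[idx].median()
print(df)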

Output:

       A  Id  Time  Feature_1  Feature_2
0   10.0   1   0.0       10.0       10.0
1   15.0   1   1.0       15.0       15.0
2    NaN   1   2.0       19.0       19.0
3   11.0   2   0.0       10.0       10.0
4   16.0   2   1.0       15.0       14.5
5   25.0   2   2.0       19.0       17.0
6   10.0   3   0.0       10.0       10.0
7   15.0   3   2.0       19.0       22.0
8    9.0   4   0.0       10.0       10.0
9   14.0   4   1.0       15.0       15.5
10  19.0   4   2.0       19.0       20.0

This works, but it is very slow because my dataframe has 1 million rows (though only four different IDs).

Is it possible to vectorize the creation of Feature_2?

I hope I am clear enough. Live code can be found here.

200_success
asked May 14, 2018 at 13:06
\$\endgroup\$

1 Answer

4
\$\begingroup\$

So, you want to get the medians of the groups by removing each value from the group in turn:

         group        individual removal of values
          NaN         [  ]    NaN    NaN    NaN
         25.0    =>   25.0   [  ]   25.0   25.0
         15.0         15.0   15.0   [  ]   15.0
         19.0         19.0   19.0   19.0   [  ]
median   19.0         19.0   17.0   22.0   20.0

Another way to do this, besides manually rebuilding the group without the current value for each value, is to build the intermediate matrix above and ask for the median of each column. This returns a Series as long as the group itself, which is supported by SeriesGroupBy.transform.

The steps to get the desired result are:

  • build the matrix by repeating the input group as many times as its length;
  • fill the diagonal of the matrix with NaNs;
  • ask for the median by row/column depending on how you built the matrix.
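To make the three steps above concrete, here is a minimal sketch on a plain NumPy array. It uses the values of column A for Time == 2.0 from the question; the variable names are only illustrative, and `np.nanmedian` stands in for a median that skips NaNs.

```python
import numpy as np

# Values of column A for the rows where Time == 2.0 (from the question).
group = np.array([np.nan, 25.0, 15.0, 19.0])

# Step 1: repeat the group once per element, giving a square matrix
# in which every row is a copy of the group.
matrix = np.tile(group, (len(group), 1))

# Step 2: put NaN on the diagonal, so row i no longer contains element i.
np.fill_diagonal(matrix, np.nan)

# Step 3: take the median of each row, ignoring the NaNs.
leave_one_out = np.nanmedian(matrix, axis=1)
print(leave_one_out)  # [19. 17. 22. 20.]
```

Here the matrix is built row-wise, so the medians are taken along axis=1; building it column-wise would call for axis=0 instead.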

The function that can be fed to transform may look like:

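def median_without_element(group):
    # One row per element; every row is a copy of the whole group
    matrix = pd.DataFrame([group] * len(group))
    # Put NaN on the diagonal so that row i excludes element i
    np.fill_diagonal(matrix.values, np.nan)
    # Row-wise medians; NaNs are skipped
    return matrix.median(axis=1)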
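Writing into `matrix.values` relies on that attribute exposing the DataFrame's own buffer, which may not hold for every pandas version or configuration. As a hedged alternative, the sketch below performs the same three steps on a NumPy array and returns a Series aligned with the group's index. The name `median_without_element_np` is hypothetical (it is not part of this answer), and the sketch assumes the values can be converted to float.

```python
import numpy as np
import pandas as pd


def median_without_element_np(group):
    """Leave-one-out median of a group (hypothetical variant, float data assumed)."""
    values = group.to_numpy(dtype=float)
    # np.tile returns a new array, so filling its diagonal cannot touch `group`.
    matrix = np.tile(values, (len(values), 1))
    np.fill_diagonal(matrix, np.nan)
    # DataFrame.median skips NaNs; a row made only of NaNs gives NaN.
    medians = pd.DataFrame(matrix).median(axis=1)
    # Re-attach the group's index so transform can align the result.
    return pd.Series(medians.to_numpy(), index=group.index)


df = pd.DataFrame({
    'Time': [0.0, 1.0, 2.0, 0.0, 1.0, 2.0, 0.0, 2.0, 0.0, 1.0, 2.0],
    'Id': [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4],
    'A': [10, 15, np.nan, 11, 16, 25, 10, 15, 9, 14, 19],
})
df['Feature_2'] = df.groupby('Time')['A'].transform(median_without_element_np)
print(df)
```

On the sample data this should reproduce the Feature_2 column of the question. Note that it excludes the current row rather than the current Id, which is the same thing only as long as each (Time, Id) pair appears once.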

Another advantage of this approach is that you can reuse the same groups of elements for both features, which avoids recomputing them again and again:

import numpy as np
import pandas as pd


def median_without_element(group):
    matrix = pd.DataFrame([group] * len(group))
    np.fill_diagonal(matrix.values, np.nan)
    return matrix.median(axis=1)


def compute_medians(dataframe, groups_column='Time', values_column='A'):
    # Group once and reuse the same groups for both features;
    # the new columns are added to `dataframe` in place
    groups = dataframe.groupby(groups_column)[values_column]
    dataframe['Feature_1'] = groups.transform('median')
    dataframe['Feature_2'] = groups.transform(median_without_element)


if __name__ == '__main__':
    df = pd.DataFrame({
        'Time': [0.0, 1.0, 2.0, 0.0, 1.0, 2.0, 0.0, 2.0, 0.0, 1.0, 2.0],
        'Id': [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4],
        'A': [10, 15, np.nan, 11, 16, 25, 10, 15, 9, 14, 19],
    })
    compute_medians(df)
    print(df)
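The matrix has as many rows and columns as the group has elements, so its size grows quadratically with the group. That is harmless for small groups. For large ones, a different technique (sorting, not the matrix approach used in this answer) can produce the same leave-one-out medians. The following is a rough sketch under that assumption; `leave_one_out_median` is a hypothetical name.

```python
import numpy as np


def leave_one_out_median(values):
    """Median of the other non-NaN values, for each position (sort-based sketch)."""
    values = np.asarray(values, dtype=float)
    out = np.full(values.shape, np.nan)
    valid = ~np.isnan(values)
    m = int(valid.sum())
    if m == 0:
        return out
    order = np.argsort(values[valid], kind='stable')
    s = values[valid][order]          # sorted non-NaN values
    # Removing a NaN leaves every valid value in place.
    out[~valid] = np.median(s)
    if m == 1:
        return out                    # the single valid value has no others
    rank = np.empty(m, dtype=int)
    rank[order] = np.arange(m)        # position of each valid value in s
    n = m - 1                         # number of values left after removal
    lo, hi = (n - 1) // 2, n // 2     # middle position(s) among the n left
    # Position j among the remaining values is s[j] when j lies before the
    # removed rank, and s[j + 1] otherwise.
    lo_idx = np.where(lo < rank, lo, lo + 1)
    hi_idx = np.where(hi < rank, hi, hi + 1)
    out[valid] = (s[lo_idx] + s[hi_idx]) / 2
    return out


result = leave_one_out_median([np.nan, 25.0, 15.0, 19.0])
print(result)  # [19. 17. 22. 20.]
```

It could be passed to transform through something like `lambda g: leave_one_out_median(g.to_numpy())`. Whether it is actually faster than the matrix version depends on the group sizes, so it is worth timing on the real data.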
answered May 14, 2018 at 14:53
\$\endgroup\$
4
  • \$\begingroup\$ Thanks, It works ;). I have one question though: in the compute_medians function, shouldn't you replace df by dataframe ? \$\endgroup\$ Commented May 14, 2018 at 16:44
  • \$\begingroup\$ @Romain Absolutely, fixed it. \$\endgroup\$ Commented May 14, 2018 at 17:09
  • \$\begingroup\$ cool solution. nit: the method compute_medians should have a return statement, e.g. return None or return dataframe. \$\endgroup\$ Commented Nov 8, 2019 at 20:02
  • \$\begingroup\$ @Quetzalcoatl why add return None? It's just noise as it's the default value produced when there is no return statement. Why return dataframe? Since we are modifying the provided value in-place this could lure the caller into thinking this is a modified copy. \$\endgroup\$ Commented Nov 8, 2019 at 20:13
