Compute conditional median of PANDAS dataframe

Question 1

I am new to Python/Pandas. Consider the following code:

import pandas as pd
import numpy as np
df = pd.DataFrame({'Time': [0.0, 1.0, 2.0, 0.0, 1.0, 2.0, 0.0, 2.0, 0.0, 1.0, 2.0],
                   'Id': [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4],
                   'A': [10, 15, np.nan, 11, 16, 25, 10, 15, 9, 14, 19]})
print(df)

Output:

       A  Id  Time
0   10.0   1   0.0
1   15.0   1   1.0
2    NaN   1   2.0
3   11.0   2   0.0
4   16.0   2   1.0
5   25.0   2   2.0
6   10.0   3   0.0
7   15.0   3   2.0
8    9.0   4   0.0
9   14.0   4   1.0
10  19.0   4   2.0

I want to add a column Feature_1 which, for each row of the dataframe, holds the median of column A over ALL the rows that have the same Time value. This can be done as follows:

df['Feature_1'] = df.groupby('Time')['A'].transform('median')
print(df)

Output:

       A  Id  Time  Feature_1
0   10.0   1   0.0       10.0
1   15.0   1   1.0       15.0
2    NaN   1   2.0       19.0
3   11.0   2   0.0       10.0
4   16.0   2   1.0       15.0
5   25.0   2   2.0       19.0
6   10.0   3   0.0       10.0
7   15.0   3   2.0       19.0
8    9.0   4   0.0       10.0
9   14.0   4   1.0       15.0
10  19.0   4   2.0       19.0

My problem now is to compute another feature, Feature_2, which, for each row of the dataframe, holds the median of column A over the OTHER rows (those with a different Id) that have the same Time value. I was not able to vectorize this, so my solution uses a for loop:

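df['Feature_2'] = np.nan
for i in range(len(df)):
    current_Id = df.Id[i]
    current_time = df.Time[i]
    # Rows with the same Time value that belong to a different Id
    idx = (df.Time == current_time) & (df.Id != current_Id)
    if idx.any():
        # .loc avoids chained assignment, which may write to a temporary copy
        df.loc[i, 'Feature_2'] = df.A[idx].median()
print(df)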

Output:

       A  Id  Time  Feature_1  Feature_2
0   10.0   1   0.0       10.0       10.0
1   15.0   1   1.0       15.0       15.0
2    NaN   1   2.0       19.0       19.0
3   11.0   2   0.0       10.0       10.0
4   16.0   2   1.0       15.0       14.5
5   25.0   2   2.0       19.0       17.0
6   10.0   3   0.0       10.0       10.0
7   15.0   3   2.0       19.0       22.0
8    9.0   4   0.0       10.0       10.0
9   14.0   4   1.0       15.0       15.5
10  19.0   4   2.0       19.0       20.0

This works, but it is very slow because my dataframe has 1 million rows (but only four different IDs).

Is it possible to vectorize the creation of Feature_2?

I hope I am clear enough. Live code can be found here.


score 4 · Accepted Answer · 2018-05-14 14:53:27Z

So, you want to get the medians of the groups by removing each value from the group in turn:

 group          individual removal of values
   NaN           [ ]    NaN    NaN    NaN
  25.0    =>    25.0    [ ]   25.0   25.0
  15.0          15.0   15.0    [ ]   15.0
  19.0          19.0   19.0   19.0    [ ]
median 19.0     19.0   17.0   22.0   20.0

Another way of doing this, besides manually reconstructing the group without the current value for each value, is to build the intermediate matrix above and ask for the median of each column. This returns a Series with the same length as the group, which SeriesGroupBy.transform supports.

The steps to get the desired result are:

1. build the matrix by repeating the input group as many times as its length;
2. fill the diagonal of the matrix with NaNs;
3. ask for the median by row or by column, depending on how you built the matrix.

The function that can be fed to transform may look like:

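def median_without_element(group):
    # One row per element of the group, each row a copy of the whole group
    matrix = pd.DataFrame([group] * len(group))
    # Blank out the diagonal so that row i ignores element i
    np.fill_diagonal(matrix.values, np.nan)
    # Row-wise median; NaNs are skipped by default
    return matrix.median(axis=1)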
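
To make the three steps concrete, here is a minimal NumPy-only sketch applied to the Time == 2.0 group from the illustration above. It is an illustration of the idea rather than the function itself: np.tile and np.nanmedian are swapped in for the DataFrame construction and DataFrame.median.

```python
import numpy as np

# Values of column A for the rows with Time == 2.0 (Ids 1, 2, 3, 4).
group = np.array([np.nan, 25.0, 15.0, 19.0])

# Step 1: repeat the group once per element, giving a square matrix
# in which every row is a copy of the group.
matrix = np.tile(group, (len(group), 1))

# Step 2: blank out the diagonal so that row i no longer contains element i.
np.fill_diagonal(matrix, np.nan)

# Step 3: NaN-aware median of each row.
medians = np.nanmedian(matrix, axis=1)
print(medians)  # [19. 17. 22. 20.]
```

The four values match the Feature_2 column of the question for the rows with Time == 2.0.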

Another advantage of this approach is that you can reuse the same groups of elements, which avoids recomputing them again and again:

import numpy as np
import pandas as pd


def median_without_element(group):
    # One row per element of the group, each row a copy of the whole group
    matrix = pd.DataFrame([group] * len(group))
    # Blank out the diagonal so that row i ignores element i
    np.fill_diagonal(matrix.values, np.nan)
    # Row-wise median; NaNs are skipped by default
    return matrix.median(axis=1)


def compute_medians(dataframe, groups_column='Time', values_column='A'):
    # Build the groups once and reuse them for both features
    groups = dataframe.groupby(groups_column)[values_column]
    dataframe['Feature_1'] = groups.transform('median')
    dataframe['Feature_2'] = groups.transform(median_without_element)


if __name__ == '__main__':
    df = pd.DataFrame({
        'Time': [0.0, 1.0, 2.0, 0.0, 1.0, 2.0, 0.0, 2.0, 0.0, 1.0, 2.0],
        'Id': [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4],
        'A': [10, 15, np.nan, 11, 16, 25, 10, 15, 9, 14, 19],
    })
    compute_medians(df)
    print(df)
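
If you want to convince yourself that the grouped version agrees with the loop from the question, a check along the following lines may help. It is a sketch under two assumptions: the transform function is a NumPy-only variant that returns a plain array (a substitution, not the exact function above), and loop_baseline is a hypothetical helper that restates the question's loop. The two only coincide when each (Time, Id) pair occurs at most once, since the loop drops every row with the same Id while the matrix drops only the row itself.

```python
import numpy as np
import pandas as pd


def median_without_element(group):
    """NumPy variant (assumption): leave-one-out medians as a plain array."""
    values = group.to_numpy(dtype=float)
    matrix = np.tile(values, (len(values), 1))
    np.fill_diagonal(matrix, np.nan)
    # A group of length 1 gives an all-NaN row: the result is NaN,
    # and NumPy emits a RuntimeWarning.
    return np.nanmedian(matrix, axis=1)


def loop_baseline(df):
    """Hypothetical helper: the question's row-by-row approach, for reference."""
    result = pd.Series(np.nan, index=df.index)
    for i in df.index:
        others = (df['Time'] == df.loc[i, 'Time']) & (df['Id'] != df.loc[i, 'Id'])
        if others.any():
            result.loc[i] = df.loc[others, 'A'].median()
    return result


df = pd.DataFrame({
    'Time': [0.0, 1.0, 2.0, 0.0, 1.0, 2.0, 0.0, 2.0, 0.0, 1.0, 2.0],
    'Id': [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4],
    'A': [10, 15, np.nan, 11, 16, 25, 10, 15, 9, 14, 19],
})

vectorized = df.groupby('Time')['A'].transform(median_without_element)
expected = loop_baseline(df)
print(np.allclose(vectorized, expected, equal_nan=True))  # True
```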

Thanks, it works ;). I have one question though: in the compute_medians function, shouldn't you replace df with dataframe?
@Romain Absolutely, fixed it.
Cool solution. Nit: the method compute_medians should have a return statement, e.g. return None or return dataframe.
@Quetzalcoatl Why add return None? It's just noise, as it's the default value produced when there is no return statement. Why return dataframe? Since we are modifying the provided value in place, this could lure the caller into thinking it is a modified copy.
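
The distinction drawn in the last comment can be seen in a small sketch. Both function names are hypothetical and only Feature_1 is computed to keep it short: the first function mutates its argument and implicitly returns None, the second leaves the input untouched and returns a new frame via DataFrame.assign.

```python
import pandas as pd


def compute_feature_1(dataframe):
    """Mutates the caller's dataframe; implicitly returns None."""
    dataframe['Feature_1'] = dataframe.groupby('Time')['A'].transform('median')


def with_feature_1(dataframe):
    """Hypothetical alternative: returns a new frame, input left untouched."""
    medians = dataframe.groupby('Time')['A'].transform('median')
    return dataframe.assign(Feature_1=medians)


df = pd.DataFrame({'Time': [0.0, 0.0, 1.0], 'A': [10.0, 12.0, 14.0]})

new_df = with_feature_1(df)
print('Feature_1' in df.columns)      # False: the original is unchanged
print('Feature_1' in new_df.columns)  # True

result = compute_feature_1(df)
print(result is None)                 # True: nothing is returned
print('Feature_1' in df.columns)      # True: the original was modified
```

Mixing the two styles, mutating in place and also returning the same object, is what would make the call site ambiguous.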
