Flag tukey outliers using python pandas groupby

Question 1

I'm new to Python and pandas.

I would like to use pandas groupby() to flag values in a df that are outliers. I think I have it working, but as I'm new to Python, I wanted to ask whether there is a more obvious or more Pythonic approach.

Given input data with two groups, two variables X and Y:

import numpy as np
import pandas as pd

n=10000
df= pd.DataFrame({'key': ['a']*n+['b']*n
 ,"x" : np.hstack((
 np.random.normal(10, 1.0, size=n)
 ,np.random.normal(100, 1.0, size=n)
 ))
 ,"y" : np.hstack((
 np.random.normal(20, 1.0, size=n)
 ,np.random.normal(200, 1.0, size=n)
 )) 
 })

To identify outliers, I need to calculate the quartiles and interquartile range for each group and derive the limits from them. It seemed reasonable to create a function:

def get_outlier(x,tukeymultiplier=2):
 Q1=x.quantile(.25)
 Q3=x.quantile(.75)
 IQR=Q3-Q1
 lowerlimit = Q1 - tukeymultiplier*IQR
 upperlimit = Q3 + tukeymultiplier*IQR
 return (x<lowerlimit) | (x>upperlimit)

And then use groupby() and call the function via transform, e.g.:

g=df.groupby('key')[['x','y']]
df['x_outlierflag']=g.x.transform(get_outlier)
df['y_outlierflag']=g.y.transform(get_outlier)
df.loc[df.x_outlierflag==True]
df.loc[df.y_outlierflag==True]

I'm not worried about performance at this point, because the data are small. But I'm not sure whether there is a more natural way to do this. For example, it's not clear to me how apply() differs from transform(). Is there an apply() approach that would be better?

Is this approach reasonably Pythonic and in line with best practices? I would like to stick with pandas, though I realize there are SQL approaches and so on.

Answer

Consider writing n with a thousands separator: 10_000 instead of 10000.

When generating random sample data for this kind of application, always set a constant seed.
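As a minimal sketch of why a constant seed helps (the names rng_a, rng_b, sample_a and sample_b are only illustrative), two generators built from the same seed produce identical draws, so the sample data and any outliers flagged in it are the same on every run:

```python
import numpy as np
from numpy.random import default_rng

# Two independent generators created from the same constant seed.
rng_a = default_rng(seed=0)
rng_b = default_rng(seed=0)

sample_a = rng_a.normal(10, 1.0, size=5)
sample_b = rng_b.normal(10, 1.0, size=5)

# Same seed, same sequence of draws.
same = np.array_equal(sample_a, sample_b)
print(same)
```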

There's a lot of formatting here that's non-standard; run a PEP8 linter and/or use PyCharm. The only thing I'll call out specifically is leading commas like

,"y" : np.hstack((

Aside from being non-PEP8-compliant, it's just not legible.

Your use of groupby didn't actually work for me (different version of Pandas?), but your [['x', 'y']] just doesn't make sense since you re-index for those column names again later. You can just delete that index operation.

Don't transform on a custom function if you can avoid it; it breaks vectorisation. Instead, transform on only quantile which is built into Pandas and should run more quickly; this also means you don't have to write your own inner function.
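To illustrate the point, here is a small sketch, assuming a cut-down version of the sample data with only an x column. It computes the per-group first quartile once through a custom Python function and once through the built-in quantile passed by name, and checks that both agree. It does not measure timing; any speed difference would have to be benchmarked on real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
df = pd.DataFrame({
    'key': ['a']*100 + ['b']*100,
    'x': np.hstack((
        rng.normal(10, 1.0, size=100),
        rng.normal(100, 1.0, size=100),
    )),
})
groups = df.groupby('key')['x']

# Custom Python function: called once per group, scalar result is broadcast.
q1_custom = groups.transform(lambda s: s.quantile(.25))

# Built-in reduction selected by name; the extra argument is passed to quantile().
q1_builtin = groups.transform('quantile', .25)

# Both results are aligned to the original rows and hold the same values.
print(np.allclose(q1_custom, q1_builtin))
```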

Don't mutate a function argument in-place if you can avoid it.

Don't operate on x and y specifically. For this application, just broadcast to all non-key columns.

Suggested

import numpy as np
import pandas as pd
from numpy.random import default_rng


def flag_outliers(df: pd.DataFrame, tukey_multiplier: float = 2) -> pd.DataFrame:
    data_cols = df.columns[df.columns != 'key']
    groups = df.groupby('key')
    q1 = groups.transform('quantile', .25)
    q3 = groups.transform('quantile', .75)
    iqr = q3 - q1
    lower_limit = q1 - tukey_multiplier*iqr
    upper_limit = q3 + tukey_multiplier*iqr
    is_outlier = (df[data_cols] < lower_limit) | (df[data_cols] > upper_limit)
    is_outlier.columns = data_cols + '_outlier_flag'
    return is_outlier


def test() -> None:
    n = 10_000
    rand = default_rng(seed=0)
    df = pd.DataFrame({
        'key': ['a']*n + ['b']*n,
        'x': np.hstack((
            rand.normal(10, 1.0, size=n),
            rand.normal(100, 1.0, size=n),
        )),
        'y': np.hstack((
            rand.normal(20, 1.0, size=n),
            rand.normal(200, 1.0, size=n),
        )),
    })
    outliers = flag_outliers(df)
    # optionally pd.concat() outliers to df
    print(outliers)


if __name__ == '__main__':
    test()
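
The optional pd.concat() step mentioned in the comment could look roughly like the sketch below. The function body is repeated in compact form only so that the example runs standalone, and the small hand-made frame (with 50.0 as an obvious outlier in group 'a') is an assumption made for illustration.

```python
import numpy as np
import pandas as pd


def flag_outliers(df: pd.DataFrame, tukey_multiplier: float = 2) -> pd.DataFrame:
    # Same logic as the suggested function, repeated so this runs standalone.
    data_cols = df.columns[df.columns != 'key']
    groups = df.groupby('key')
    q1 = groups.transform('quantile', .25)
    q3 = groups.transform('quantile', .75)
    iqr = q3 - q1
    is_outlier = (
        (df[data_cols] < q1 - tukey_multiplier*iqr)
        | (df[data_cols] > q3 + tukey_multiplier*iqr)
    )
    is_outlier.columns = data_cols + '_outlier_flag'
    return is_outlier


# Hand-made data: 50.0 lies far outside group 'a'; group 'b' has no outliers.
df = pd.DataFrame({
    'key': ['a']*6 + ['b']*5,
    'x': [1.0, 2.0, 3.0, 4.0, 5.0, 50.0, 10.0, 11.0, 12.0, 13.0, 14.0],
})

flags = flag_outliers(df)

# Column-wise concat builds a new frame; df itself is left unchanged.
combined = pd.concat((df, flags), axis=1)

# Boolean indexing on the flag column shows only the flagged rows.
print(combined[combined['x_outlier_flag']])
```

Because the flags share the index of the input frame, the concat lines up row by row without any explicit join key.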

Comment 1

Thanks, there is a lot there for me to learn. I'm a fan of the leading comma from SQL. But helpful to see that the extra trailing commas don't cause problems. And your indenting is better. Can you explain where I mutated a function argument in-place? I used [['x','y']] on my groupby because the data frame has some non-numeric columns. Maybe in your approach, I should use data_cols = df.drop(columns='key').select_dtypes(include=np.number).columns ?

Comment 2

"where I mutated a function argument in-place": df['x_outlierflag']= (assuming that's in a function, which it should be). Or, interpreted differently, you didn't mutate an input, but you also didn't capture enough of your code in a function.
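
A minimal sketch of the difference, using two hypothetical helper functions (add_flag_in_place and add_flag_copy are illustrative names, not part of the code under review): assigning a column inside a function changes the caller's frame, while assign() returns a new frame and leaves the argument alone.

```python
import pandas as pd


def add_flag_in_place(df: pd.DataFrame) -> None:
    # Mutates the caller's frame: the new column is visible outside the function.
    df['big'] = df['x'] > 2


def add_flag_copy(df: pd.DataFrame) -> pd.DataFrame:
    # assign() returns a new frame and leaves the argument untouched.
    return df.assign(big=df['x'] > 2)


original = pd.DataFrame({'x': [1, 2, 3]})

flagged = add_flag_copy(original)
cols_after_copy = list(original.columns)      # still only the input column

add_flag_in_place(original)
cols_after_in_place = list(original.columns)  # now includes the flag column

print(cols_after_copy, cols_after_in_place)
```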

Comment 3

"I used [['x','y']] on my groupby because the data frame has some non-numeric columns": fine, but you should do that before your group operation, i.e. before passing the frame to the function. Prefer that over select_dtypes.
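
As a rough sketch of selecting columns up front (the label column and the tiny frame are assumptions made for illustration), the frame is narrowed by name before any grouping, so the per-group quantiles only ever see the key and the numeric columns:

```python
import pandas as pd

df = pd.DataFrame({
    'key': ['a', 'a', 'b', 'b'],
    'x': [1.0, 2.0, 3.0, 4.0],
    'y': [5.0, 6.0, 7.0, 8.0],
    'label': ['p', 'q', 'r', 's'],  # hypothetical non-numeric column
})

# Explicit selection by name, done before the group operation.
numeric_part = df[['key', 'x', 'y']]

# The grouped transform now only operates on the columns that were passed in.
q1 = numeric_part.groupby('key').transform('quantile', .25)
print(list(q1.columns))
```

The same narrowed frame could then be passed to an outlier-flagging function in place of the full one.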

Reinderien · Accepted Answer · 2023-06-09 23:32:06Z

