I'm new to Python and pandas.

I would like to use pandas groupby() to flag values in a DataFrame that are outliers. I think I've got it working, but as I'm new to Python, I wanted to ask whether there is a more obvious / Pythonic approach.

Given input data with two groups and two variables, x and y:

import numpy as np
import pandas as pd

n=10000
df= pd.DataFrame({'key': ['a']*n+['b']*n
                  ,"x" : np.hstack((
                      np.random.normal(10, 1.0, size=n)
                      ,np.random.normal(100, 1.0, size=n)
                  ))
                  ,"y" : np.hstack((
                      np.random.normal(20, 1.0, size=n)
                      ,np.random.normal(200, 1.0, size=n)
                  ))
                 })

To identify outliers I need to calculate the quartiles and the inter-quartile range (IQR) for each group, and from those the limits. It seemed reasonable to create a function:

def get_outlier(x,tukeymultiplier=2):
    Q1=x.quantile(.25)
    Q3=x.quantile(.75)
    IQR=Q3-Q1
    lowerlimit = Q1 - tukeymultiplier*IQR
    upperlimit = Q3 + tukeymultiplier*IQR
    return (x<lowerlimit) | (x>upperlimit)

Then I use groupby() and call the function via transform(), e.g.:

g=df.groupby('key')[['x','y']]
df['x_outlierflag']=g.x.transform(get_outlier)
df['y_outlierflag']=g.y.transform(get_outlier)
df.loc[df.x_outlierflag==True]
df.loc[df.y_outlierflag==True]

I'm not worried about performance at this point, because the data are small, but I'm not sure whether there is a more natural way to do this. For example, it's not clear to me how apply() differs from transform(); my rough understanding is sketched below, but I may well be wrong. Is there an apply() approach that would be better?
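A minimal sketch of how I currently think they differ, using a hypothetical group mean:

g = df.groupby('key')['x']

# transform() returns a result aligned with the original rows (one value
# per row), so it can be assigned straight back as a new column of df.
row_aligned = g.transform('mean')

# apply() returns whatever shape the function produces; with a reducing
# function like this it gives one value per group, indexed by the key.
per_group = g.apply(lambda s: s.mean())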

Is this approach reasonably pythonic / in line with best practices? I would like to stick with pandas. I realize there are SQL approaches etc.

asked Jun 9, 2023 at 16:21

1 Answer

Consider writing the sample size with a thousands separator, i.e. 10_000.

When generating random sample data for this kind of application, always set a constant seed.

There's a lot of non-standard formatting here; run a PEP8 linter and/or use PyCharm. The only thing I'll call out specifically is the leading commas, as in

,"y" : np.hstack((

Aside from being non-PEP8-compliant, it's just not legible.
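For comparison, the same entry with conventional trailing commas (spacing here is only illustrative):

'y': np.hstack((
    np.random.normal(20, 1.0, size=n),
    np.random.normal(200, 1.0, size=n),
)),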

Your use of groupby didn't actually work for me (different version of Pandas?), but your [['x', 'y']] just doesn't make sense since you re-index for those column names again later. You can just delete that index operation.

Don't transform on a custom function if you can avoid it; it breaks vectorisation. Instead, transform on only quantile, which is built into Pandas and should run more quickly; this also means you don't have to write your own inner function.

Don't mutate a function argument in-place if you can avoid it.

Don't operate on x and y specifically. For this application, just broadcast to all non-key columns.

Suggested

import numpy as np
import pandas as pd
from numpy.random import default_rng


def flag_outliers(df: pd.DataFrame, tukey_multiplier: float = 2) -> pd.DataFrame:
    data_cols = df.columns[df.columns != 'key']
    groups = df.groupby('key')
    q1 = groups.transform('quantile', .25)
    q3 = groups.transform('quantile', .75)
    iqr = q3 - q1
    lower_limit = q1 - tukey_multiplier*iqr
    upper_limit = q3 + tukey_multiplier*iqr
    is_outlier = (df[data_cols] < lower_limit) | (df[data_cols] > upper_limit)
    is_outlier.columns = data_cols + '_outlier_flag'
    return is_outlier


def test() -> None:
    n = 10_000
    rand = default_rng(seed=0)
    df = pd.DataFrame({
        'key': ['a']*n + ['b']*n,
        'x': np.hstack((
            rand.normal(10, 1.0, size=n),
            rand.normal(100, 1.0, size=n),
        )),
        'y': np.hstack((
            rand.normal(20, 1.0, size=n),
            rand.normal(200, 1.0, size=n),
        )),
    })

    outliers = flag_outliers(df)
    # optionally pd.concat() outliers to df
    print(outliers)


if __name__ == '__main__':
    test()
answered Jun 9, 2023 at 23:32
  • Thanks, there is a lot there for me to learn. I'm a fan of the leading comma from SQL, but it's helpful to see that the extra trailing commas don't cause problems. And your indenting is better. Can you explain where I mutated a function argument in-place? I used [['x','y']] on my groupby because the data frame has some non-numeric columns. Maybe in your approach I should use data_cols = df.drop(columns='key').select_dtypes(include=np.number).columns? Commented Jun 10, 2023 at 18:28
  • "Where I mutated a function argument in-place": df['x_outlierflag']= (assuming that's in a function, which it should be). Or, interpreted differently, you didn't mutate an input, but you also didn't capture enough of your code in a function. Commented Jun 10, 2023 at 22:14
  • "I used [['x','y']] on my groupby because the data frame has some non-numeric columns": fine, but you should do that before your group operation, i.e. before passing the frame to the function. Preferably that, instead of select_dtypes. Commented Jun 10, 2023 at 22:15
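To make the last comment concrete, a minimal sketch (using the column names from the question):

# select the key plus the numeric columns of interest before grouping,
# rather than having flag_outliers() call select_dtypes() internally
outliers = flag_outliers(df[['key', 'x', 'y']])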
