I'm new to Python and pandas. I would like to use pandas groupby() to flag values in a DataFrame that are outliers. I think I've got it working, but as I'm new to Python, I wanted to ask if there is a more obvious / Pythonic approach.
Given input data with two groups, two variables X and Y:
import numpy as np
import pandas as pd

n=10000
df= pd.DataFrame({'key': ['a']*n+['b']*n
    ,"x" : np.hstack((
        np.random.normal(10, 1.0, size=n)
        ,np.random.normal(100, 1.0, size=n)
        ))
    ,"y" : np.hstack((
        np.random.normal(20, 1.0, size=n)
        ,np.random.normal(200, 1.0, size=n)
        ))
    })
To identify outliers, I need to calculate the quartiles and interquartile range for each group in order to derive the limits. It seemed reasonable to create a function:
def get_outlier(x,tukeymultiplier=2):
    Q1=x.quantile(.25)
    Q3=x.quantile(.75)
    IQR=Q3-Q1
    lowerlimit = Q1 - tukeymultiplier*IQR
    upperlimit = Q3 + tukeymultiplier*IQR
    return (x<lowerlimit) | (x>upperlimit)
And then use groupby() and call the function via transform, e.g.:
g=df.groupby('key')[['x','y']]
df['x_outlierflag']=g.x.transform(get_outlier)
df['y_outlierflag']=g.y.transform(get_outlier)
df.loc[df.x_outlierflag==True]
df.loc[df.y_outlierflag==True]
I'm not worried about performance at this point, because the data are small, but I'm not sure whether there is a more natural way to do this. For example, it's not clear to me how apply() differs from transform(). Is there an apply() approach that would be better?
Is this approach reasonably pythonic / in line with best practices? I would like to stick with pandas. I realize there are SQL approaches etc.
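For reference on the apply() versus transform() question, here is a minimal sketch on a made-up four-row frame (the data and variable names are assumptions for illustration only). transform() returns a result aligned to the original rows, while apply() is more general and, with a function that reduces each group to a scalar, returns one value per group:

```python
import pandas as pd

# Tiny illustrative frame; the values are made up.
df = pd.DataFrame({'key': ['a', 'a', 'b', 'b'],
                   'x': [1.0, 2.0, 10.0, 20.0]})
g = df.groupby('key')['x']

# transform: the group mean is broadcast back to every row of the group,
# so the result has the same index as df.
per_row = g.transform('mean')

# apply with a reducing function: one value per group, indexed by key.
per_group = g.apply(lambda s: s.mean())

print(per_row.tolist())    # [1.5, 1.5, 15.0, 15.0]
print(per_group.to_dict())  # {'a': 1.5, 'b': 15.0}
```

Because an outlier flag is wanted for every row, a row-aligned result is what this task needs.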
1 Answer
Consider writing 10000 with a thousands separator, as 10_000.
When generating random sample data for this kind of application, always set a constant seed.
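As a minimal sketch of why a constant seed helps (the names first, second and other are illustrative assumptions), two generators built from the same seed produce identical samples, so a run can be reproduced exactly:

```python
import numpy as np
from numpy.random import default_rng

# Two generators built from the same seed yield identical draws.
first = default_rng(seed=0).normal(10, 1.0, size=5)
second = default_rng(seed=0).normal(10, 1.0, size=5)
print(np.array_equal(first, second))  # True

# A different seed yields a different sample.
other = default_rng(seed=1).normal(10, 1.0, size=5)
print(np.array_equal(first, other))  # False
```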
There's a lot of formatting here that's non-standard; run a PEP8 linter and/or use PyCharm. The only thing I'll call out specifically is leading commas like
,"y" : np.hstack((
Aside from being non-PEP8-compliant, it's just not legible.
Your use of groupby didn't actually work for me (different version of Pandas?), but your [['x', 'y']] just doesn't make sense, since you re-index for those column names again later. You can just delete that index operation.
Don't transform on a custom function if you can avoid it; it breaks vectorisation. Instead, transform on only quantile, which is built into Pandas and should run more quickly; this also means you don't have to write your own inner function.
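A rough sketch comparing the two routes on a single column, assuming a small seeded sample (the frame size and the names custom and builtin are illustrative; get_outlier is condensed from the question). Both should flag the same rows; only the way the quartiles are computed differs:

```python
import numpy as np
import pandas as pd
from numpy.random import default_rng

rand = default_rng(seed=0)
n = 1_000
df = pd.DataFrame({
    'key': ['a']*n + ['b']*n,
    'x': np.hstack((rand.normal(10, 1.0, size=n),
                    rand.normal(100, 1.0, size=n))),
})


def get_outlier(x, tukeymultiplier=2):
    # Custom per-group function, condensed from the question.
    q1 = x.quantile(.25)
    q3 = x.quantile(.75)
    iqr = q3 - q1
    return (x < q1 - tukeymultiplier*iqr) | (x > q3 + tukeymultiplier*iqr)


groups = df.groupby('key')['x']

# Route 1: transform with a Python-level function, called once per group.
custom = groups.transform(get_outlier)

# Route 2: transform with the built-in 'quantile', then plain
# vectorised comparisons on the whole column.
q1 = groups.transform('quantile', .25)
q3 = groups.transform('quantile', .75)
iqr = q3 - q1
builtin = (df['x'] < q1 - 2*iqr) | (df['x'] > q3 + 2*iqr)

# Prints whether the two routes flag exactly the same rows.
print((custom == builtin).all())
```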
Don't mutate a function argument in-place if you can avoid it.
Don't operate on x and y specifically. For this application, just broadcast to all non-key columns.
Suggested
import numpy as np
import pandas as pd
from numpy.random import default_rng


def flag_outliers(df: pd.DataFrame, tukey_multiplier: float = 2) -> pd.DataFrame:
    data_cols = df.columns[df.columns != 'key']
    groups = df.groupby('key')
    q1 = groups.transform('quantile', .25)
    q3 = groups.transform('quantile', .75)
    iqr = q3 - q1
    lower_limit = q1 - tukey_multiplier*iqr
    upper_limit = q3 + tukey_multiplier*iqr
    is_outlier = (df[data_cols] < lower_limit) | (df[data_cols] > upper_limit)
    is_outlier.columns = data_cols + '_outlier_flag'
    return is_outlier


def test() -> None:
    n = 10_000
    rand = default_rng(seed=0)
    df = pd.DataFrame({
        'key': ['a']*n + ['b']*n,
        'x': np.hstack((
            rand.normal(10, 1.0, size=n),
            rand.normal(100, 1.0, size=n),
        )),
        'y': np.hstack((
            rand.normal(20, 1.0, size=n),
            rand.normal(200, 1.0, size=n),
        )),
    })

    outliers = flag_outliers(df)
    # optionally pd.concat() outliers to df
    print(outliers)


if __name__ == '__main__':
    test()
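The optional pd.concat() step mentioned in the comment could look roughly like the sketch below. The small frame and the hand-written flag values are made up purely for illustration; they stand in for df and for the frame returned by flag_outliers(). Building a new combined frame leaves the input untouched:

```python
import pandas as pd

# Illustrative stand-ins; the values are made up.
df = pd.DataFrame({
    'key': ['a', 'a', 'b'],
    'x': [10.0, 99.0, 100.0],
    'y': [20.0, 21.0, 500.0],
})
outliers = pd.DataFrame({
    'x_outlier_flag': [False, True, False],
    'y_outlier_flag': [False, False, True],
})

# Join the flags to the data column-wise; df itself is not modified.
combined = pd.concat((df, outliers), axis=1)

# A boolean column can be used directly as a mask (no "== True" needed).
x_outlier_rows = combined[combined['x_outlier_flag']]

# Rows flagged in at least one column.
any_outlier_rows = combined[outliers.any(axis=1)]

print(x_outlier_rows)
print(any_outlier_rows)
```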
Thanks, there is a lot there for me to learn. I'm a fan of the leading comma from SQL, but it's helpful to see that the extra trailing commas don't cause problems. And your indenting is better. Can you explain where I mutated a function argument in-place? I used [['x','y']] on my groupby because the data frame has some non-numeric columns. Maybe in your approach, I should use data_cols = df.drop(columns='key').select_dtypes(include=np.number).columns? – Quentin, Jun 10, 2023 at 18:28
"where I mutated a function argument in-place" - df['x_outlierflag']= (assuming that's in a function, which it should be). Or, interpreted differently, you didn't mutate an input, but you also didn't capture enough of your code in a function. – Reinderien, Jun 10, 2023 at 22:14
"I used [['x','y']] on my groupby because the data frame has some non-numeric columns" - fine; but you should do that before your group operation, i.e. before passing the frame to the function. Preferably that instead of select_dtypes. – Reinderien, Jun 10, 2023 at 22:15
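A small sketch of the two column-selection options discussed in these comments, assuming a frame with one extra non-numeric column (the label column and all values are hypothetical). Either way, the selection happens before the group operation:

```python
import pandas as pd

df = pd.DataFrame({
    'key': ['a', 'a', 'b', 'b'],
    'label': ['p', 'q', 'r', 's'],  # hypothetical non-numeric column
    'x': [1.0, 2.0, 3.0, 4.0],
    'y': [5.0, 6.0, 7.0, 8.0],
})

# Option 1: explicit selection before grouping.
explicit = df[['key', 'x', 'y']]

# Option 2: dtype-based selection, as proposed in the first comment.
numeric_cols = df.drop(columns='key').select_dtypes(include='number').columns
by_dtype = df[['key', *numeric_cols]]

# The group operation then only ever sees the key and numeric columns.
q1 = explicit.groupby('key').transform('quantile', .25)
print(q1)
```

The explicit list states the intent outright, whereas the dtype-based version would silently pick up any numeric column added later.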