Here is my dataframe:
import pandas as pd

data = [['a1','b1',0], ['a2','b3',0], ['a1','b2',1], ['a1','b1',1], ['a2','b3',0]]
df = pd.DataFrame(data=data, columns=['A', 'B', 'label'])
Except for the 'label' column, every column holds categorical (string) values. I want to replace each categorical value with a number: the fraction of rows with that value where label is 1, e.g.:
n(a1) = count(A == 'a1' & label == 1) / count(A == 'a1')
I did this in a clumsy way, iterating over the columns to build a dictionary and then replacing the values in df through that dictionary. Is there a simpler way?
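dic = {}
# DataFrame.iteritems() was removed in pandas 2.0; items() is the equivalent.
for col, value in df.items():
    if col != 'label':
        for cat in value.unique():
            # rows in this category, and those among them with label == 1
            count = df[value == cat].shape[0]
            positive = df[(value == cat) & (df['label'] == 1)].shape[0]
            dic[cat] = positive / count
df.replace(dic, inplace=True)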
My goal is to make the code more concise, since I naively iterate over columns and values. I believe pandas has convenient functions for this.
1 Answer
Avoid for-loops and avoid unique() in this case. Fundamentally you're doing a grouped count, so use Pandas' built-in grouping support, which is vectorised. Since your numerator is filtering on label, after grouping you need to join (merge) and fillna on missing values that had no label=1.
Don't construct a dic manually. Once you have a replacement frame with a proper index based on the original values, you can just to_dict().
import pandas as pd

df = pd.DataFrame({
    'A': ('a1', 'a2', 'a1', 'a1', 'a2'),
    'B': ('b1', 'b3', 'b2', 'b1', 'b3'),
    'label': (0, 0, 1, 1, 0),
})


def make_counts(col: str) -> pd.Series:
    # Row count for every (category, label) pair
    grouped = df.groupby([col, 'label'])[col].count()
    # Numerator: rows with label == 1; denominator: all rows per category
    positive = grouped.loc[:, 1].groupby(level=col).sum().rename('positive')
    count = grouped.groupby(level=col).sum().rename('count_')
    # Right join keeps categories that have no label == 1 rows (NaN numerator)
    fractions = pd.merge(
        positive, count, how='right', left_index=True, right_index=True,
    )
    replacement = (fractions.positive.fillna(0) / fractions.count_).to_dict()
    return df[col].replace(replacement)


for col in ('A', 'B'):
    df[col] = make_counts(col)

print(df)
'''
A B label
0 0.666667 0.5 0
1 0.000000 0.0 0
2 0.666667 1.0 1
3 0.666667 0.5 1
4 0.000000 0.0 0
'''
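To see why the merge and fillna step in make_counts is needed, here is a minimal sketch for column A only. It selects the label == 1 counts with xs (an assumption on my part for readability; the code above uses .loc[:, 1] for the same selection) and prints the intermediate frame before the division.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ('a1', 'a2', 'a1', 'a1', 'a2'),
    'label': (0, 0, 1, 1, 0),
})

# Row count for every (A, label) pair that actually occurs
grouped = df.groupby(['A', 'label'])['A'].count()

# Numerator: only the label == 1 rows; 'a2' has none, so it is absent here
positive = grouped.xs(1, level='label').rename('positive')

# Denominator: all rows per category
count = grouped.groupby(level='A').sum().rename('count_')

# The right join keeps every category from count; 'a2' gets a NaN numerator
fractions = pd.merge(
    positive, count, how='right', left_index=True, right_index=True,
)
print(fractions)

# fillna(0) turns the missing numerator into a fraction of 0
replacement = (fractions['positive'].fillna(0) / fractions['count_']).to_dict()
print(replacement)
```

Without the right join, a2 would be dropped from the replacement mapping and its rows would keep their original string values.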
Assuming that "whatever you're actually doing" still uses only 0 or 1 for your labels, you should actually re-interpret this as a grouped mean rather than a grouped count, since the mean of 0/1 values is exactly the fraction of ones:
import pandas as pd

df = pd.DataFrame({
    'A': ('a1', 'a2', 'a1', 'a1', 'a2'),
    'B': ('b1', 'b3', 'b2', 'b1', 'b3'),
    'label': (0, 0, 1, 1, 0),
})


def make_counts(col: str) -> pd.Series:
    return df.groupby(col).label.transform('mean')


for col in ('A', 'B'):
    df[col] = make_counts(col)
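As a sanity check on the grouped-mean version, the sketch below compares it with an explicit positives-over-total ratio for each column. It writes into a copy named encoded (a hypothetical name, not from the code above) so the original strings stay available for the comparison.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ('a1', 'a2', 'a1', 'a1', 'a2'),
    'B': ('b1', 'b3', 'b2', 'b1', 'b3'),
    'label': (0, 0, 1, 1, 0),
})

# Work on a copy so df keeps its categorical strings
encoded = df.copy()
for col in ('A', 'B'):
    # Mean of a 0/1 label within each category, broadcast back to every row
    encoded[col] = df.groupby(col)['label'].transform('mean')

# Cross-check: sum of 0/1 labels divided by row count gives the same numbers
for col in ('A', 'B'):
    by_category = df['label'].groupby(df[col])
    ratio = by_category.transform('sum') / by_category.transform('count')
    assert (ratio - encoded[col]).abs().max() < 1e-12

print(encoded)
```

This equivalence holds only while label is strictly 0 or 1; with other label values the mean is no longer a fraction of positives.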