Replace categorical values in a DataFrame with their frequency of label = 1

Question 1

Here is my dataframe:

import pandas as pd

data = [['a1','b1',0], ['a2','b3',0], ['a1','b2',1], ['a1','b1',1], ['a2','b3',0]]
df = pd.DataFrame(data=data, columns=['A','B','label'])

Except for the 'label' column, every column holds categorical (string) values. I want to replace each category with a number: the fraction of its rows that have label = 1, e.g.:

n(a1) = count(A == 'a1' & label == 1) / count(A == 'a1')

My current approach is clumsy: I iterate over the columns to build a dictionary, then use that dictionary to replace the values in df. Is there a simpler way?

dic = {}
for col, value in df.items():  # items() replaces iteritems(), removed in pandas 2.0
    if col != 'label':
        for cat in value.unique():
            count = df[value == cat].shape[0]
            positive = df[(value == cat) & (df['label'] == 1)].shape[0]
            dic[cat] = positive / count
df.replace(dic, inplace=True)

My goal is to make this code more concise, since I naively iterate over columns and values. I believe pandas has convenient functions for this.

Answer 1

Avoid for-loops and avoid unique() in this case. Fundamentally you're doing a grouped count, so use pandas' built-in grouping support, which is vectorised. Because your numerator filters on label, after grouping you need to join (merge) and then fillna for the categories that have no label = 1 rows.

Don't construct a dict manually. Once you have a replacement frame indexed by the original values, you can just call to_dict().

import pandas as pd

df = pd.DataFrame({
    'A': ('a1', 'a2', 'a1', 'a1', 'a2'),
    'B': ('b1', 'b3', 'b2', 'b1', 'b3'),
    'label': (0, 0, 1, 1, 0),
})

def make_counts(col: str) -> pd.Series:
    grouped = df.groupby([col, 'label'])[col].count()
    positive = grouped.loc[:, 1].groupby(level=col).sum().rename('positive')
    count = grouped.groupby(level=col).sum().rename('count_')
    fractions = pd.merge(
        positive, count, how='right', left_index=True, right_index=True,
    )
    replacement = (fractions.positive.fillna(0) / fractions.count_).to_dict()
    return df[col].replace(replacement)

for col in ('A', 'B'):
    df[col] = make_counts(col)
print(df)
'''
          A    B  label
0  0.666667  0.5      0
1  0.000000  0.0      0
2  0.666667  1.0      1
3  0.666667  0.5      1
4  0.000000  0.0      0
'''
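
As a side note, the merge and fillna steps may be unnecessary if the comparison against 1 is done before grouping, because categories with no positive rows then sum to 0 instead of going missing. The following is only a minimal sketch under that assumption; the helper name positive_table and its fraction column are made up for illustration and are not part of the code above.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ('a1', 'a2', 'a1', 'a1', 'a2'),
    'B': ('b1', 'b3', 'b2', 'b1', 'b3'),
    'label': (0, 0, 1, 1, 0),
})


def positive_table(frame: pd.DataFrame, col: str) -> pd.DataFrame:
    # Hypothetical helper: per-category positive count, row count and ratio.
    is_positive = frame['label'].eq(1)  # True where label == 1
    # Named aggregation on a grouped Series: one output column per keyword.
    table = is_positive.groupby(frame[col]).agg(positive='sum', count_='size')
    table['fraction'] = table['positive'] / table['count_']
    return table


table_a = positive_table(df, 'A')
print(table_a)

# Series.map looks each category up in the index of the fraction column.
encoded_a = df['A'].map(table_a['fraction'])
print(encoded_a)
```

Mapping one column at a time also keeps a category name that happens to appear in two columns from being replaced with the wrong column's fraction, which a single frame-wide replacement dictionary cannot guarantee.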

Assuming that "whatever you're actually doing" still uses only 0 or 1 for your labels, you should actually re-interpret this as a grouped mean rather than a grouped count:

import pandas as pd

df = pd.DataFrame({
    'A': ('a1', 'a2', 'a1', 'a1', 'a2'),
    'B': ('b1', 'b3', 'b2', 'b1', 'b3'),
    'label': (0, 0, 1, 1, 0),
})

def make_counts(col: str) -> pd.Series:
    return df.groupby(col).label.transform('mean')

for col in ('A', 'B'):
    df[col] = make_counts(col)
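
If the 0-or-1 assumption does not hold, the grouped mean can still be used by first turning the label into an indicator of "equals the positive value". This is a rough sketch rather than a definitive implementation: the function name encode_positive_rate and its label and positive parameters are assumptions introduced here, and it returns a new frame instead of modifying df in place.

```python
import pandas as pd


def encode_positive_rate(
    frame: pd.DataFrame, label: str = 'label', positive=1,
) -> pd.DataFrame:
    # Hypothetical helper: replace every non-label column by the fraction
    # of rows in the same category whose label equals `positive`.
    is_positive = frame[label].eq(positive).astype(float)  # 1.0 or 0.0 per row
    encoded = frame.copy()
    for col in frame.columns.drop(label):
        # The mean of a 0/1 indicator within a group is the positive fraction.
        encoded[col] = is_positive.groupby(frame[col]).transform('mean')
    return encoded


df = pd.DataFrame({
    'A': ('a1', 'a2', 'a1', 'a1', 'a2'),
    'B': ('b1', 'b3', 'b2', 'b1', 'b3'),
    'label': (0, 0, 1, 1, 0),
})
encoded = encode_positive_rate(df)
print(encoded)
```

The loop here runs over column names only, so the per-row work is still done by the vectorised groupby.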

Reinderien · Answer 1 · 2022-06-04 14:42:27Z
