Grouping and summing only n variables of m with m>n using a column as key in pandas

Question 1

I have the following df

import pandas as pd

df_dict = {"week": [1, 1, 1, 4, 5],
           "store": ["A", "B", "C", "A", "C"],
           "var": [1, 1, 1, 1, 1]}
df = pd.DataFrame(df_dict)
   week store  var
0     1     A    1
1     1     B    1
2     1     C    1
3     4     A    1
4     5     C    1

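My goal is to sum var for stores A and C together by week, while leaving store B out of the combined total. To do that, I first replace A and C with a common placeholder, X: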

df["store"] = df["store"].str.replace("A","X")
df["store"] = df["store"].str.replace("C","X")

And this is the final output

df.groupby(by=["week","store"]).sum().reset_index()
   week store  var
0     1     B    1
1     1     X    2
2     4     X    1
3     5     X    1

The code works perfectly, but I am pretty sure there is a better way to do this in pandas.
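For illustration, here is a minimal sketch of one possible alternative that keeps the X label but does not overwrite the store column. It builds a separate grouping key with Series.where and passes it to groupby. The names merged and key are made up for this example, and this is only one of several ways to do it, not necessarily the preferred one.

```python
import pandas as pd

df = pd.DataFrame({
    "week": [1, 1, 1, 4, 5],
    "store": ["A", "B", "C", "A", "C"],
    "var": [1, 1, 1, 1, 1],
})

# Hypothetical names: the stores to be combined, taken from the question.
merged = ["A", "C"]

# Keep the store name where it is NOT in `merged`; use "X" everywhere else.
# This returns a new Series, so df["store"] itself is left unchanged.
key = df["store"].where(~df["store"].isin(merged), "X")

# Group by the week column and the derived key, then sum var.
grouped = df.groupby(["week", key])["var"].sum().reset_index()
print(grouped)
```

Because the key is a separate Series, the original store values remain available for any later step.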

Reinderien · Accepted Answer · 2022-07-30 14:54:43Z

Probably my biggest issue with the current implementation is that X is not a very good placeholder (we should prefer NaN instead), and that you still run a sum over B even though B is never combined with anything. I would sooner split the data into grouped and non-grouped rows, with no store replacement:

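import pandas as pd

df = pd.DataFrame({
    "week": (1, 1, 1, 4, 5),
    "store": ("A", "B", "C", "A", "C"),
    "var": (1, 1, 1, 1, 1),
})

# Boolean mask: True for rows that should be summed together (everything but B)
should_group = df.store != 'B'

# Sum var per week over the grouped rows only
totals = (
    df[should_group]
    .groupby('week')['var']
    .sum()
    .reset_index()
)

# Stack the untouched B rows on top of the weekly totals;
# totals has no store column, so store is NaN for those rows
result = pd.concat((
    df[~should_group], totals
), ignore_index=True)
print(result)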

   week store  var
0     1     B    1
1     1   NaN    2
2     4   NaN    1
3     5   NaN    1

You are right about the X issue; I omitted some details. In my actual problem, X is just the aggregation of two stores, and at the end I have to save an .xlsx file in which this is understandable (e.g. in week 1, X contributed 2). But I can add a .replace() at the end, after the manipulation.
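As a rough sketch of that last step, the NaN store values produced by the split-and-concat approach could be filled with a readable label just before export. This uses fillna rather than .replace(), which is a substitution on my part. The label "X" comes from the question, while the filename in the commented-out line is purely hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "week": (1, 1, 1, 4, 5),
    "store": ("A", "B", "C", "A", "C"),
    "var": (1, 1, 1, 1, 1),
})

# Same split as in the answer: sum everything except B, per week.
should_group = df["store"] != "B"
totals = df[should_group].groupby("week")["var"].sum().reset_index()
result = pd.concat((df[~should_group], totals), ignore_index=True)

# Label the aggregated rows only at the very end, for presentation.
# "X" is the placeholder from the question; any descriptive label would do.
result["store"] = result["store"].fillna("X")
print(result)

# Writing the file is left commented out: the filename is hypothetical, and
# to_excel needs an Excel writer engine such as openpyxl to be installed.
# result.to_excel("weekly_totals.xlsx", index=False)
```

Keeping NaN during the computation and labelling only at the end means the placeholder never takes part in the grouping logic itself.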
