I have the following DataFrame:
import pandas as pd
df_dict = {"week": [1, 1, 1, 4, 5],
           "store": ["A", "B", "C", "A", "C"],
           "var": [1, 1, 1, 1, 1]}
df = pd.DataFrame(df_dict)
week store var
0 1 A 1
1 1 B 1
2 1 C 1
3 4 A 1
4 5 C 1
My goal is to sum var for stores A and C together by week, but not for store B. To do that, I relabel A and C as X:
df["store"] = df["store"].str.replace("A","X")
df["store"] = df["store"].str.replace("C","X")
And this is the final output:
df.groupby(by=["week","store"]).sum().reset_index()
week store var
0 1 B 1
1 1 X 2
2 4 X 1
3 5 X 1
The code works perfectly, but I am pretty sure there is a better way to do this in pandas.
1 Answer
Probably my biggest issue with the current implementation is that X is not a very good placeholder (NaN would be preferable), and that you still run the sum over B even though it needs no aggregation. I would sooner split the data into grouped and non-grouped rows, with no store replacement:
import pandas as pd

df = pd.DataFrame({
    "week": (1, 1, 1, 4, 5),
    "store": ("A", "B", "C", "A", "C"),
    "var": (1, 1, 1, 1, 1),
})

should_group = df.store != 'B'

totals = (
    df[should_group]
    .groupby('week')['var']
    .sum()
    .reset_index()
)

result = pd.concat((
    df[~should_group], totals
), ignore_index=True)
print(result)
week store var
0 1 B 1
1 1 NaN 2
2 4 NaN 1
3 5 NaN 1
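For comparison, if a readable label for the pooled stores is wanted after all, a single groupby can be kept by building the grouping key separately instead of overwriting the store column. This is only a sketch: the `pooled` set and the "A+C" label are hypothetical names chosen for illustration, not part of the original code.

```python
import pandas as pd

df = pd.DataFrame({
    "week": (1, 1, 1, 4, 5),
    "store": ("A", "B", "C", "A", "C"),
    "var": (1, 1, 1, 1, 1),
})

# Hypothetical: the stores to pool and the label for the pooled group.
pooled = {"A", "C"}
pooled_label = "A+C"

# Keep the store name where it is NOT pooled; use the label elsewhere.
# df["store"] itself is left untouched.
key = df["store"].where(~df["store"].isin(pooled), pooled_label)

result = (
    df.groupby(["week", key])["var"]
    .sum()
    .reset_index()
)
print(result)
```

Unlike the split-and-concat version, this still passes B through the sum, so it trades that objection for a shorter pipeline and a single relabelling step in place of two str.replace calls.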
You are right about the X issue; I omitted some details. The reason is that X, in my actual problem, is just the aggregation of two stores, and at the end I have to save an .xlsx file where this is understandable (e.g. in week 1, X contributed 2). But I can add a .replace() at the end, after the manipulation. – Andrea Ciufo, Jul 31, 2022 at 8:36
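The labelling step mentioned in the comment above could look roughly like this. It is a sketch that assumes the pooled rows are exactly the ones with a missing store, so fillna is used here rather than .replace(); the label "X" is taken from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "week": (1, 1, 1, 4, 5),
    "store": ("A", "B", "C", "A", "C"),
    "var": (1, 1, 1, 1, 1),
})

should_group = df["store"] != "B"

# Weekly totals for the pooled stores; as_index=False keeps week as a column.
totals = df[should_group].groupby("week", as_index=False)["var"].sum()

result = pd.concat((df[~should_group], totals), ignore_index=True)

# Label the pooled rows only at the very end, just before export.
# Assumption: every NaN in "store" comes from the pooled totals.
result["store"] = result["store"].fillna("X")
print(result)
```

The labelled frame can then be written out with result.to_excel(...), which needs an Excel engine such as openpyxl to be installed.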