My goal is to aggregate a DataFrame with several million rows, including Decimal('...') columns, without any precision loss. My current implementation works, but might there be a more efficient way to get the job done?
Original DataFrame (sample)
game | position | amount |
---|---|---|
1 | a | 0.264825920790318533 |
1 | a | 0.136255564496617048 |
1 | a | 0.708422792050782986 |
1 | b | 0.102495961805507297 |
2 | a | 0.154110321496056788 |
2 | b | 0.281335658017562252 |
Aggregated DataFrame (sample)
game | amount_a | amount_b |
---|---|---|
1 | 1.109504277337718567 | 0.102495961805507297 |
2 | 0.154110321496056788 | 0.281335658017562252 |
My current approach
import pandas as pd
import numpy as np
from decimal import Decimal
from uuid import uuid4
# generating some sample DataFrame
df = pd.DataFrame(
    data={
        "game": [1, 1, 1, 1, 2, 2],
        "position": ["a", "a", "a", "b", "a", "b"],
        "amount": [
            Decimal(f"{uuid4().int}"[:18]) * (Decimal(10) ** -18) for _ in range(6)
        ],
    }
)
# first aggregation: sum the amounts per (game, position)
agg_df = df.groupby(["game", "position"], as_index=False).agg(
    {
        "amount": "sum",
    }
)
# spread the per-position sums into one column per position
agg_df["amount_a"] = np.where(agg_df["position"] == "a", agg_df["amount"], 0)
agg_df["amount_b"] = np.where(agg_df["position"] == "b", agg_df["amount"], 0)
agg_df.drop(["amount", "position"], axis=1, inplace=True)
# second aggregation: collapse the helper columns to one row per game
agg_df = agg_df.groupby(["game"], as_index=False).agg(
    {
        "amount_a": "sum",
        "amount_b": "sum",
    }
)
As far as I understand Python and pandas, the biggest slowdown of my approach comes from the Decimal('...') columns in combination with sum. Since I need a precision of 10 ** -18, using the decimal package is my only option, right?
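To show why I think float64 is not an option: it only carries roughly 15-17 significant digits, so an 18-decimal amount is already rounded the moment it becomes a float, while Decimal keeps every digit (a quick illustration, not part of my actual pipeline):

from decimal import Decimal

x = "0.264825920790318533"

# the binary double nearest to x is not x itself: the last decimal
# digits are already rounded away before any aggregation happens
print(Decimal(float(x)))

# decimal arithmetic stays exact: 0.529651841580637066
print(Decimal(x) + Decimal(x))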
Is there a way in pandas to do the aggregation in fewer steps?
1 Answer
Comments above aside (this precision seems unnecessary), since you have a fixed precision you're better off operating in integer math. NumPy's uint64 will not be enough to hold a sum of millions of 18-digit integers, so use Python's built-in arbitrary-precision integers instead.
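A quick illustration of the headroom problem: 2**64 is about 1.8 * 10**19, while a few million 18-digit amounts can sum to around 10**24, so a uint64 accumulator silently wraps around where an object array of Python ints does not (a minimal sketch, shrunk to 100 values):

import numpy as np

vals = [10**18] * 100  # even a hundred 18-digit amounts is enough to overflow

# uint64 arithmetic is modulo 2**64, so the sum silently wraps around
print(np.array(vals, dtype=np.uint64).sum())

# an object array falls back to Python's arbitrary-precision ints
print(np.array(vals, dtype=object).sum())  # 100000000000000000000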
You jump through a lot of hoops to convert your grouped sum to a column for each position. This seems unnecessary and you should just accept the default multi-level index instead:
from random import seed, randrange
import pandas as pd
import numpy as np
seed(0)
# 18-digit integer amounts, held as Python ints in an object array so
# that sums never overflow
amount = np.array(
    [randrange(10**18) for _ in range(6)],
    dtype=object,
)
# generating some sample DataFrame
df = pd.DataFrame(
    data={
        'game': (1, 1, 1, 1, 2, 2),
        'position': ('a', 'a', 'a', 'b', 'a', 'b'),
        'amount': amount,
    }
)
agg_df = df.groupby(['game', 'position']).amount.sum()
print(agg_df)
game  position
1     a           1051069403648066180
      b            956755259049838137
2     a            412802820832641540
      b            251847321245586963
Name: amount, dtype: object
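If you really want the wide amount_a / amount_b layout from your question, unstacking the position level gets there in one extra step, and the conversion back to Decimal can wait until after the integer sums are done. A minimal sketch (DataFrame.map assumes pandas >= 2.1; older versions spell it applymap):

from decimal import Decimal

# pivot the per-position sums into one column per position
wide = agg_df.unstack('position').add_prefix('amount_').fillna(0)
print(wide)

# convert back to 18-decimal Decimals only at the very end, once the
# heavy aggregation has been done in plain integer math
wide_eth = wide.map(lambda v: Decimal(int(v)) * Decimal(10) ** -18)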
Comments

The amounts are in Wei (1 Wei = 10 ** -18 ETH), so I am actually only looking for an error of at most 10 ** -18. The actual magnitudes and digit counts are the same as in the synthetic data.

The alternative to Decimal would be a pandas-wrapped arbitrary-precision Python integer.