1
\$\begingroup\$

My goal is to aggregate a DataFrame with several million rows including Decimal('...') columns without any precision loss. My current implementation works - but might there be more efficient ways to get the job done?

Original DataFrame (sample)

game position amount
1 a 0.264825920790318533
1 a 0.136255564496617048
1 a 0.708422792050782986
1 b 0.102495961805507297
2 a 0.154110321496056788
2 b 0.281335658017562252

Aggregated DataFrame (sample)

game amount_a amount_b
1 1.109504277337718567 0.102495961805507297
2 0.154110321496056788 0.281335658017562252

My current approach

import pandas as pd
import numpy as np
from decimal import Decimal
from uuid import uuid4
# generating some sample DataFrame
df = pd.DataFrame(
 data={
 "game": [1, 1, 1, 1, 2, 2],
 "position": ["a", "a", "a", "b", "a", "b"],
 "amount": [
 Decimal(f"{uuid4().int}"[:18]) * (Decimal(10) ** -18) for _ in range(6)
 ],
 }
)
# aggregating...
agg_df = df.groupby(["game", "position"], as_index=False).agg(
 {
 "amount": "sum",
 }
)
agg_df["amount_a"] = np.where(agg_df["position"] == "a", agg_df["amount"], 0)
agg_df["amount_b"] = np.where(agg_df["position"] == "b", agg_df["amount"], 0)
agg_df.drop(["amount", "position"], axis=1, inplace=True)
agg_df = agg_df.groupby(["game"], as_index=False).agg(
 {
 "amount_a": "sum",
 "amount_b": "sum",
 }
)

As far as I understand Python and Pandas, the biggest slowdown of my approach comes from the Decimal('...') columns in combination with sum. Since I need the precision of 10 ** -18 using the decimal package is my only option, right?

Is there a way in pandas to do the aggregation in fewer steps?

Reinderien
70.9k5 gold badges76 silver badges256 bronze badges
asked Dec 31, 2021 at 15:25
\$\endgroup\$
5
  • \$\begingroup\$ "Without precision loss" is a loaded statement, and implies no loss of precision whatsoever - which contradicts your other statement where you tolerate an error of at most 10e-18. Which is it? \$\endgroup\$ Commented Dec 31, 2021 at 21:05
  • \$\begingroup\$ The data you've shown are synthetic. Are the actual magnitudes and digit counts the same as your synthetic data? \$\endgroup\$ Commented Dec 31, 2021 at 21:21
  • \$\begingroup\$ Finally: what is "game" in this context? What are you actually trying to do? \$\endgroup\$ Commented Dec 31, 2021 at 21:23
  • \$\begingroup\$ yes you are right about the precision - underlying data is ETH with Wei as smallest denomination (1Wei = 1 * 10e-18 ETH). So i am actually only looking for an error of at most 10e-18. The actual magnitudes and digit counts are the same as in the synthetic data. \$\endgroup\$ Commented Jan 1, 2022 at 10:15
  • \$\begingroup\$ Keep in mind that one hundred million million Wei is currently worth about 37 US cents. Are you absolutely sure that you need this precision? If so, you basically can't use Numpy built-in types which only go to float64 or int64. About your only option better than a Pandas-wrapped Decimal would be a Pandas-wrapped arbitrary-precision Python integer. \$\endgroup\$ Commented Jan 1, 2022 at 15:44

1 Answer 1

2
\$\begingroup\$

Comments above aside (this precision seems unnecessary), since you have a fixed precision you're better off operating in integer math. Numpy's uint64 will not be enough to hold a sum of millions of 18-digit integers, so use Python's built-in arbitrary-precision integers instead.

You jump through a lot of hoops to convert your grouped sum to a column for each position. This seems unnecessary and you should just accept the default multi-level index instead:

from random import seed, randrange
import pandas as pd
import numpy as np
seed(0)
amount = np.array(
 [randrange(1e18) for _ in range(6)],
 dtype=object,
)
# generating some sample DataFrame
df = pd.DataFrame(
 data={
 'game': (1, 1, 1, 1, 2, 2),
 'position': ('a', 'a', 'a', 'b', 'a', 'b'),
 'amount': amount,
 }
)
agg_df = df.groupby(['game', 'position']).amount.sum()
print(agg_df)
game position
1 a 1051069403648066180
 b 956755259049838137
2 a 412802820832641540
 b 251847321245586963
Name: amount, dtype: object
answered Jan 1, 2022 at 16:02
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.