My goal is to aggregate a DataFrame with several million rows, including Decimal('...') columns, without any precision loss. My current implementation works, but might there be a more efficient way to get the job done?
Original DataFrame (sample)
game | position | amount |
---|---|---|
1 | a | 0.264825920790318533 |
1 | a | 0.136255564496617048 |
1 | a | 0.708422792050782986 |
1 | b | 0.102495961805507297 |
2 | a | 0.154110321496056788 |
2 | b | 0.281335658017562252 |
Aggregated DataFrame (sample)
game | amount_a | amount_b |
---|---|---|
1 | 1.109504277337718567 | 0.102495961805507297 |
2 | 0.154110321496056788 | 0.281335658017562252 |
My current approach
import pandas as pd
import numpy as np
from decimal import Decimal
from uuid import uuid4
# generating some sample DataFrame
df = pd.DataFrame(
    data={
        "game": [1, 1, 1, 1, 2, 2],
        "position": ["a", "a", "a", "b", "a", "b"],
        "amount": [
            Decimal(f"{uuid4().int}"[:18]) * (Decimal(10) ** -18) for _ in range(6)
        ],
    }
)
# first aggregation: sum the amounts per (game, position)
agg_df = df.groupby(["game", "position"], as_index=False).agg(
    {
        "amount": "sum",
    }
)
# spread the per-position sums into one column per position
agg_df["amount_a"] = np.where(agg_df["position"] == "a", agg_df["amount"], 0)
agg_df["amount_b"] = np.where(agg_df["position"] == "b", agg_df["amount"], 0)
agg_df.drop(["amount", "position"], axis=1, inplace=True)
# second aggregation: collapse the helper columns to one row per game
agg_df = agg_df.groupby(["game"], as_index=False).agg(
    {
        "amount_a": "sum",
        "amount_b": "sum",
    }
)
As far as I understand Python and pandas, the biggest slowdown of my approach comes from the Decimal('...') columns in combination with sum. Since I need a precision of 10 ** -18, using the decimal package is my only option, right?
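To show why I think float64 is not an option: it only carries roughly 15-17 significant digits, so an 18-decimal amount is already rounded the moment it becomes a float, while Decimal keeps every digit (a quick illustration, not part of my actual pipeline):

from decimal import Decimal

x = "0.264825920790318533"

# the binary double nearest to x is not x itself: the last decimal
# digits are already rounded away before any aggregation happens
print(Decimal(float(x)))

# decimal arithmetic stays exact: 0.529651841580637066
print(Decimal(x) + Decimal(x))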
Is there a way in pandas to do the aggregation in fewer steps?
1 Answer
Comments above aside (this precision seems unnecessary), since you have a fixed precision you're better off operating in integer math. NumPy's uint64 will not be enough to hold a sum of millions of 18-digit integers, so use Python's built-in arbitrary-precision integers instead.
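A quick illustration of the headroom problem: 2**64 is about 1.8 * 10**19, while a few million 18-digit amounts can sum to around 10**24, so a uint64 accumulator silently wraps around where an object array of Python ints does not (a minimal sketch, shrunk to 100 values):

import numpy as np

vals = [10**18] * 100  # even a hundred 18-digit amounts is enough to overflow

# uint64 arithmetic is modulo 2**64, so the sum silently wraps around
print(np.array(vals, dtype=np.uint64).sum())

# an object array falls back to Python's arbitrary-precision ints
print(np.array(vals, dtype=object).sum())  # 100000000000000000000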
You jump through a lot of hoops to convert your grouped sum to a column for each position. This seems unnecessary and you should just accept the default multi-level index instead:
from random import seed, randrange
import pandas as pd
import numpy as np
seed(0)
# 18-digit integer amounts, held as Python ints in an object array so
# that sums never overflow
amount = np.array(
    [randrange(10**18) for _ in range(6)],
    dtype=object,
)
# generating some sample DataFrame
df = pd.DataFrame(
    data={
        'game': (1, 1, 1, 1, 2, 2),
        'position': ('a', 'a', 'a', 'b', 'a', 'b'),
        'amount': amount,
    }
)
agg_df = df.groupby(['game', 'position']).amount.sum()
print(agg_df)
game  position
1     a           1051069403648066180
      b            956755259049838137
2     a            412802820832641540
      b            251847321245586963
Name: amount, dtype: object
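If you really want the wide amount_a / amount_b layout from your question, unstacking the position level gets there in one extra step, and the conversion back to Decimal can wait until after the integer sums are done. A minimal sketch (DataFrame.map assumes pandas >= 2.1; older versions spell it applymap):

from decimal import Decimal

# pivot the per-position sums into one column per position
wide = agg_df.unstack('position').add_prefix('amount_').fillna(0)
print(wide)

# convert back to 18-decimal Decimals only at the very end, once the
# heavy aggregation has been done in plain integer math
wide_eth = wide.map(lambda v: Decimal(int(v)) * Decimal(10) ** -18)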
Comments

The amounts are in Wei (1 Wei = 10 ** -18 ETH), so I am actually only looking for an error of at most 10 ** -18. The actual magnitudes and digit counts are the same as in the synthetic data.

The alternative to Decimal would be a pandas-wrapped arbitrary-precision Python integer.