I have two Pandas data frames: one with Daily data and one with Weekly data. I want to add the weekly data to each row of the daily data for each group of column A.
For example, for each row in the daily data frame from 2022-07-04 to 2022-07-09, I want to add the weekly data from 2022-07-04 for each group of column A, and so on.
The code below reproduces the desired result:
- Generate the data
import pandas as pd
import numpy as np
def generate_df(date_range):
    tf_dict = []
    for A in range(0,4000):
        for d in date_range:
            tf_dict.append({"A":A,
                            **{f"B_{i}":np.random.randint(0,10) for i in range(1,250)},
                            "datadate": d})
    return pd.DataFrame(tf_dict)
# Daily Dataframe
daily_range = pd.date_range(start='1/1/2022', end='2/15/2022', freq='D')
df_daily = generate_df(daily_range)
# Weekly Dataframe
weekly_range = pd.date_range(start='1/1/2022', end='2/15/2022', freq='W')
df_weekly = generate_df(weekly_range)
df_weekly = df_weekly.add_prefix("higher_tf_")
Note that I took a range of 4000 for A for simplification. With real data, A is close to 8000.
- Create the date range on the weekly data frame so it makes it easier to filter on the daily data frame
dfs = []
for i, dfg in df_weekly.groupby("higher_tf_A"):
    dfg = dfg.sort_values("higher_tf_datadate")
    dfg["higher_tf_next_date"] = dfg["higher_tf_datadate"].shift(-1)
    dfs.append(dfg)
df_weekly = pd.concat(dfs)
(I added the next date to the previous row, so I have a date range on the same row)
- For each weekly row, create a mask on 'A' and the date range. Then update the matching daily rows with the weekly data.
%%time
for index, row in df_weekly.iterrows():
    mask = (df_daily["A"]==row["higher_tf_A"]) & \
           (df_daily['datadate'] >= row['higher_tf_datadate']) & \
           (df_daily['datadate'] < row['higher_tf_next_date'])
    df_daily.loc[mask, row.index] = row.values
Results of %%time:
CPU times: user 1min 59s, sys: 4.34 s, total: 2min 3s
Wall time: 2min 4s
How can I improve the last code to decrease execution time?
Note that the timeframes can change (e.g. Hourly and daily, minutes and hours, ...)
1 Answer
Broadly speaking, you've missed two of the most important mantras in Pandas:
- Don't use loops. No, seriously. Don't use loops.
- There's a thing for that.
Your generate_df is very slow. Replace it with a two-dimensional random matrix initialisation in one pass. Certainly A, and probably datadate, should be index levels and not columns. There should probably also be an index level for B, producing a dataframe with a triple-level index and one column, but that's beyond the scope of this question.
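As a rough illustration of the index-level idea only (it isn't needed for the merge), a wide frame indexed by A and datadate can be reshaped so that the B column label becomes a third index level. The tiny frame and the names wide, long_df and value are made up for this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical tiny frame: 2 values of A, 3 dates, 2 B columns
index = pd.MultiIndex.from_product(
    (
        pd.Index([0, 1], name='A'),
        pd.date_range('2022-01-01', periods=3, freq='D', name='datadate'),
    )
)
wide = pd.DataFrame(
    np.arange(12).reshape(6, 2),
    columns=pd.Index(['B_0', 'B_1'], name='B'),
    index=index,
)

# stack() moves the column labels into a third index level,
# leaving a single column of values
long_df = wide.stack().to_frame(name='value')
print(long_df)
```

Whether this long layout is actually preferable depends on what is done with the B values afterwards.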
Don't use m/d/yyyy format; use ISO 8601.
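A small sketch of why the slash format is risky: the same string can be read two ways depending on a parsing option, while the ISO 8601 form has a single reading. The variable names are only for illustration.

```python
import pandas as pd

# Ambiguous: pandas reads this month-first by default...
month_first = pd.to_datetime('1/2/2022')
# ...but day-first when asked, so the same string can mean two different days
day_first = pd.to_datetime('1/2/2022', dayfirst=True)

# ISO 8601 (year-month-day) has a single reading
iso = pd.Timestamp('2022-01-02')

print(month_first.date(), day_first.date(), iso.date())
```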
Don't build up a dfs list only to call concat on it, don't call iterrows, and don't do manual date range comparisons. Make one call to merge_asof.
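To see what merge_asof does here, consider a tiny made-up example: each daily row picks up the most recent weekly row with the same A whose date is not later than its own. The frames and the column name w are hypothetical.

```python
import pandas as pd

daily = pd.DataFrame({
    'A': [0, 0, 0, 1, 1, 1],
    'datadate': pd.to_datetime(
        ['2022-01-02', '2022-01-05', '2022-01-09'] * 2
    ),
})
weekly = pd.DataFrame({
    'A': [0, 0, 1, 1],
    'datadate': pd.to_datetime(
        ['2022-01-02', '2022-01-09', '2022-01-02', '2022-01-09']
    ),
    'w': [10, 20, 30, 40],
})

# Both frames must be sorted by the "on" key;
# "by" restricts matches to rows with equal A
merged_small = pd.merge_asof(
    left=daily.sort_values('datadate'),
    right=weekly.sort_values('datadate'),
    on='datadate', by='A',
).sort_values(['A', 'datadate'])
print(merged_small)
```

The row for 2022-01-05 has no weekly row on the same date, so it takes the one from 2022-01-02. That is the same result the manual ">= date and < next date" mask produces.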
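Don't use np.random.randint; it belongs to NumPy's legacy random API, and the NumPy documentation recommends Generator.integers (from default_rng) for new code.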
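A minimal sketch of the Generator-based replacement; the matrix size and the names rng and values are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seeded so that runs are reproducible

# One call fills the whole matrix; high is exclusive, like randint(0, 10)
values = rng.integers(low=0, high=10, size=(5, 3))
print(values.shape)
```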
Suggested
import pandas as pd
import numpy as np
from numpy.random import default_rng
rand = default_rng(seed=0)
def generate_df(
    freq: str,
    start: str = '2022-01-01', end: str = '2022-02-15',
    n_a_vals: int = 4000, n_b_cols: int = 250,
) -> pd.DataFrame:
    date_range = pd.date_range(start=start, end=end, freq=freq, name='datadate')
    index = pd.MultiIndex.from_product((
        pd.Series(data=np.arange(n_a_vals), name='A'),
        date_range,
    ))
    return pd.DataFrame(
        rand.integers(size=(len(index), n_b_cols), low=0, high=10),
        columns=[f'B_{i}' for i in range(n_b_cols)],
        index=index,
    )
df_daily = generate_df(freq='D')
df_weekly = generate_df(freq='W')
merged = pd.merge_asof(
    left=df_daily.sort_values('datadate'),
    right=df_weekly.sort_values('datadate'),
    by='A', on='datadate',
    suffixes=('', '_higher_tf'),
).set_index(['A', 'datadate']).sort_index()
print(merged)
Output
                 B_0  B_1  ...  B_248_higher_tf  B_249_higher_tf
A    datadate              ...
0    2022-01-01    8    6  ...              NaN              NaN
     2022-01-02    2    2  ...              4.0              4.0
     2022-01-03    0    3  ...              4.0              4.0
     2022-01-04    1    7  ...              4.0              4.0
     2022-01-05    5    0  ...              4.0              4.0
...              ...  ...  ...              ...              ...
3999 2022-02-11    7    0  ...              3.0              1.0
     2022-02-12    0    5  ...              3.0              1.0
     2022-02-13    8    8  ...              5.0              1.0
     2022-02-14    5    8  ...              5.0              1.0
     2022-02-15    1    9  ...              5.0              1.0

[184000 rows x 500 columns]
Runs in about three seconds.
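The question notes that the timeframes can change. As a hedged sketch, the same merge_asof call should carry over to, say, hourly against daily data, with only the frequency strings changing. The helper make_frame and the day column below are made up for this example.

```python
import pandas as pd

def make_frame(freq: str) -> pd.DataFrame:
    # Hypothetical helper: two groups of A over two days at the given frequency
    dates = pd.date_range('2022-01-01', '2022-01-02 23:00', freq=freq)
    index = pd.MultiIndex.from_product(([0, 1], dates), names=['A', 'datadate'])
    return index.to_frame(index=False)

hourly = make_frame('h')
daily = make_frame('D')
daily['day'] = daily['datadate']  # copy of the key so the matched day stays visible

merged_hourly = pd.merge_asof(
    left=hourly.sort_values('datadate'),
    right=daily.sort_values('datadate'),
    on='datadate', by='A',
)

# Every hourly row should have picked up the daily row of its own calendar day
same_day = merged_hourly['datadate'].dt.normalize() == merged_hourly['day']
print(same_day.all())
```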
- I'm very impressed; many thanks for your time. Haven't you ever encountered cases where you needed to loop in pandas? May I ask you to elaborate on the index level topic? – Begoodpy, Jul 10, 2022 at 13:46
- "Haven't you ever encountered cases where you needed to loop in pandas?" Yes, but they're rare. For all "usual", and many/most "unusual" situations, still avoid loops and use vectorised operations. – Reinderien, Jul 10, 2022 at 14:25
- Re. index levels: chat – Reinderien, Jul 10, 2022 at 14:26