
I have two Pandas data frames: one with Daily data and one with Weekly data. I want to add the weekly data to each row of the daily data for each group of column A.

For example, for each row of the daily data frame from 2022-07-04 to 2022-07-09, I want to add the weekly data from 2022-07-04 for each group of column A, and so on.

The code below reproduces the desired result:

  1. Generate the data
import pandas as pd
import numpy as np
def generate_df(date_range):
    tf_dict = []
    for A in range(0, 4000):
        for d in date_range:
            tf_dict.append({
                "A": A,
                **{f"B_{i}": np.random.randint(0, 10) for i in range(1, 250)},
                "datadate": d,
            })
    return pd.DataFrame(tf_dict)
# Daily Dataframe
daily_range = pd.date_range(start='1/1/2022', end='2/15/2022', freq='D')
df_daily = generate_df(daily_range)
# Weekly Dataframe
weekly_range = pd.date_range(start='1/1/2022', end='2/15/2022', freq='W')
df_weekly = generate_df(weekly_range)
df_weekly = df_weekly.add_prefix("higher_tf_")

Note that I used a range of 4000 for A for simplicity; with real data, A is closer to 8000.

  2. Create the date range on the weekly data frame, so filtering the daily data frame is easier
dfs = []
for i, dfg in df_weekly.groupby("higher_tf_A"):
    dfg = dfg.sort_values("higher_tf_datadate")
    dfg["higher_tf_next_date"] = dfg["higher_tf_datadate"].shift(-1)
    dfs.append(dfg)
df_weekly = pd.concat(dfs)

(I added the next period's start date to the previous row, so each row carries a complete date range.)
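
As an aside, the per-group loop in this step can itself be replaced by a single vectorised `groupby().shift()` call. A minimal sketch with a small made-up frame (not the question's full data):

```python
import pandas as pd

df_weekly = pd.DataFrame({
    'higher_tf_A': [0, 0, 1, 1],
    'higher_tf_datadate': pd.to_datetime(
        ['2022-07-04', '2022-07-11', '2022-07-04', '2022-07-11']),
})

# Same result as the loop above: the next period's start date on each row,
# computed per group in one vectorised call instead of concat-ing pieces.
df_weekly = df_weekly.sort_values(['higher_tf_A', 'higher_tf_datadate'])
df_weekly['higher_tf_next_date'] = (
    df_weekly.groupby('higher_tf_A')['higher_tf_datadate'].shift(-1)
)
print(df_weekly['higher_tf_next_date'].isna().sum())  # 2 (last row of each group)
```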

  3. For each weekly row, build a mask on 'A' and the date range, then update the matching daily rows with the weekly data
%%time
for index, row in df_weekly.iterrows():
    mask = (df_daily["A"] == row["higher_tf_A"]) & \
           (df_daily["datadate"] >= row["higher_tf_datadate"]) & \
           (df_daily["datadate"] < row["higher_tf_next_date"])

    df_daily.loc[mask, row.index] = row.values

Results of %%time: CPU times: user 1min 59s, sys: 4.34 s, total: 2min 3s; Wall time: 2min 4s

How can I improve the last code to decrease execution time?

Note that the timeframes can change (e.g. hourly and daily, minutes and hours, ...).

asked Jul 9, 2022 at 11:04

1 Answer


Broadly speaking, you've missed two of the most important mantras in Pandas:

  • Don't use loops. No, seriously. Don't use loops.
  • There's a thing for that.

Your generate_df is very slow. Replace it with a single two-dimensional random-matrix initialisation in one pass. Certainly A, and probably datadate, should be index levels rather than columns. There should probably also be an index level for B, producing a dataframe with a triple-level index and one column, but that's beyond the scope of this question.

Don't use m/d/yyyy format; use ISO 8601.

Don't build up a dfs list only to call concat on it, don't call iterrows, and don't do manual date range comparisons. Make one call to merge_asof.
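
For illustration, here is a toy example (made-up frames, not the question's data) of what merge_asof does: each left row is matched with the most recent right row whose date is at or before its own, within the same 'A' group.

```python
import pandas as pd

daily = pd.DataFrame({
    'A': [0, 0, 0],
    'datadate': pd.to_datetime(['2022-07-04', '2022-07-06', '2022-07-11']),
    'B': [1, 2, 3],
})
weekly = pd.DataFrame({
    'A': [0, 0],
    'datadate': pd.to_datetime(['2022-07-04', '2022-07-11']),
    'W': [10, 20],
})

# Each daily row picks up the latest weekly row with datadate <= its own,
# per 'A' group -- no loops, no manual date-range masks.
out = pd.merge_asof(daily, weekly, by='A', on='datadate')
print(out['W'].tolist())  # [10, 10, 20]
```

Note that merge_asof requires both frames to be sorted on the `on` key, which is why the suggested code below calls sort_values first.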

Don't use np.random.randint; it belongs to NumPy's legacy random interface. Prefer a Generator from numpy.random.default_rng.

Suggested

import pandas as pd
import numpy as np
from numpy.random import default_rng

rand = default_rng(seed=0)

def generate_df(
    freq: str,
    start: str = '2022-01-01', end: str = '2022-02-15',
    n_a_vals: int = 4000, n_b_cols: int = 250,
) -> pd.DataFrame:
    date_range = pd.date_range(start=start, end=end, freq=freq, name='datadate')
    index = pd.MultiIndex.from_product((
        pd.Series(data=np.arange(n_a_vals), name='A'),
        date_range,
    ))
    return pd.DataFrame(
        rand.integers(size=(len(index), n_b_cols), low=0, high=10),
        columns=[f'B_{i}' for i in range(n_b_cols)],
        index=index,
    )

df_daily = generate_df(freq='D')
df_weekly = generate_df(freq='W')

merged = pd.merge_asof(
    left=df_daily.sort_values('datadate'),
    right=df_weekly.sort_values('datadate'),
    by='A', on='datadate',
    suffixes=('', '_higher_tf'),
).set_index(['A', 'datadate']).sort_index()
print(merged)

Output

                 B_0  B_1  ...  B_248_higher_tf  B_249_higher_tf
A    datadate              ...
0    2022-01-01    8    6  ...              NaN              NaN
     2022-01-02    2    2  ...              4.0              4.0
     2022-01-03    0    3  ...              4.0              4.0
     2022-01-04    1    7  ...              4.0              4.0
     2022-01-05    5    0  ...              4.0              4.0
...              ...  ...  ...              ...              ...
3999 2022-02-11    7    0  ...              3.0              1.0
     2022-02-12    0    5  ...              3.0              1.0
     2022-02-13    8    8  ...              5.0              1.0
     2022-02-14    5    8  ...              5.0              1.0
     2022-02-15    1    9  ...              5.0              1.0

[184000 rows x 500 columns]

Runs in about three seconds.

answered Jul 10, 2022 at 2:00
  • I'm very impressed; many thanks for your time. Haven't you ever encountered cases where you needed to loop in Pandas? May I ask you to elaborate on the index-level topic? Commented Jul 10, 2022 at 13:46
  • Haven't you ever encountered cases where you needed to loop in Pandas? - Yes, but they're rare. For all "usual", and many/most "unusual", situations, still avoid loops and use vectorised operations. Commented Jul 10, 2022 at 14:25
  • Re. index levels: chat Commented Jul 10, 2022 at 14:26
