
I have two Pandas data frames: one with Daily data and one with Weekly data. I want to add the weekly data to each row of the daily data for each group of column A.

For example, for each row of the daily data frame from 2022-07-04 to 2022-07-09, I want to add the weekly data from 2022-07-04 for each group of column A, and so on.

The code below reproduces the desired result:

  1. Generate the data
import pandas as pd
import numpy as np
def generate_df(date_range):
    tf_dict = []
    for A in range(0, 4000):
        for d in date_range:
            tf_dict.append({
                "A": A,
                **{f"B_{i}": np.random.randint(0, 10) for i in range(1, 250)},
                "datadate": d,
            })
    return pd.DataFrame(tf_dict)
# Daily Dataframe
daily_range = pd.date_range(start='1/1/2022', end='2/15/2022', freq='D')
df_daily = generate_df(daily_range)
# Weekly Dataframe
weekly_range = pd.date_range(start='1/1/2022', end='2/15/2022', freq='W')
df_weekly = generate_df(weekly_range)
df_weekly = df_weekly.add_prefix("higher_tf_")

Note that I used a range of 4000 for A for simplicity; with real data, A is closer to 8000.

  2. Create the date range on the weekly data frame, so filtering the daily data frame is easier
dfs = []
for i, dfg in df_weekly.groupby("higher_tf_A"):
    dfg = dfg.sort_values("higher_tf_datadate")
    dfg["higher_tf_next_date"] = dfg["higher_tf_datadate"].shift(-1)
    dfs.append(dfg)
df_weekly = pd.concat(dfs)

(I added the next period's start date to the previous row, so each row carries a complete date range.)
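
As an aside, the per-group loop in this step can itself be replaced by a single vectorised `groupby().shift()` call. A minimal sketch with a small made-up frame (not the question's full data):

```python
import pandas as pd

df_weekly = pd.DataFrame({
    'higher_tf_A': [0, 0, 1, 1],
    'higher_tf_datadate': pd.to_datetime(
        ['2022-07-04', '2022-07-11', '2022-07-04', '2022-07-11']),
})

# Same result as the loop above: the next period's start date on each row,
# computed per group in one vectorised call instead of concat-ing pieces.
df_weekly = df_weekly.sort_values(['higher_tf_A', 'higher_tf_datadate'])
df_weekly['higher_tf_next_date'] = (
    df_weekly.groupby('higher_tf_A')['higher_tf_datadate'].shift(-1)
)
print(df_weekly['higher_tf_next_date'].isna().sum())  # 2 (last row of each group)
```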

  3. For each weekly row, build a mask on 'A' and the date range, then update the matching daily rows with the weekly data
%%time
for index, row in df_weekly.iterrows():
    mask = (df_daily["A"] == row["higher_tf_A"]) & \
           (df_daily["datadate"] >= row["higher_tf_datadate"]) & \
           (df_daily["datadate"] < row["higher_tf_next_date"])

    df_daily.loc[mask, row.index] = row.values

Results of %%time: CPU times: user 1min 59s, sys: 4.34 s, total: 2min 3s; Wall time: 2min 4s

How can I improve the last code to decrease execution time?

Note that the timeframes can change (e.g. hourly and daily, minutes and hours, ...).

asked Jul 9, 2022 at 11:04

1 Answer


Broadly speaking, you've missed two of the most important mantras in Pandas:

  • Don't use loops. No, seriously. Don't use loops.
  • There's a thing for that.

Your generate_df is very slow. Replace it with a single two-dimensional random-matrix initialisation in one pass. Certainly A, and probably datadate, should be index levels rather than columns. There should probably also be an index level for B, producing a dataframe with a triple-level index and one column, but that's beyond the scope of this question.

Don't use m/d/yyyy format; use ISO 8601.

Don't build up a dfs list only to call concat on it, don't call iterrows, and don't do manual date range comparisons. Make one call to merge_asof.
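
For illustration, here is a toy example (made-up frames, not the question's data) of what merge_asof does: each left row is matched with the most recent right row whose date is at or before its own, within the same 'A' group.

```python
import pandas as pd

daily = pd.DataFrame({
    'A': [0, 0, 0],
    'datadate': pd.to_datetime(['2022-07-04', '2022-07-06', '2022-07-11']),
    'B': [1, 2, 3],
})
weekly = pd.DataFrame({
    'A': [0, 0],
    'datadate': pd.to_datetime(['2022-07-04', '2022-07-11']),
    'W': [10, 20],
})

# Each daily row picks up the latest weekly row with datadate <= its own,
# per 'A' group -- no loops, no manual date-range masks.
out = pd.merge_asof(daily, weekly, by='A', on='datadate')
print(out['W'].tolist())  # [10, 10, 20]
```

Note that merge_asof requires both frames to be sorted on the `on` key, which is why the suggested code below calls sort_values first.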

Don't use np.random.randint; it belongs to NumPy's legacy random interface. Prefer a Generator from numpy.random.default_rng.

Suggested

import pandas as pd
import numpy as np
from numpy.random import default_rng

rand = default_rng(seed=0)

def generate_df(
    freq: str,
    start: str = '2022-01-01', end: str = '2022-02-15',
    n_a_vals: int = 4000, n_b_cols: int = 250,
) -> pd.DataFrame:
    date_range = pd.date_range(start=start, end=end, freq=freq, name='datadate')
    index = pd.MultiIndex.from_product((
        pd.Series(data=np.arange(n_a_vals), name='A'),
        date_range,
    ))
    return pd.DataFrame(
        rand.integers(size=(len(index), n_b_cols), low=0, high=10),
        columns=[f'B_{i}' for i in range(n_b_cols)],
        index=index,
    )

df_daily = generate_df(freq='D')
df_weekly = generate_df(freq='W')

merged = pd.merge_asof(
    left=df_daily.sort_values('datadate'),
    right=df_weekly.sort_values('datadate'),
    by='A', on='datadate',
    suffixes=('', '_higher_tf'),
).set_index(['A', 'datadate']).sort_index()
print(merged)

Output

                 B_0  B_1  ...  B_248_higher_tf  B_249_higher_tf
A    datadate              ...
0    2022-01-01    8    6  ...              NaN              NaN
     2022-01-02    2    2  ...              4.0              4.0
     2022-01-03    0    3  ...              4.0              4.0
     2022-01-04    1    7  ...              4.0              4.0
     2022-01-05    5    0  ...              4.0              4.0
...              ...  ...  ...              ...              ...
3999 2022-02-11    7    0  ...              3.0              1.0
     2022-02-12    0    5  ...              3.0              1.0
     2022-02-13    8    8  ...              5.0              1.0
     2022-02-14    5    8  ...              5.0              1.0
     2022-02-15    1    9  ...              5.0              1.0

[184000 rows x 500 columns]

Runs in about three seconds.

answered Jul 10, 2022 at 2:00
  • I'm very impressed; many thanks for your time. Haven't you ever encountered cases where you needed to loop in Pandas? May I ask you to elaborate on the index-level topic? Commented Jul 10, 2022 at 13:46
  • Haven't you ever encountered cases where you needed to loop in Pandas? - Yes, but they're rare. For all "usual", and many/most "unusual", situations, still avoid loops and use vectorised operations. Commented Jul 10, 2022 at 14:25
  • Re. index levels: chat Commented Jul 10, 2022 at 14:26
