
I have two dataframes I need to map over and combine into one. The first dataframe contains NBA players; its headers are 'Date', 'Player', 'Team', 'Position', 'Salary', 'Pos_ID', 'Minutes', 'FPTS', 'USG'. The second dataframe is the first one grouped by date and team, with headers 'Date', 'Team', 'Minutes', 'FGA', 'FTA', 'TO'. I'm trying to calculate the USG (usage) rate for every player in the first dataframe. To do that, I need the total minutes, field goal attempts, free throw attempts, and turnovers for each team in each game on a given date; I then divide each player's stats by his team's totals. I have a working solution, but it's really slow and doesn't seem to be the most efficient way of doing this.
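
For reference, this is the usage-rate formula that the loop below implements:

$$\mathrm{USG\%} = 100 \times \frac{\left(\mathrm{FGA} + 0.44\,\mathrm{FTA} + \mathrm{TO}\right)\times\left(\mathrm{Team\ Minutes}/5\right)}{\mathrm{Minutes}\times\left(\mathrm{Team\ FGA} + 0.44\,\mathrm{Team\ FTA} + \mathrm{Team\ TO}\right)}$$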

Here is the code:

import pandas as pd

player_df = pd.read_csv('Sample Data')  # replace with sample data file
no_dups = player_df.drop_duplicates()
no_dups.loc[:, 'USG'] = pd.Series(dtype=float)
no_dups = no_dups[no_dups.Minutes != 0]

grouped_teams = no_dups.groupby(['Date', 'Team']).agg(
    {'Minutes': ['sum'], 'FGA': ['sum'], 'FTA': ['sum'], 'TO': ['sum']}
)
grouped_teams.columns = ['Minutes', 'FGA', 'FTA', 'TO']
grouped_teams = grouped_teams.reset_index()

for index, row in no_dups.iterrows():
    for i, r in grouped_teams.iterrows():
        if (no_dups.at[index, 'Team'] == grouped_teams.at[i, 'Team']
                and no_dups.at[index, 'Date'] == grouped_teams.at[i, 'Date']):
            no_dups.at[index, 'USG'] = (
                100
                * (no_dups.at[index, 'FGA'] + 0.44*no_dups.at[index, 'FTA'] + no_dups.at[index, 'TO'])
                * (grouped_teams.at[i, 'Minutes']/5)
            ) / (
                no_dups.at[index, 'Minutes']
                * (grouped_teams.at[i, 'FGA'] + 0.44*grouped_teams.at[i, 'FTA'] + grouped_teams.at[i, 'TO'])
            )

final_df = no_dups[['Date', 'Player', 'Team', 'Position', 'Salary', 'Minutes', 'FPTS', 'USG']]
print(final_df)

I have removed all players who didn't play, and because the same player can appear in multiple contests on a single night, I drop those duplicate rows. I then create a dataframe called grouped_teams, which is the player data grouped by date and team name. I iterate over the first dataframe with iterrows, and over the second the same way, to find each player's team and specific date and divide his stats by the calculated team totals to get the usage rate, stored in no_dups.at[index, 'USG']. There are 73k rows in my dataframe, so iterating over each one takes a very long time.

Sample Data

asked Feb 18, 2021 at 0:06

1 Answer


I have two dataframes I need to map over and combine into one

Kind of not really. You have one dataframe and you need to apply what in SQL is called a windowing operation; in Pandas it's a call to transform(). Doing this properly allows the full-size dataset to be processed in about 6 ms.
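
For a minimal illustration of the difference (toy numbers, not your data): agg() collapses each group to a single row, whereas transform() broadcasts the group result back onto every original row, which is exactly the alignment your nested loop was doing by hand:

import pandas as pd

# made-up example data, just to show the shapes
toy = pd.DataFrame({'Team': ['BKN', 'BKN', 'PHO'], 'FGA': [10, 6, 8]})

# agg: one row per group
print(toy.groupby('Team')['FGA'].agg('sum'))        # BKN 16, PHO 8

# transform: the group sum, aligned back to all three original rows
print(toy.groupby('Team')['FGA'].transform('sum'))  # 16, 16, 8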

This operation:

grouped_teams.at[i, 'FGA']+0.44*grouped_teams.at[i, 'FTA']+grouped_teams.at[i, 'TO']

is better represented by a matrix product, and since you've written it out twice, put it in a utility function.
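
As a quick sanity check with made-up numbers, the dataframe @ operator takes the dot product of the three columns with the weight vector (1, 0.44, 1), producing the same series as the spelled-out arithmetic:

import pandas as pd

# hypothetical values for demonstration only
toy = pd.DataFrame({'FGA': [10.0, 6.0], 'FTA': [5.0, 2.0], 'TO': [3.0, 1.0]})

spelled_out = toy['FGA'] + 0.44*toy['FTA'] + toy['TO']
matrix_form = toy[['FGA', 'FTA', 'TO']] @ (1, 0.44, 1)

pd.testing.assert_series_equal(spelled_out, matrix_form)  # passes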

It looks like the dates haven't been properly parsed, but I don't address this in the demo.
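
If you do want them parsed, a one-liner early in process() would do it (assuming the file's dates are all in m/d/yy form, as in the printed output below; adjust the format string to match your actual data):

# assumption: every date looks like '1/31/18'
no_dups['Date'] = pd.to_datetime(no_dups['Date'], format='%m/%d/%y')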

import time

import pandas as pd


def factor_product(df: pd.DataFrame) -> pd.Series:
    return df[['FGA', 'FTA', 'TO']] @ (1, 0.44, 1) / df['Minutes']


def process(player_df: pd.DataFrame) -> pd.DataFrame:
    no_dups = player_df.drop_duplicates()
    no_dups = no_dups[no_dups['Minutes'] > 0]
    sums = no_dups.groupby(['Date', 'Team'])[[
        'Minutes', 'FGA', 'FTA', 'TO',
    ]].transform('sum')
    no_dups['USG'] = 100/5 * factor_product(no_dups)/factor_product(sums)
    return no_dups[[
        'Date', 'Player', 'Team', 'Position', 'Salary', 'Minutes', 'FPTS', 'USG',
    ]]


def regression_test() -> None:
    player_df = pd.read_csv('Sample Data.csv', nrows=100)
    final_df = process(player_df)

    # First lap
    # final_df.to_csv('reference.csv', index=False)

    reference = pd.read_csv('reference.csv')
    pd.testing.assert_frame_equal(reference, final_df)


def full_demo() -> None:
    player_df = pd.read_csv('Sample Data.csv')

    start = time.perf_counter()
    final_df = process(player_df)
    dur = time.perf_counter() - start
    print(f'{1e3*dur:.1f} ms')

    print(final_df)


if __name__ == '__main__':
    regression_test()
    full_demo()
6.1 ms
         Date                   Player Team  ...  Minutes   FPTS        USG
0      1/1/18             Allen Crabbe  BKN  ...       28  35.50  17.266390
1      1/1/18             Caris Levert  BKN  ...       27  35.25  32.796691
2      1/1/18          DeMarre Carroll  BKN  ...       29  33.50  18.446945
3      1/1/18            Jarrett Allen  BKN  ...       20  27.50  20.268724
4      1/1/18  Rondae Hollis-Jefferson  BKN  ...       38  26.25  19.280646
...       ...                      ...  ...  ...      ...    ...        ...
4488  1/31/18             Jared Dudley  PHO  ...       19  13.00   9.242619
4489  1/31/18             Troy Daniels  PHO  ...       19  13.25  16.174583
4490  1/31/18            Dragan Bender  PHO  ...       20  12.25  15.365854
4491  1/31/18            Isaiah Canaan  PHO  ...        4   6.50  32.926829
4492  1/31/18               Tyler Ulis  PHO  ...       10   5.00  21.951220

[4493 rows x 8 columns]
answered Dec 18, 2024 at 0:33
