Improve performance when updating DataFrame rows based on complex criteria

Question 1

My question got rejected the last time so I am trying a better approach to getting a solution:

df.head:
 predicted_u4 u_2_5_weight predicted_o2.5_n predicted_score_difference dnb_weight total_score o_1_5_weight predicted_total_score away_score predicted_bttsu2.5_n home_score btts_u_2_5_weight result_match selection_n o_2_5_weight btts_o_2_5_weight predicted_bttso2.5_n win_weight predicted_result predicted_btts_n selection_match_n u_4_5_weight btts_weight predicted_u2.5_n result
0 0.530389 0.4 0.697917 0.881006 0.7 4 3.2 3.540952 3 0.08308 1 0.4 no match O 2.5 (untested) 0.40 0.40 0.536766 1.1 home 0.618518 match 0.4 0.40 0.291228 away
1 0.530389 0.4 0.697917 0.881006 0.7 4 3.2 3.540952 3 0.08308 1 0.4 no match O 2.5 (untested) 0.40 0.40 0.536766 1.1 home 0.618518 match 0.4 0.40 0.291228 away
2 0.743486 0.4 0.477249 0.229046 0.7 2 3.2 2.458867 0 0.13194 2 0.4 match U 2.5 (untested) 0.48 0.40 0.397920 1.1 home 0.531042 match 0.4 0.54 0.529926 home
3 0.743486 0.4 0.477249 0.229046 0.7 2 3.2 2.458867 0 0.13194 2 0.4 match U 2.5 (untested) 0.48 0.40 0.397920 1.1 home 0.531042 match 0.4 0.54 0.529926 home
4 0.752334 0.4 0.532446 0.357271 0.7 1 3.2 2.599825 0 0.06794 1 0.4 match U 2.5 (untested) 0.54 0.44 0.435302 1.1 home 0.516939 match 0.4 0.52 0.480485 home
df.shape[0]:
2437086

I am trying a function to update the rows using:

def selection_n(row):
    if (row["win_weight"] == 1.1 or row["btts_o_2_5_weight"] == 0.4) and row["predicted_score_difference"] > row["win_weight"] and row["predicted_bttso2.5_n"] > row["btts_o_2_5_weight"]:
        return "W & BTTS O 2.5 (untested)"
    elif row["predicted_score_difference"] > row["win_weight"] and row["predicted_bttso2.5_n"] > row["btts_o_2_5_weight"]:
        return "W & BTTS O 2.5"
    if (row["win_weight"] == 1.1 or row["btts_weight"] == 0.4) and row["predicted_score_difference"] > row["win_weight"] and row["predicted_btts_n"] > row["btts_weight"]:
        return "W & BTTS (untested)"
    elif row["predicted_score_difference"] > row["win_weight"] and row["predicted_btts_n"] > row["btts_weight"]:
        return "W & BTTS"
    if (row["win_weight"] == 1.1 or row["o_2_5_weight"] == 0.4) and row["predicted_score_difference"] > row["win_weight"] and row["predicted_o2.5_n"] > row["o_2_5_weight"]:
        return "W & O 2.5 (untested)"
    elif row["predicted_score_difference"] > row["win_weight"] and row["predicted_o2.5_n"] > row["o_2_5_weight"]:
        return "W & O 2.5"
    if (row["win_weight"] == 1.1 or row["o_1_5_weight"] == 3.2) and row["predicted_score_difference"] > row["win_weight"] and row["predicted_total_score"] > row["o_1_5_weight"]:
        return "W & O 1.5 (untested)"
    elif row["predicted_score_difference"] > row["win_weight"] and row["predicted_total_score"] > row["o_1_5_weight"]:
        return "W & O 1.5"
    if (row["win_weight"] == 1.1 or row["u_2_5_weight"] == 0.4) and row["predicted_score_difference"] > row["win_weight"] and row["predicted_u2.5_n"] > row["u_2_5_weight"]:
        return "W & U 2.5 (untested)"
    elif row["predicted_score_difference"] > row["win_weight"] and row["predicted_u2.5_n"] > row["u_2_5_weight"]:
        return "W & U 2.5"
    if (row["win_weight"] == 1.1 or row["u_4_5_weight"] == 0.4) and row["predicted_score_difference"] > row["win_weight"] and row["predicted_u4"] > row["u_4_5_weight"]:
        return "W & U 4.5 (untested)"
    elif row["predicted_score_difference"] > row["win_weight"] and row["predicted_u4"] > row["u_4_5_weight"]:
        return "W & U 4.5"
    if row["win_weight"] == 1.1 and row["predicted_score_difference"] > row["win_weight"]:
        return "W (untested)"
    elif row["predicted_score_difference"] > row["win_weight"]:
        return "W"
    if row["o_2_5_weight"] == 0.4 and row["predicted_o2.5_n"] > row["o_2_5_weight"]:
        return "O 2.5 (untested)"
    elif row["predicted_o2.5_n"] > row["o_2_5_weight"]:
        return "O 2.5"
    if row["btts_o_2_5_weight"] == 0.4 and row["predicted_bttso2.5_n"] > row["btts_o_2_5_weight"]:
        return "BTTS O 2.5 (untested)"
    elif row["predicted_bttso2.5_n"] > row["btts_o_2_5_weight"]:
        return "BTTS O 2.5"
    if row["btts_weight"] == 0.4 and row["predicted_btts_n"] > row["btts_weight"]:
        return "BTTS (untested)"
    elif row["predicted_btts_n"] > row["btts_weight"]:
        return "BTTS"
    if row["u_2_5_weight"] == 0.4 and row["predicted_u2.5_n"] > row["u_2_5_weight"]:
        return "U 2.5 (untested)"
    elif row["predicted_u2.5_n"] > row["u_2_5_weight"]:
        return "U 2.5"
    if row["dnb_weight"] == 0.7 and row["dnb_weight"] < row["predicted_score_difference"] < row["win_weight"]:
        return "DNB (untested)"
    elif row["dnb_weight"] < row["predicted_score_difference"] < row["win_weight"]:
        return "DNB"
    if row["u_4_5_weight"] == 0.4 and row["predicted_u4"] > row["u_4_5_weight"]:
        return "U 4.5 (untested)"
    elif row["predicted_u4"] > row["u_4_5_weight"]:
        return "U 4.5"
    if (row["o_1_5_weight"] == 0.4 or row["u_4_5_weight"] == 0.4) and row["predicted_total_score"] > row["o_1_5_weight"] and row["predicted_u4"] > row["u_4_5_weight"]:
        return "O 1.5 and U 4.5 (untested)"
    elif row["predicted_total_score"] > row["o_1_5_weight"] and row["predicted_u4"] > row["u_4_5_weight"]:
        return "O 1.5 and U 4.5"
    if row["btts_u_2_5_weight"] == 0.4 and row["predicted_bttsu2.5_n"] > row["btts_u_2_5_weight"]:
        return "U 2.5 & BTTS (untested)"
    elif row["btts_u_2_5_weight"] == 0.4 and row["predicted_bttsu2.5_n"] > row["btts_u_2_5_weight"]:
        return "U 2.5 & BTTS"

def selection_match_n(row):
    if pd.isna(row["home_score"]) or pd.isna(row["away_score"]):
        return "no_result"
    if pd.isnull(row["selection_n"]):
        return "no sel."
    if row["result_match"] == 'match' and row["predicted_result"] != 'draw' and row["home_score"] > 0 and row["away_score"] > 0 and row["total_score"] > 2 and (row["selection_n"] == 'W & BTTS O 2.5' or row["selection_n"] == 'W & BTTS O 2.5 (untested)'):
        return "match"
    if row["result_match"] == 'match' and row["predicted_result"] != 'draw' and row["home_score"] > 0 and row["away_score"] > 0 and (row["selection_n"] == 'W & BTTS' or row["selection_n"] == 'W & BTTS (untested)'):
        return "match"
    if row["result_match"] == 'match' and row["predicted_result"] != 'draw' and row["total_score"] > 2 and (row["selection_n"] == 'W & O 2.5' or row["selection_n"] == 'W & O 2.5 (untested)'):
        return "match"
    if row["result_match"] == 'match' and row["predicted_result"] != 'draw' and row["total_score"] > 1 and (row["selection_n"] == 'W & O 1.5' or row["selection_n"] == 'W & O 1.5 (untested)'):
        return "match"
    if row["result_match"] == 'match' and row["predicted_result"] != 'draw' and row["total_score"] < 3 and (row["selection_n"] == 'W & U 2.5' or row["selection_n"] == 'W & U 2.5 (untested)'):
        return "match"
    if row["result_match"] == 'match' and row["predicted_result"] != 'draw' and row["total_score"] < 5 and (row["selection_n"] == "W & U 4.5" or row["selection_n"] == "W & U 4.5 (untested)"):
        return "match"
    if row["result_match"] == 'match' and row["predicted_result"] != 'draw' and (row["selection_n"] == "W" or row["selection_n"] == "W (untested)"):
        return "match"
    if row["total_score"] > 2 and (row["selection_n"] == 'O 2.5' or row["selection_n"] == 'O 2.5 (untested)'):
        return "match"
    if row["home_score"] > 0 and row["away_score"] > 0 and row["total_score"] > 2 and (row["selection_n"] == 'BTTS O 2.5' or row["selection_n"] == 'BTTS O 2.5 (untested)'):
        return "match"
    if row["home_score"] > 0 and row["away_score"] > 0 and (row["selection_n"] == 'BTTS' or row["selection_n"] == 'BTTS (untested)'):
        return "match"
    if row["total_score"] < 3 and (row["selection_n"] == 'U 2.5' or row["selection_n"] == 'U 2.5 (untested)'):
        return "match"
    if (row["result_match"] == 'match' or row["result"] == 'draw' or row["predicted_result"] == 'draw') and (row["selection_n"] == "DNB" or row["selection_n"] == "DNB (untested)"):
        return "match"
    if row["total_score"] < 5 and (row["selection_n"] == 'U 4.5' or row["selection_n"] == 'U 4.5 (untested)'):
        return "match"
    if 1 < row["total_score"] < 5 and (row["selection_n"] == 'O 1.5 and U 4.5' or row["selection_n"] == 'O 1.5 and U 4.5 (untested)'):
        return "match"
    if row["home_score"] > 0 and row["away_score"] > 0 and row["total_score"] < 3 and (row["selection_n"] == 'U 2.5 & BTTS' or row["selection_n"] == 'U 2.5 & BTTS (untested)'):
        return "match"
    else:
        return "no match"

def selection_update_n(row):
    if row["selection_match_n"] == 'no match' and (row["selection_n"] == 'W & BTTS O 2.5' or row["selection_n"] == 'W & BTTS O 2.5 (untested)'):
        if row["result_match"] == 'no match' and row["predicted_result"] != 'draw':
            row["win_weight"] += 0.02
        elif (row["home_score"] == 0 or row["away_score"] == 0) and row["total_score"] < 3:
            row["btts_o_2_5_weight"] += 0.02
        elif row["home_score"] == 0 or row["away_score"] == 0:
            row["btts_o_2_5_weight"] += 0.02
    if row["selection_match_n"] == 'no match' and (row["selection_n"] == 'W & BTTS' or row["selection_n"] == 'W & BTTS (untested)') and row["result_match"] == 'no match' and row["predicted_result"] != 'draw':
        if row["home_score"] > 0 and row["away_score"] > 0:
            row["win_weight"] += 0.02
        elif (row["home_score"] == 0 or row["away_score"] == 0):
            row["win_weight"] += 0.02
            row["btts_weight"] += 0.02
    elif row["selection_match_n"] == 'no match' and (row["selection_n"] == 'W & BTTS' or row["selection_n"] == 'W & BTTS (untested)') and row["result_match"] == 'match' and row["predicted_result"] != 'draw' and (row["home_score"] == 0 or row["away_score"] == 0):
        row["btts_weight"] += 0.02
    if row["selection_match_n"] == 'no match' and (row["selection_n"] == 'W & O 2.5' or row["selection_n"] == 'W & O 2.5 (untested)') and row["result_match"] == 'no match' and row["predicted_result"] != 'draw':
        if row["total_score"] > 2:
            row["win_weight"] += 0.02
        elif row["total_score"] < 3:
            row["win_weight"] += 0.02
            row["o_2_5_weight"] += 0.02
    elif row["selection_match_n"] == 'no match' and (row["selection_n"] == 'W & O 2.5' or row["selection_n"] == 'W & O 2.5 (untested)') and row["result_match"] == 'match' and row["predicted_result"] != 'draw' and row["total_score"] < 3:
        row["o_2_5_weight"] += 0.02
    if row["selection_match_n"] == 'no match' and (row["selection_n"] == 'W & O 1.5' or row["selection_n"] == 'W & O 1.5 (untested)') and row["result_match"] == 'no match' and row["predicted_result"] != 'draw':
        if row["total_score"] > 1:
            row["win_weight"] += 0.02
        else:
            row["win_weight"] += 0.02
            row["o_1_5_weight"] += 0.02
    elif row["selection_match_n"] == 'no match' and (row["selection_n"] == 'W & O 1.5' or row["selection_n"] == 'W & O 1.5 (untested)') and row["result_match"] == 'match' and row["predicted_result"] != 'draw' and row["total_score"] < 2:
        row["o_1_5_weight"] += 0.02
    if row["selection_match_n"] == 'no match' and (row["selection_n"] == 'W & U 2.5' or row["selection_n"] == 'W & U 2.5 (untested)') and row["result_match"] == 'no match' and row["predicted_result"] != 'draw':
        if row["total_score"] < 3:
            row["win_weight"] += 0.02
        else:
            row["win_weight"] += 0.02
            row["u_2_5_weight"] += 0.02
    elif row["selection_match_n"] == 'no match' and (row["selection_n"] == 'W & U 2.5' or row["selection_n"] == 'W & U 2.5 (untested)') and row["result_match"] == 'match' and row["predicted_result"] != 'draw' and row["total_score"] > 2:
        row["u_2_5_weight"] += 0.02
    if row["selection_match_n"] == 'no match' and (row["selection_n"] == "W & U 4.5" or row["selection_n"] == "W & U 4.5 (untested)") and row["result_match"] == 'no match' and row["predicted_result"] != 'draw':
        if row["total_score"] < 5:
            row["win_weight"] += 0.02
        else:
            row["win_weight"] += 0.02
            row["u_4_5_weight"] += 0.02
    elif row["selection_match_n"] == 'no match' and (row["selection_n"] == "W & U 4.5" or row["selection_n"] == "W & U 4.5 (untested)") and row["result_match"] == 'match' and row["predicted_result"] != 'draw' and row["total_score"] > 4:
        row["u_4_5_weight"] += 0.02
    if row["selection_match_n"] == 'no match' and (row["selection_n"] == "W" or row["selection_n"] == "W (untested)") and row["result_match"] == 'no match' and row["predicted_result"] != 'draw':
        row["win_weight"] += 0.02
    if row["selection_match_n"] == 'no match' and (row["selection_n"] == "W" or row["selection_n"] == "W (untested)") and row["result_match"] == 'no match' and row["predicted_result"] != 'draw':
        row["win_weight"] += 0.02
    if row["selection_match_n"] == 'no match' and (row["selection_n"] == 'O 2.5' or row["selection_n"] == 'O 2.5 (untested)') and row["total_score"] < 3:
        row["o_2_5_weight"] += 0.02
    if row["selection_match_n"] == 'no match' and (row["selection_n"] == 'BTTS O 2.5' or row["selection_n"] == 'BTTS O 2.5 (untested)') and (row["home_score"] == 0 or row["away_score"] == 0 or row["total_score"] < 3):
        row["btts_o_2_5_weight"] += 0.02
    if row["selection_match_n"] == 'no match' and (row["selection_n"] == 'BTTS' or row["selection_n"] == 'BTTS (untested)') and (row["home_score"] == 0 or row["away_score"] == 0):
        row["btts_weight"] += 0.02
    if row["selection_match_n"] == 'no match' and (row["selection_n"] == 'U 2.5' or row["selection_n"] == 'U 2.5 (untested)') and row["total_score"] > 2:
        row["u_2_5_weight"] += 0.02
    if row["selection_match_n"] == 'no match' and (row["selection_n"] == "DNB" or row["selection_n"] == "DNB (untested)") and row["predicted_result"] != 'draw' and (row["result_match"] != 'no match' or row["result"] != 'draw'):
        row["dnb_weight"] += 0.02
    if row["selection_match_n"] == 'no match' and (row["selection_n"] == 'U 4.5' or row["selection_n"] == 'U 4.5 (untested)') and row["total_score"] > 4:
        row["u_4_5_weight"] += 0.02
    if row["selection_match_n"] == 'no match' and (row["selection_n"] == 'O 1.5 and U 4.5' or row["selection_n"] == 'O 1.5 and U 4.5 (untested)') and (row["total_score"] < 2 or row["total_score"] > 4):
        row["o_1_5_weight"] += 0.02
        row["u_4_5_weight"] += 0.02
    if row["selection_match_n"] == 'no match' and (row["selection_n"] == 'U 2.5 & BTTS' or row["selection_n"] == 'U 2.5 & BTTS (untested)') and (row["home_score"] == 0 or row["away_score"] == 0 or row["total_score"] > 2):
        row["btts_u_2_5_weight"] += 0.0
    return row

I am trying various approaches to improve the performance. Note that I cannot use modin: I work in a PyCharm environment and get init errors with both ray and dask, so I cannot exploit multicore processing.

timeit.timeit(lambda: df.apply(selection_n, axis=1), number=10):
173.67167650000192
timeit.timeit(lambda: df.apply(selection_match_n, axis=1), number=10):
112.6237928000046
timeit.timeit(lambda: df.apply(selection_update_n, axis=1), number=10):
160.64576310000848

To explain what I am trying to accomplish: the selections in the selection_n column are derived from the _weight columns. Each selection is checked against the actual result, the weights of every "no match" row are updated, and those rows are checked again. This loop continues until every "no match" entry has turned into a "match". Because the DataFrame is large, the loop is very time-consuming (the last run took 4 days to complete).

Based on the "law of increasing returns", I have tried the following approach, which tends to work better: the number of "no match" rows shrinks with every iteration, so .apply has fewer rows to process each time:

loop_counter = 0
while (df["selection_match_n"] == "no match").any():
    start_time = time.time()
    loop_counter += 1
    print(f"Iteration: {loop_counter}")
    df['selection_n'] = df.swifter.apply(selection_n, axis=1)
    # Splitting the DataFrame
    no_match_rows = df[df['selection_match_n'] == 'no match']
    other_rows = df[df['selection_match_n'] != 'no match']
    # Process the no_match_rows DataFrame
    no_match_rows['selection_n'] = no_match_rows.swifter.apply(selection_n, axis=1)
    no_match_rows = no_match_rows.swifter.apply(selection_update_n, axis=1)
    no_match_rows['selection_match_n'] = no_match_rows.swifter.apply(selection_match_n, axis=1)
    print('Count of Selection: no_match rows:', (no_match_rows["selection_match_n"] == "no match").sum())
    # Concatenate the modified no_match_rows back with other_rows
    df = pd.concat([other_rows, no_match_rows])

I have also tried swifter, which does not help much.

What is the best way to improve the performance of these functions? I think vectorisation, possibly coupled with Cython, would be the best option (if that is possible).

Question 2

Welcome to Code Review! The current question title, which states your concerns about the code, is too general to be useful here. Please edit to the site standard, which is for the title to simply state the task accomplished by the code. Please see How to get the best value out of Code Review: Asking Questions for guidance on writing good question titles.

Question 3

I find the code hard to read.
This may start with having no clue what it is supposed to achieve when I start reading:
Document your code. In the code.

I don't like

    if ‹condition›:
        ...
        return ‹whatever›
    else:
        ‹first statement when not ‹condition››
        ...

- just drop the else.

I see a lot of repeated accesses to row improving neither readability nor speed.
Try introducing variables.
The "untested" return branches seem to be controlled by one condition more than the paired return without " (untested)":
Check the common condition first (and once, only)

def selection_n(row):
    win_weight = row["win_weight"]
    win_weight_1_1 = win_weight == 1.1
    btts_o_2_5_weight = row["btts_o_2_5_weight"]
    btts_o_2_5_weight_0_4 = btts_o_2_5_weight == 0.4
    predicted_score_difference = row["predicted_score_difference"]
    predicted_score_difference_greater = predicted_score_difference > win_weight
    predicted_bttso2_5_n_greater = row["predicted_bttso2.5_n"] > btts_o_2_5_weight
    o_2_5_weight = row["o_2_5_weight"]
    predicted_o2_5_n_greater = row["predicted_o2.5_n"] > o_2_5_weight
    o_2_5_weight_0_4 = o_2_5_weight == 0.4
    btts_weight = row["btts_weight"]
    btts_weight_0_4 = btts_weight == 0.4
    u_4_5_weight_0_4 = row["u_4_5_weight"] == 0.4
    predicted_btts_n_greater = row["predicted_btts_n"] > btts_weight
    predicted_u2_5_n_greater = row["predicted_u2.5_n"] > row["u_2_5_weight"]
    predicted_u4_greater = row["predicted_u4"] > row["u_4_5_weight"]
    predicted_total_score_greater = row["predicted_total_score"] > row["o_1_5_weight"]
    # every "W ..." selection requires predicted_score_difference > win_weight
    if predicted_score_difference_greater:
        if predicted_bttso2_5_n_greater:
            if win_weight_1_1 or btts_o_2_5_weight_0_4:
                return "W & BTTS O 2.5 (untested)"
            return "W & BTTS O 2.5"
        if predicted_btts_n_greater:
            if win_weight_1_1 or btts_weight_0_4:
                return "W & BTTS (untested)"
            return "W & BTTS"
        if predicted_o2_5_n_greater:
            if win_weight_1_1 or o_2_5_weight_0_4:
                return "W & O 2.5 (untested)"
            return "W & O 2.5"
        if predicted_total_score_greater:
            if win_weight_1_1 or row["o_1_5_weight"] == 3.2:
                return "W & O 1.5 (untested)"
            return "W & O 1.5"
        if predicted_u2_5_n_greater:
            if win_weight_1_1 or row["u_2_5_weight"] == 0.4:
                return "W & U 2.5 (untested)"
            return "W & U 2.5"
        if predicted_u4_greater:
            if win_weight_1_1 or u_4_5_weight_0_4:
                return "W & U 4.5 (untested)"
            return "W & U 4.5"
        if win_weight_1_1:
            return "W (untested)"
        return "W"
    # predicted_score_difference <= win_weight
    if predicted_o2_5_n_greater:
        if o_2_5_weight_0_4:
            return "O 2.5 (untested)"
        return "O 2.5"
    if predicted_bttso2_5_n_greater:
        if btts_o_2_5_weight_0_4:
            return "BTTS O 2.5 (untested)"
        return "BTTS O 2.5"
    if predicted_btts_n_greater:
        if btts_weight_0_4:
            return "BTTS (untested)"
        return "BTTS"
    if predicted_u2_5_n_greater:
        if row["u_2_5_weight"] == 0.4:
            return "U 2.5 (untested)"
        return "U 2.5"
    if row["dnb_weight"] < predicted_score_difference < win_weight:
        if row["dnb_weight"] == 0.7:
            return "DNB (untested)"
        return "DNB"
    if predicted_u4_greater:
        if u_4_5_weight_0_4:
            return "U 4.5 (untested)"
        return "U 4.5"
    if predicted_total_score_greater and predicted_u4_greater:
        if row["o_1_5_weight"] == 0.4 or u_4_5_weight_0_4:
            return "O 1.5 and U 4.5 (untested)"
        return "O 1.5 and U 4.5"
    if row["predicted_bttsu2.5_n"] > row["btts_u_2_5_weight"]:
        if row["btts_u_2_5_weight"] == 0.4:
            return "U 2.5 & BTTS (untested)"
        return "U 2.5 & BTTS"

Potential bugs in selection_n():
- in the question, the last two conditions are exactly the same; I took the liberty to guess the intention
- if none of the conditions match, it returns None

a helper function reduces repetition and bulk:

def or_untested(selection, literal):
    """ return selection matches literal or its extension with " (untested)". """
    return selection == literal or selection == literal + " (untested)"
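
If the row-wise functions are later vectorised, the same helper can be applied to a whole column at once. The following is only a minimal sketch, assuming pandas is available; `or_untested_mask` is a hypothetical name that does not appear in the original code:

```python
import pandas as pd


def or_untested_mask(selection: pd.Series, literal: str) -> pd.Series:
    """Boolean mask: True where selection equals literal or literal + " (untested)"."""
    return selection.isin([literal, literal + " (untested)"])


# Made-up sample column, including a missing selection.
selections = pd.Series(["W", "W (untested)", "W & BTTS", None])
mask = or_untested_mask(selections, "W")
```

The mask can then be combined with other column-wise conditions using `&` and `|`.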

selection_match_n() seems to contain a lot of checks subsumed by later, less narrow constraints:

def selection_match_n(row):
    if pd.isna(row["home_score"]) or pd.isna(row["away_score"]):
        return "no_result"
    selection = row["selection_n"]
    if pd.isnull(selection):
        return "no sel."
    match = row["result_match"] == 'match'
    predicted_draw = row["predicted_result"] == 'draw'
    total_score = row["total_score"]
    if (match and not predicted_draw
            and (selection == 'W' or selection.startswith("W "))):
#       if row["home_score"] > 0
#               and row["away_score"] > 0
#               and or_untested(selection, 'W & BTTS'):
##          if total_score > 2
##                  and or_untested(selection, 'W & BTTS O 2.5'):
##              return "match"
#           return "match"
#       if total_score > 1
#           if (total_score > 2
#                   and or_untested(selection, 'W & O 2.5')):
#               return "match"
#           if or_untested(selection, 'W & O 1.5'):
#               return "match"
#       if total_score < 5
#           if (total_score < 3
#                   and or_untested(selection, 'W & U 2.5')):
#               return "match"
#           if or_untested(selection, "W & U 4.5"):
#               return "match"
        return "match"

    if total_score > 2 and or_untested(selection, 'O 2.5'):
        return "match"
    both_scored = row["home_score"] > 0 and row["away_score"] > 0
    if both_scored:
#       if total_score > 2 and or_untested(selection, 'BTTS O 2.5'):
#           return "match"
        if or_untested(selection, 'BTTS'):
            return "match"
    if total_score < 3 and or_untested(selection, 'U 2.5'):
        return "match"
    if ((match or row["result"] == 'draw' or predicted_draw)
            and or_untested(selection, "DNB")):
        return "match"
    if total_score < 5:
        if or_untested(selection, 'U 4.5'):
            return "match"
        if 1 < total_score and or_untested(selection, 'O 1.5 and U 4.5'):
            return "match"
    if both_scored and total_score < 3 and or_untested(selection, 'U 2.5 & BTTS'):
        return "match"
    return "no match"

selection_update_n()

row is not modified unless row["selection_match_n"] == 'no match': return upfront if !=.
if row["total_score"] > 2: ... elif row["total_score"] < 3: is weird: v <= 2 implies v < 3

    if row["selection_match_n"] == 'no match' and (row["selection_n"] == "W" or row["selection_n"] == "W (untested)") and row["result_match"] == 'no match' and row["predicted_result"] != 'draw':
        row["win_weight"] += 0.02

seems to be duplicated:
delete one, change to += 0.04 if appropriate.

def selection_update_n(row):
    if row["selection_match_n"] != 'no match':
        return row
    selection = row["selection_n"]
    match = row["result_match"] == 'match'
    no_match = row["result_match"] == 'no match'
    ne_draw = row["predicted_result"] != 'draw'
    no_match_ne_draw = no_match and ne_draw
    one_score_zero = row["home_score"] == 0 or row["away_score"] == 0
    total_score = row["total_score"]
    if selection[0] == 'W':
        if or_untested(selection, 'W & BTTS O 2.5'):
            if no_match and ne_draw:
                row["win_weight"] += 0.02
            elif one_score_zero:
                # and total_score < 3:
                row["btts_o_2_5_weight"] += 0.02
            # elif one_score_zero:
            #     row["btts_o_2_5_weight"] += 0.02
        if or_untested(selection, 'W & BTTS') and ne_draw:
            if no_match and row["home_score"] >= 0 and row["away_score"] >= 0:
                row["win_weight"] += 0.02
            if one_score_zero:
                row["btts_weight"] += 0.02
        if or_untested(selection, 'W & O 2.5') and no_match_ne_draw:
            if total_score > 2:
                row["win_weight"] += 0.02
            elif total_score < 3:  #?!
                row["win_weight"] += 0.02
                row["o_2_5_weight"] += 0.02
        elif (or_untested(selection, 'W & O 2.5')
              and match and ne_draw and total_score < 3):
            row["o_2_5_weight"] += 0.02
        if (or_untested(selection, 'W & O 1.5') and no_match_ne_draw):
            row["win_weight"] += 0.02
            if total_score <= 1:
                row["o_1_5_weight"] += 0.02
        elif (or_untested(selection, 'W & O 1.5') and match
              and ne_draw and total_score < 2):
            row["o_1_5_weight"] += 0.02
        if (or_untested(selection, 'W & U 2.5') and no_match_ne_draw):
            row["win_weight"] += 0.02
            if total_score >= 3:
                row["u_2_5_weight"] += 0.02
        elif (or_untested(selection, 'W & U 2.5') and match
              and ne_draw and total_score > 2):
            row["u_2_5_weight"] += 0.02
        if or_untested(selection, "W & U 4.5"):
            if no_match_ne_draw:
                row["win_weight"] += 0.02
                if total_score >= 5:
                    row["u_4_5_weight"] += 0.02
            elif match and ne_draw and total_score > 4:
                row["u_4_5_weight"] += 0.02
        if or_untested(selection, "W") and no_match_ne_draw:
            row["win_weight"] += 0.02
    elif selection.startswith('BTTS'):
        if or_untested(selection, 'BTTS') and one_score_zero:
            row["btts_weight"] += 0.02
        elif (or_untested(selection, 'BTTS O 2.5')
              and (one_score_zero or total_score < 3)):
            row["btts_o_2_5_weight"] += 0.02
    elif selection.startswith('U '):
        if selection.startswith('U 2.5'):
            if (or_untested(selection, 'U 2.5 & BTTS')
                    and (one_score_zero or total_score > 2)):
                row["btts_u_2_5_weight"] += 0.0
            if or_untested(selection, 'U 2.5') and total_score > 2:
                row["u_2_5_weight"] += 0.02
        if or_untested(selection, 'U 4.5') and total_score > 4:
            row["u_4_5_weight"] += 0.02
    elif selection.startswith('O '):
        if or_untested(selection, 'O 2.5') and total_score < 3:
            row["o_2_5_weight"] += 0.02
        elif (or_untested(selection, 'O 1.5 and U 4.5')
              and (total_score < 2 or 4 < total_score)):
            row["o_1_5_weight"] += 0.02
            row["u_4_5_weight"] += 0.02
    elif (or_untested(selection, "DNB") and ne_draw
          and (row["result_match"] != 'no match' or row["result"] != 'draw')):
        row["dnb_weight"] += 0.02
    return row

Above, I went to some length in reducing bulk to improve readability.
I still don't see rhythm or rhyme.

Question 4

(Such refactoring should be done in presence of a test scaffold and using an environment supporting refactorings such as extract variable.)
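
One possible shape for such a test scaffold is a comparison of the old and the new function on the same sample rows. This is only a sketch under stated assumptions: `original_rule` and `refactored_rule` are hypothetical, shortened stand-ins for one if/elif pair from the question, the sample values are made up, and "no selection" is a placeholder default used here instead of None:

```python
import pandas as pd


def original_rule(row):
    # Stand-in for one if/elif pair in the original style (hypothetical, shortened).
    if row["o_2_5_weight"] == 0.4 and row["predicted_o2.5_n"] > row["o_2_5_weight"]:
        return "O 2.5 (untested)"
    elif row["predicted_o2.5_n"] > row["o_2_5_weight"]:
        return "O 2.5"
    return "no selection"


def refactored_rule(row):
    # Same logic, with the shared condition checked once.
    weight = row["o_2_5_weight"]
    if row["predicted_o2.5_n"] > weight:
        return "O 2.5 (untested)" if weight == 0.4 else "O 2.5"
    return "no selection"


def same_results(old, new, frame):
    """Return True if both row functions agree on every row of frame."""
    return frame.apply(old, axis=1).tolist() == frame.apply(new, axis=1).tolist()


# Made-up rows that exercise each branch at least once.
sample = pd.DataFrame({
    "o_2_5_weight": [0.4, 0.48, 0.4, 0.54],
    "predicted_o2.5_n": [0.69, 0.50, 0.30, 0.53],
})
agree = same_results(original_rule, refactored_rule, sample)
```

Running the comparison on a few thousand real rows before and after each refactoring step would catch accidental changes in behaviour early.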

Question 5

Do everything that @greybeard says - and then vectorise. You have a selection_n function that is the subject of an apply; the apply needs to go away and the logic in selection_n needs to be pulled out one dimension. I'm certainly not going to demonstrate how to do this in full because your logic is quite long, but as one example let's look at

df.apply(selection_n, axis=1)

with the very first condition:

    if (row["win_weight"] == 1.1 or row["btts_o_2_5_weight"] == 0.4) and row["predicted_score_difference"] > row["win_weight"] and row["predicted_bttso2.5_n"] > row["btts_o_2_5_weight"]:
        return "W & BTTS O 2.5 (untested)"

That becomes:

df.loc[
    (
        (df['win_weight'] == 1.1) | (df['btts_o_2_5_weight'] == 0.4)
    )
    & (
        df['predicted_score_difference'] > df['win_weight']
    )
    & (
        df['predicted_bttso2.5_n'] > df['btts_o_2_5_weight']
    ),
    'selection_n',
] = 'W & BTTS O 2.5 (untested)'

Rewritten like this your speed should increase.
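
One thing to watch when translating a chain of early returns: the first matching branch wins, whereas successive `.loc` assignments let the last matching one win. `numpy.select` keeps first-match semantics. The following is a minimal sketch rather than a full translation: it covers only six of the labels, the sample values are made up, and the empty-string default is an assumption standing in for the None that the row-wise function returns:

```python
import numpy as np
import pandas as pd

# Made-up sample rows using column names from the question.
df = pd.DataFrame({
    "win_weight": [1.1, 1.2, 1.1, 1.2],
    "btts_o_2_5_weight": [0.4, 0.44, 0.4, 0.4],
    "predicted_score_difference": [1.5, 1.5, 0.5, 0.5],
    "predicted_bttso2.5_n": [0.6, 0.6, 0.6, 0.3],
})

# Each comparison is computed once for the whole column.
wins = df["predicted_score_difference"] > df["win_weight"]
btts_o = df["predicted_bttso2.5_n"] > df["btts_o_2_5_weight"]
win_untested = df["win_weight"] == 1.1
btts_o_untested = df["btts_o_2_5_weight"] == 0.4

# Conditions are listed in priority order; np.select uses the first that is True.
conditions = [
    wins & btts_o & (win_untested | btts_o_untested),
    wins & btts_o,
    wins & win_untested,
    wins,
    btts_o & btts_o_untested,
    btts_o,
]
labels = [
    "W & BTTS O 2.5 (untested)",
    "W & BTTS O 2.5",
    "W (untested)",
    "W",
    "BTTS O 2.5 (untested)",
    "BTTS O 2.5",
]
df["selection_n"] = np.select(conditions, labels, default="")
```

The remaining branches would be added to both lists in the same order as they appear in the original function.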

Question 6

Can you please provide an example for an if-else combo? Since the function is built from if-else blocks, recreating it from such an example would be relatively simple.

Question 7

@PyNoob Since an if-else implies two different conditions, there would be two different assignments. The first would evaluate condition a and the second would evaluate the inverse, condition ~a.
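
A minimal sketch of that idea, using made-up rows and a simplified version of the "O 2.5" check; the column name `selection_match_where` is hypothetical and only there to show the single-expression alternative:

```python
import numpy as np
import pandas as pd

# Made-up sample rows.
df = pd.DataFrame({
    "total_score": [4, 2, 1, 3],
    "selection_n": ["O 2.5", "O 2.5", "O 2.5 (untested)", "O 2.5"],
})

# Condition a: an "O 2.5" selection is met when more than two goals were scored.
a = df["total_score"] > 2

# Two assignments, one per branch: a for the if, ~a for the else.
df.loc[a, "selection_match_n"] = "match"
df.loc[~a, "selection_match_n"] = "no match"

# The same if/else written as a single expression.
df["selection_match_where"] = np.where(a, "match", "no match")
```

Both columns end up with identical values; `np.where` avoids evaluating and indexing with the mask twice.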

greybeard · Answer 1 · 2023-08-14 08:20:10Z

I find the code hard to read.
This may start with having no clue what it is supposed to achieve when I start reading:
Document your code. In the code.

I don't like

 if ‹condition›:
 ...
 return ‹whatever›
 else:
 ‹first statement when not ‹condition››
 ...

- just drop the else.

I see a lot of repeated accesses to row improving neither readability nor speed.
Try introducing variables.
The "untested" return branches seem to be controlled by one condition more than the paired return without " (untested)":
Check the common condition first (and once, only)

def selection_n(row):
 win_weight_1_1 = row["win_weight"] == 1.1
 btts_o_2_5_weight = row["btts_o_2_5_weight"]
 btts_o_2_5_weight_0_4 = btts_o_2_5_weight == 0.4
 predicted_score_difference = row["predicted_score_difference"]
 predicted_score_difference_greater = predicted_score_difference > win_weight
 predicted_bttso2_5_n_greater = row["predicted_bttso2.5_n"] > btts_o_2_5_weight
 o_2_5_weight = row["o_2_5_weight"]
 predicted_o2_5_n_greater = row["predicted_o2.5_n"] > o_2_5_weight
 o_2_5_weight_0_4 = o_2_5_weight == 0.4
 btts_weight = row["btts_weight"]
 btts_weight_0_4 = btts_weight == 0.4
 u_4_5_weight_0_4 = row["u_4_5_weight"] == 0.4
 predicted_btts_n_greater = row["predicted_btts_n"] > btts_weight
 predicted_u2_5_n_greater = row["predicted_u2.5_n"] > row["u_2_5_weight"]
 predicted_u4_greater = row["predicted_u4"] > row["u_4_5_weight"]
 if row["predicted_total_score"] > row["o_1_5_weight"]:
 if predicted_bttso2_5_n_greater:
 if win_weight_1_1 or btts_o_2_5_weight_0_4:
 return "W & BTTS O 2.5 (untested)"
 return "W & BTTS O 2.5"
 if predicted_btts_n_greater:
 if win_weight_1_1 or btts_weight_0_4:
 return "W & BTTS (untested)"
 return "W & BTTS"
 if predicted_o2_5_n_greater:
 if win_weight_1_1 or o_2_5_weight:
 return "W & O 2.5 (untested)"
 return "W & O 2.5"
 if predicted_total_score_greater:
 if win_weight_1_1 or row["o_1_5_weight"] == 3.2:
 return "W & O 1.5 (untested)"
 return "W & O 1.5"
 if predicted_u2_5_n_greater:
 if win_weight_1_1 or row["u_2_5_weight"] == 0.4:
 return "W & U 2.5 (untested)"
 return "W & U 2.5"
 if predicted_u4_greater: 
 if win_weight_1_1 or u_4_5_weight_0_4:
 return "W & U 4.5 (untested)"
 return "W & U 4.5"
 if win_weight_1_1:
 return "W (untested)"
 return "W"
 # row["predicted_total_score"] <= row["o_1_5_weight"]
 if predicted_bttso2_5_n_greater:
 if o_2_5_weight:
 return "O 2.5 (untested)"
 return "O 2.5"
 if predicted_bttso2_5_n_greater:
 if btts_o_2_5_weight_0_4:
 return "BTTS O 2.5 (untested)"
 return "BTTS O 2.5"
 if predicted_btts_n_greater:
 if btts_weight_0_4:
 return "BTTS (untested)"
 return "BTTS"
 if predicted_u2_5_n_greater:
 if row["u_2_5_weight"] == 0.4:
 return "U 2.5 (untested)"
 return "U 2.5"
 if row["dnb_weight"] < predicted_score_difference < row["win_weight"]:
 if row["dnb_weight"] == 0.7:
 return "DNB (untested)"
 return "DNB"
 if predicted_u4_greater:
 if u_4_5_weight_0_4:
 return "U 4.5 (untested)"
 return "U 4.5"
 if predicted_total_score_greater and predicted_u4_greater:
 if row["o_1_5_weight"] == 0.4 or u_4_5_weight_0_4:
 return "O 1.5 and U 4.5 (untested)"
 return "O 1.5 and U 4.5"
 if row["predicted_bttsu2.5_n"] > row["btts_u_2_5_weight"]:
 if row["btts_u_2_5_weight"] == 0.4:
 return "U 2.5 & BTTS (untested)"
 return "U 2.5 & BTTS"

Potential bugs in selection_n():
- in the question, the last two conditions are exactly the same; I took the liberty to guess the intention
- if none of the conditions match, it returns None

a helper function reduces repetition and bulk:

def or_untested(selection, literal):
 """ return selection matches literal or its extension with " (untested)". """
 return selection == literal or selection == literal + " (untested)"

selection_match_n() seems to contain a lot of checks subsumed by later, less narrow constraints:

def selection_match_n(row):
 if pd.isna(row["home_score"]) or pd.isna(row["away_score"]):
 return "no_result"
 selection = row["selection_n"]
 if pd.isnull(selection):
 return "no sel."
 match = row["result_match"] == 'match'
 predicted_draw = row["predicted_result"] != 'draw'
 total_score = row["total_score"]
 if (match and predicted_draw
 and selection == 'W' or selection.starts_with("W ")):
# if row["home_score"] > 0
# and row["away_score"] > 0
# and or_untested(selection, 'W & BTTS'):
## if total_score > 2
## and or_untested(selection, 'W & BTTS O 2.5'):
## return "match"
# return "match"
# if total_score > 1
# if (total_score > 2
# and or_untested(selection, 'W & O 2.5')):
# return "match"
# if or_untested(selection, 'W & O 1.5'):
# return "match"
# if total_score < 5
# if (total_score < 3
# and or_untested(selection, 'W & U 2.5')):
# return "match"
# if or_untested(selection, "W & U 4.5"):
# return "match"
 return "match"
 
 if total_score > 2 and or_untested(selection, 'O 2.5'):
 return "match"
 both_scored = row["home_score"] > 0 and row["away_score"] > 0
 if both_scored:
# if total_score > 2 and or_untested(selection, 'BTTS O 2.5'):
# return "match"
 if or_untested(selection, 'BTTS'):
 return "match"
 if total_score < 3 and or_untested(selection, 'U 2.5'):
 return "match"
 if ((match or row["result"] == 'draw' or predicted_draw)
 and or_untested(selection, "DNB")):
 return "match"
 if total_score < 5:
 if or_untested(selection, 'U 4.5'):
 return "match"
 if 1 < total_score and or_untested(selection, 'O 1.5 and U 4.5'):
 return "match"
 if both_scored and total_score < 3 and or_untested(selection, 'U 2.5'):
 return "match"
 return "no match"

selection_update_n()

The row is modified only if row["selection_match_n"] == 'no match', so return up front in every other case.
if row["total_score"] > 2: ... elif row["total_score"] < 3: is odd: for integer scores, "not > 2" already implies "< 3", so the elif is just an else.

 if row["selection_match_n"] == 'no match' and (row["selection_n"] == "W" or row["selection_n"] == "W (untested)") and row["result_match"] == 'no match' and row["predicted_result"] != 'draw':
 row["win_weight"] += 0.02

seems to be duplicated:
delete one, change to += 0.04 if appropriate.

def selection_update_n(row):
 if row["selection_match_n"] != 'no match':
 return row
 selection = row["selection_n"]
 match = row["result_match"] == 'match'
 no_match = row["result_match"] == 'no match'
 ne_draw = row["predicted_result"] != 'draw'
 no_match_ne_draw = no_match and ne_draw
 one_score_zero = row["home_score"] == 0 or row["away_score"] == 0
 total_score = row["total_score"]
 if selection[0] == 'W':
 if or_untested(selection, 'W & BTTS O 2.5'):
 if no_match and ne_draw:
 row["win_weight"] += 0.02
 elif one_score_zero:
 # and total_score < 3:
 row["btts_o_2_5_weight"] += 0.02
 # elif one_score_zero:
 # row["btts_o_2_5_weight"] += 0.02
 if or_untested(selection, 'W & BTTS') and ne_draw:
 if no_match and row["home_score"] >= 0 and row["away_score"] >= 0:
 row["win_weight"] += 0.02
 if one_score_zero:
 row["btts_weight"] += 0.02
 if or_untested(selection, 'W & O 2.5') and no_match_ne_draw:
 if total_score > 2:
 row["win_weight"] += 0.02
 elif total_score < 3: #?!
 row["win_weight"] += 0.02
 row["o_2_5_weight"] += 0.02
 elif (or_untested(selection, 'W & O 2.5')
 and match and ne_draw and total_score < 3):
 row["o_2_5_weight"] += 0.02
 if (or_untested(selection, 'W & O 1.5') and no_match_ne_draw):
 row["win_weight"] += 0.02
 if total_score <= 1:
 row["o_1_5_weight"] += 0.02
 elif (or_untested(selection, 'W & O 1.5') and match
 and ne_draw and total_score < 2):
 row["o_1_5_weight"] += 0.02
 if (or_untested(selection, 'W & U 2.5') and no_match_ne_draw):
 row["win_weight"] += 0.02
 if total_score >= 3:
 row["win_weight"] += 0.02
 elif (or_untested(selection, 'W & U 2.5') and match
 and ne_draw and total_score > 2):
 row["u_2_5_weight"] += 0.02
 if or_untested(selection, "W & U 4.5"):
 if no_match_ne_draw:
 row["win_weight"] += 0.02
 if total_score >= 5:
 row["u_4_5_weight"] += 0.02
 elif match and ne_draw and total_score > 4:
 row["u_4_5_weight"] += 0.02
 if or_untested(selection, "W") and no_match_ne_draw:
 row["win_weight"] += 0.02
 elif selection.startswith('BTTS'):
 if or_untested(selection, 'BTTS') and one_score_zero:
 row["btts_weight"] += 0.02
 elif (or_untested(selection, 'BTTS O 2.5')
 and (one_score_zero or total_score < 3)):
 row["btts_o_2_5_weight"] += 0.02
 elif selection.startswith('U '):
 if selection.startswith('U 2.5'):
 if (or_untested(selection, 'U 2.5 & BTTS')
 and (one_score_zero or total_score > 2)):
 row["btts_u_2_5_weight"] += 0.02
 if or_untested(selection, 'U 2.5') and total_score > 2:
 row["u_2_5_weight"] += 0.02
 if or_untested(selection, 'U 4.5') and total_score > 4:
 row["u_4_5_weight"] += 0.02
 elif selection.startswith('O '):
 if or_untested(selection, 'O 2.5') and total_score < 3:
 row["o_2_5_weight"] += 0.02
 elif (or_untested(selection, 'O 1.5 and U 4.5')
 and (total_score < 2 or 4 < total_score)):
 row["o_1_5_weight"] += 0.02
 row["u_4_5_weight"] += 0.02
 elif (or_untested(selection, "DNB") and ne_draw
 and (row["result_match"] != 'no match' or row["result"] != 'draw')):
 row["dnb_weight"] += 0.02
 return row

Above, I went to some length in reducing bulk to improve readability.
I still don't see rhythm or rhyme.

(Such refactoring should be done in the presence of a test scaffold and in an environment that supports refactorings such as extract variable.)
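
A minimal sketch of such a test scaffold, assuming the old and new versions of a function can both be applied to the same sample rows. `selection_old`, `selection_new` and `find_mismatches` are hypothetical stand-ins covering only the DNB branch, the sample values are made up, and the `"none"` fallback is a placeholder used here instead of the implicit None.

```python
import pandas as pd


def selection_old(row):
    # Stand-in for the unrefactored logic (DNB branch only, illustrative).
    if (row["dnb_weight"] < row["predicted_score_difference"] < row["win_weight"]
            and row["dnb_weight"] == 0.7):
        return "DNB (untested)"
    elif row["dnb_weight"] < row["predicted_score_difference"] < row["win_weight"]:
        return "DNB"
    return "none"  # placeholder; the original falls through to None


def selection_new(row):
    # Stand-in for the refactored logic: shared condition tested once.
    if row["dnb_weight"] < row["predicted_score_difference"] < row["win_weight"]:
        if row["dnb_weight"] == 0.7:
            return "DNB (untested)"
        return "DNB"
    return "none"  # same placeholder as above


def find_mismatches(old, new, frame):
    """Return the index labels of rows where old and new disagree."""
    expected = frame.apply(old, axis=1)
    actual = frame.apply(new, axis=1)
    return [i for i, e, a in zip(frame.index, expected, actual) if e != a]


# Made-up rows: untested DNB, tested DNB, and a row matching nothing.
sample = pd.DataFrame({
    "dnb_weight": [0.7, 0.72, 0.7],
    "win_weight": [1.1, 1.1, 1.1],
    "predicted_score_difference": [0.88, 0.9, 1.5],
})

mismatches = find_mismatches(selection_old, selection_new, sample)
```

An empty `mismatches` list after each refactoring step gives some confidence that behaviour is unchanged, at least for the rows in the sample; rows taken from the real data would make a stronger check.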

Reinderien · Answer 2 · 2023-08-15 12:57:04Z

Do everything that @greybeard says - and then vectorise. You have a selection_n function that is the subject of an apply; the apply needs to go away and the logic in selection_n needs to be pulled out one dimension. I'm certainly not going to demonstrate how to do this in full because your logic is quite long, but as one example let's look at

df.apply(selection_n, axis=1)

with the very first condition:

 if (row["win_weight"] == 1.1 or row["btts_o_2_5_weight"] == 0.4) and row["predicted_score_difference"] > row["win_weight"] and row["predicted_bttso2.5_n"] > row["btts_o_2_5_weight"]:
 return "W & BTTS O 2.5 (untested)"

That becomes:

df.loc[
 (
 (df['win_weight'] == 1.1) | (df['btts_o_2_5_weight'] == 0.4)
 )
 & (
 df['predicted_score_difference'] > df['win_weight']
 )
 & (
 df['predicted_bttso2.5_n'] > df['btts_o_2_5_weight']
 ),
 'selection_n',
] = 'W & BTTS O 2.5 (untested)'

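Rewritten like this, it should run considerably faster, because each comparison is evaluated once over whole columns instead of once per row in Python.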

Can you please provide an example for an if-else combo? Since the function is all if-else, recreating the rest from that example would be relatively simple.
@PyNoob Since an if-else implies two different conditions, there would be two different assignments. The first would evaluate condition a and the second would evaluate the inverse, condition ~a.
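
One way to express an if/elif chain without spelling out each inverse by hand is `numpy.select`, which takes the first condition that holds for each row. The sketch below is only an illustration of that alternative for the first two branches of `selection_n`: the sample values are made up, and the `"no selection"` default is a hypothetical placeholder for rows where the original function would fall through.

```python
import numpy as np
import pandas as pd

# Made-up sample rows using the column names from the question.
df = pd.DataFrame({
    "win_weight": [1.1, 1.2, 1.1],
    "btts_o_2_5_weight": [0.4, 0.5, 0.4],
    "predicted_score_difference": [1.5, 1.5, 0.2],
    "predicted_bttso2.5_n": [0.6, 0.6, 0.1],
})

# Part of the condition shared by the "if" and the "elif" branch.
beats_weights = (
    (df["predicted_score_difference"] > df["win_weight"])
    & (df["predicted_bttso2.5_n"] > df["btts_o_2_5_weight"])
)
# Extra requirement of the "(untested)" branch.
untested = (df["win_weight"] == 1.1) | (df["btts_o_2_5_weight"] == 0.4)

# Conditions are listed in priority order; the first match per row wins,
# which mirrors the early returns of the row-wise function.
conditions = [beats_weights & untested, beats_weights]
choices = ["W & BTTS O 2.5 (untested)", "W & BTTS O 2.5"]

# "no selection" is a placeholder default, not a value from the question.
df["selection_n"] = np.select(conditions, choices, default="no selection")
```

If separate `df.loc[...] = ...` assignments are used instead, the order of the assignments matters: a later assignment overwrites an earlier one, so each mask either has to exclude the rows claimed by higher-priority branches or the assignments have to run from lowest to highest priority.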
