\$\begingroup\$

I am using RandomForestRegressor to predict the number of goals scored in a given soccer match or set of matches. The data and code are as below:

dte.head()
    HomeTeam        AwayTeam      B365H    B365D    B365A    FTHG    FTAG
--  --------------  ----------  -------  -------  -------  ------  ------
 0  Leeds           Chelsea         4.8      3.9     1.68     nan     nan
 1  Crystal Palace  West Brom      2.28        3      3.4     nan     nan
 2  Everton         Burnley        1.95      3.2      4.2     nan     nan
 3  Fulham          Man City       10.5        5      1.3     nan     nan
 4  Southampton     Brighton        2.9     3.05     2.55     nan     nan
dtr.head()
    HomeTeam        AwayTeam       FTHG    FTAG
--  --------------  -----------  ------  ------
 0  Fulham          Arsenal           0       3
 1  Crystal Palace  Southampton       1       0
 2  Liverpool       Leeds             4       3
 3  West Ham        Newcastle         0       2
 4  West Brom       Leicester         0       3
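# Imports assumed by the snippets in this question
import pandas as pd
from sklearn import preprocessing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Encoding string features
def encode_features(df_train, df_test):
    features = ['HomeTeam', 'AwayTeam']
    # Fit on train and test together so every team name gets a label
    df_combined = pd.concat([df_train[features], df_test[features]])
    for feature in features:
        le = preprocessing.LabelEncoder()
        le = le.fit(df_combined[feature])
        df_train[feature] = le.transform(df_train[feature])
        df_test[feature] = le.transform(df_test[feature])
    return df_train, df_test
dtr, dte = encode_features(dtr, dte)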
dtr_g = dtr.loc[:, dte.columns.intersection(
 ['HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'B365H', 'B365D', 'B365A'])]
dte_g = dte.loc[:, dte.columns.intersection(
 ['HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'B365H', 'B365D', 'B365A'])]
dtr_g_h = dtr_g.drop(['FTAG'], axis=1)
dtr_g_a = dtr_g.drop(['FTHG'], axis=1)
dte_g = dte_g.drop(['FTHG', 'FTAG'], axis=1)
dte_g = dte_g.dropna()
# After Encoding
dte_g.head()
      HomeTeam    AwayTeam    B365H    B365D    B365A    Month    Weekend/Weekday
--  ----------  ----------  -------  -------  -------  -------  -----------------
 0         271         114      4.8      3.9     1.68        3                  2
 1         134         495     2.28        3      3.4        3                  2
 2         170          88     1.95      3.2      4.2        3                  2
 3         194         297     10.5        5      1.3        3                  2
 4         429          83      2.9     3.05     2.55        3                  1
dtr_g_h.head()
      HomeTeam    AwayTeam    B365H    B365D    B365A    Month    FTHG    Weekend/Weekday
--  ----------  ----------  -------  -------  -------  -------  ------  -----------------
 0         194          33        6     4.33     1.53        9       0                  1
 1         134         428      3.1     3.25     2.37        9       1                  1
 2         282         271     1.28        6      9.5        9       4                  1
 3         497         326     2.15      3.4      3.4        9       0                  1
 4         496         273      3.8      3.6     1.95        9       0                  1
dtr_g_a.head()
      HomeTeam    AwayTeam    B365H    B365D    B365A    Month    FTAG    Weekend/Weekday
--  ----------  ----------  -------  -------  -------  -------  ------  -----------------
 0         194          33        6     4.33     1.53        9       3                  1
 1         134         428      3.1     3.25     2.37        9       0                  1
 2         282         271     1.28        6      9.5        9       3                  1
 3         497         326     2.15      3.4      3.4        9       2                  1
 4         496         273      3.8      3.6     1.95        9       3                  1
# Predicting goals
X_g_h = dtr_g_h.drop(['FTHG'], axis=1)
y_g_h = dtr_g_h['FTHG']
X_g_a = dtr_g_a.drop(['FTAG'], axis=1)
y_g_a = dtr_g_a['FTAG']
print("Splitting for home goals")
train_X_g_h, val_X_g_h, train_y_g_h, val_y_g_h = train_test_split(X_g_h, y_g_h, test_size=0.2, random_state=1)
print("Splitting for away goals")
train_X_g_a, val_X_g_a, train_y_g_a, val_y_g_a = train_test_split(X_g_a, y_g_a, test_size=0.2, random_state=1)
print("Running Random Forest model")
rf_model_on_full_data_g_h = RandomForestRegressor()
rf_model_on_full_data_g_a = RandomForestRegressor()
# Note: both models are fitted on the full data, so the splits above are unused
print("Fitting for home goals")
rf_model_on_full_data_g_h.fit(X_g_h, y_g_h)
print("Fitting for away goals")
rf_model_on_full_data_g_a.fit(X_g_a, y_g_a)
print("Predicting goals for Home Team")
test_preds_h_g = rf_model_on_full_data_g_h.predict(dte_g)
print("Predicting goals for Away Team")
test_preds_a_g = rf_model_on_full_data_g_a.predict(dte_g)
result = pd.DataFrame({
 'League': dte_input.League,
 'Match DateTime': dte_input.DateTime,
 'Home Team': dte_input.HomeTeam,
 'Away Team': dte_input.AwayTeam,
 'Full time Home Goals': test_preds_h_g,
 'Full time Away Goals': test_preds_a_g,
 })
result.head():
| | League | Match DateTime | Home Team | Away Team | Full time Home Goals | Full time Away Goals |
|-----|------------------------|----------------------------|--------------------|---------------------|------------------------|------------------------|
| 0 | English Premier League | 2021-03-12 23:00:00.000001 | Leeds | Chelsea | 1.23 | 1.84 |
| 1 | English Premier League | 2021-03-12 23:00:00.000001 | Crystal Palace | West Brom | 1.65 | 1.13 |
| 2 | English Premier League | 2021-03-12 23:00:00.000001 | Everton | Burnley | 1.4 | 0.73 |
| 3 | English Premier League | 2021-03-12 23:00:00.000001 | Fulham | Man City | 0.59 | 2.35 |
| 4 | English Premier League | 2021-03-13 23:00:00.000001 | Southampton | Brighton | 1.34 | 1.36 |
| 5 | English Premier League | 2021-03-13 23:00:00.000001 | Leicester | Sheffield United | 1.75 | 0.92 |
| 6 | English Premier League | 2021-03-13 23:00:00.000001 | Arsenal | Tottenham | 1.37 | 1.19 |

My questions are mainly:

  1. Am I applying the regression correctly? The output is plausible for real soccer matches (the predictions fall within realistic score ranges), but I only partially understand how it comes together.
  2. In layman's terms, I am asking two questions: (1) predict the home team's goals when two teams play, based on historical data; (2) predict the away team's goals in the same manner. But these are discrete questions answered separately, while a soccer match is not discrete but dynamic (I can probably word this better). Am I approaching this problem correctly?
  3. Is there any way to use .predict so that I get the HomeTeam and AwayTeam predictions at the same time? (AFAIK .predict can only be used for one variable, not multiple variables.)

If I apply GridSearchCV, the results don't improve (compared with real-world results), so the extra processing isn't worth the resources. Can I improve on this algorithm, or apply better scikit-learn prediction options?

asked Mar 21, 2021 at 3:35
\$\endgroup\$
  • \$\begingroup\$ Can I see the full code? Some of the dataframes are not defined... \$\endgroup\$ Commented Jul 16, 2021 at 20:53
  • \$\begingroup\$ I have since changed the code and now use updated dataframes. However, the question remains the same and unanswered, lol. To answer your question, and in keeping with CR principles, I will dig into the archives and edit the question to help answer it. \$\endgroup\$ Commented Jul 17, 2021 at 2:08

1 Answer

\$\begingroup\$
  1. Am I applying the regression correctly?

Yes, within the limits of the problem you are posing.

In particular, I didn't notice any data leakage.
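
One thing worth noting: the posted code builds train/validation splits but then fits both models on the full data, so the validation rows are never used to measure anything. A minimal sketch of how a held-out check could look, on synthetic stand-in data (the arrays, the Poisson targets and the hyperparameters are my assumptions, not your frames):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded feature frame (hypothetical data):
# two team ids plus three bookmaker odds columns.
rng = np.random.default_rng(0)
n_matches = 400
X = np.column_stack([
    rng.integers(0, 20, n_matches),      # HomeTeam id
    rng.integers(0, 20, n_matches),      # AwayTeam id
    rng.uniform(1.2, 10.0, n_matches),   # B365H
    rng.uniform(3.0, 6.0, n_matches),    # B365D
    rng.uniform(1.2, 10.0, n_matches),   # B365A
])
y_home = rng.poisson(1.5, n_matches)     # stand-in for FTHG

train_X, val_X, train_y, val_y = train_test_split(
    X, y_home, test_size=0.2, random_state=1)

# Fit on the training split only, so the validation rows stay unseen.
model = RandomForestRegressor(n_estimators=100, random_state=1)
model.fit(train_X, train_y)

# Compare against a naive baseline that always predicts the training mean.
val_mae = mean_absolute_error(val_y, model.predict(val_X))
baseline_mae = mean_absolute_error(val_y, np.full(len(val_y), train_y.mean()))
print(f"model MAE: {val_mae:.3f}  baseline MAE: {baseline_mae:.3f}")
```

If the model's error on the held-out rows is not clearly below the baseline's, tuning with GridSearchCV is unlikely to help much, which may be what you observed.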

  2. Predict {Home, Away} team goals ... Am I approaching this problem correctly?

It's not clear to me that that's the most interesting problem to pose. The London bookies typically have some point spread favoring one of the teams. You might prefer to predict the goals delta, with {Home, Away} being boolean features on the teams.

The ideas behind Elo numeric ratings might prove helpful. Knowing the Elo rating of the Home team, and of the Away team, would be very useful to a learner, and at inference time.
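
To make the Elo idea concrete, here is a minimal sketch using the five results from your `dtr.head()`. The starting rating of 1500 and the K-factor of 20 are assumptions I picked for illustration, and there is no home-advantage term; the helper names are hypothetical.

```python
from collections import defaultdict

def expected_score(rating_a, rating_b):
    """Standard Elo expectation: probability-like score for side A."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_elo(home, away, home_goals, away_goals, k=20.0):
    """Return the updated (home, away) ratings after one match."""
    if home_goals > away_goals:
        score = 1.0
    elif home_goals == away_goals:
        score = 0.5
    else:
        score = 0.0
    delta = k * (score - expected_score(home, away))
    return home + delta, away - delta

# Results taken from dtr.head() in the question.
matches = [
    ("Fulham", "Arsenal", 0, 3),
    ("Crystal Palace", "Southampton", 1, 0),
    ("Liverpool", "Leeds", 4, 3),
    ("West Ham", "Newcastle", 0, 2),
    ("West Brom", "Leicester", 0, 3),
]

ratings = defaultdict(lambda: 1500.0)  # assumed starting rating
features = []
for home, away, hg, ag in matches:
    # Record the ratings as they stood BEFORE the match, so the
    # feature never contains information from the result itself.
    features.append((home, away, ratings[home], ratings[away]))
    ratings[home], ratings[away] = update_elo(ratings[home], ratings[away], hg, ag)

print(dict(ratings))
```

The pre-match ratings in `features` could replace or supplement the integer team labels, which carry no ordering information on their own.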

  3. Is there any way I can use .predict in a manner that I can get the HomeTeam and AwayTeam predictions at the same time?

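A RandomForestRegressor is a perfectly sensible model to apply here. Fitted on a single target column, as in your code, it predicts just one thing at a time, so you need separate models for separate targets. scikit-learn's implementation also accepts a two-column y if you want both goal counts from one .predict call. As I mentioned, perhaps what you really want to predict is a difference.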

Neural nets and other model types may output a vector of target variables with each inference.
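
As a minimal sketch of vector output with scikit-learn itself: `RandomForestRegressor.fit` accepts a `y` with one column per target, and `.predict` then returns one column per target. The data below is synthetic and the column meanings are assumptions standing in for your frames.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (hypothetical): team ids and odds as features.
rng = np.random.default_rng(0)
n_matches = 300
X = np.column_stack([
    rng.integers(0, 20, n_matches),      # HomeTeam id
    rng.integers(0, 20, n_matches),      # AwayTeam id
    rng.uniform(1.2, 10.0, n_matches),   # B365H
    rng.uniform(3.0, 6.0, n_matches),    # B365D
    rng.uniform(1.2, 10.0, n_matches),   # B365A
])

# Two targets side by side: column 0 plays FTHG, column 1 plays FTAG.
Y = np.column_stack([
    rng.poisson(1.5, n_matches),
    rng.poisson(1.2, n_matches),
])

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, Y)

# One call returns both predictions per match, shape (n_rows, 2).
preds = model.predict(X[:5])
home_pred, away_pred = preds[:, 0], preds[:, 1]
print(preds.shape)
```

Whether one joint model beats two separate ones is an empirical question; the joint forest shares its splits across both targets, which may or may not suit this data.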

answered Apr 29, 2024 at 22:10
\$\endgroup\$
