I am using RandomForestRegressor to predict the number of goals scored in a given soccer match or set of matches. The data is as below:
dte.head()
    HomeTeam        AwayTeam      B365H    B365D    B365A    FTHG    FTAG
--  --------------  ----------  -------  -------  -------  ------  ------
 0  Leeds           Chelsea         4.8      3.9     1.68     nan     nan
 1  Crystal Palace  West Brom      2.28        3      3.4     nan     nan
 2  Everton         Burnley        1.95      3.2      4.2     nan     nan
 3  Fulham          Man City       10.5        5      1.3     nan     nan
 4  Southampton     Brighton        2.9     3.05     2.55     nan     nan
dtr.head()
    HomeTeam        AwayTeam       FTHG    FTAG
--  --------------  -----------  ------  ------
 0  Fulham          Arsenal           0       3
 1  Crystal Palace  Southampton       1       0
 2  Liverpool       Leeds             4       3
 3  West Ham        Newcastle         0       2
 4  West Brom       Leicester         0       3
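import pandas as pd
from sklearn import preprocessing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Encoding string features
def encode_features(df_train, df_test):
    features = ['HomeTeam', 'AwayTeam']
    # Fit on train and test together so every team name gets a label
    df_combined = pd.concat([df_train[features], df_test[features]])
    for feature in features:
        le = preprocessing.LabelEncoder()
        le = le.fit(df_combined[feature])
        df_train[feature] = le.transform(df_train[feature])
        df_test[feature] = le.transform(df_test[feature])
    return df_train, df_test

dtr, dte = encode_features(dtr, dte)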
# Keep only the columns of interest that each frame actually has
dtr_g = dtr.loc[:, dtr.columns.intersection(
    ['HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'B365H', 'B365D', 'B365A'])]
dte_g = dte.loc[:, dte.columns.intersection(
    ['HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'B365H', 'B365D', 'B365A'])]
# One training frame per target; the test frame carries no targets
dtr_g_h = dtr_g.drop(['FTAG'], axis=1)
dtr_g_a = dtr_g.drop(['FTHG'], axis=1)
dte_g = dte_g.drop(['FTHG', 'FTAG'], axis=1)
dte_g = dte_g.dropna()
# After Encoding
dte_g.head()
HomeTeam AwayTeam B365H B365D B365A Month Weekend/Weekday
-- ---------- ---------- ------- ------- ------- ------- -----------------
0 271 114 4.8 3.9 1.68 3 2
1 134 495 2.28 3 3.4 3 2
2 170 88 1.95 3.2 4.2 3 2
3 194 297 10.5 5 1.3 3 2
4 429 83 2.9 3.05 2.55 3 1
dtr_g_h.head()
HomeTeam AwayTeam B365H B365D B365A Month FTHG Weekend/Weekday
-- ---------- ---------- ------- ------- ------- ------- ------ -----------------
0 194 33 6 4.33 1.53 9 0 1
1 134 428 3.1 3.25 2.37 9 1 1
2 282 271 1.28 6 9.5 9 4 1
3 497 326 2.15 3.4 3.4 9 0 1
4 496 273 3.8 3.6 1.95 9 0 1
dtr_g_a.head()
HomeTeam AwayTeam B365H B365D B365A Month FTAG Weekend/Weekday
-- ---------- ---------- ------- ------- ------- ------- ------ -----------------
0 194 33 6 4.33 1.53 9 3 1
1 134 428 3.1 3.25 2.37 9 0 1
2 282 271 1.28 6 9.5 9 3 1
3 497 326 2.15 3.4 3.4 9 2 1
4 496 273 3.8 3.6 1.95 9 3 1
# Predicting goals
X_g_h = dtr_g_h.drop(['FTHG'], axis=1)
y_g_h = dtr_g_h['FTHG']
X_g_a = dtr_g_a.drop(['FTAG'], axis=1)
y_g_a = dtr_g_a['FTAG']
print("Splitting for Home")
train_X_g_h, val_X_g_h, train_y_g_h, val_y_g_h = train_test_split(X_g_h, y_g_h, test_size=0.2, random_state=1)
print("Splitting for away goals")
train_X_g_a, val_X_g_a, train_y_g_a, val_y_g_a = train_test_split(X_g_a, y_g_a, test_size=0.2, random_state=1)
print("Running Random Forest model")
rf_model_on_full_data_g_h = RandomForestRegressor()
rf_model_on_full_data_g_a = RandomForestRegressor()
print("Fitting for home goals")
rf_model_on_full_data_g_h.fit(X_g_h, y_g_h)
rf_model_on_full_data_g_a.fit(X_g_a, y_g_a)
print("Predicting goals for Home Team")
test_preds_h_g = rf_model_on_full_data_g_h.predict(dte_g)
print("Predicting goals for Away Team")
test_preds_a_g = rf_model_on_full_data_g_a.predict(dte_g)
result = pd.DataFrame({
    'League': dte_input.League,
    'Match DateTime': dte_input.DateTime,
    'Home Team': dte_input.HomeTeam,
    'Away Team': dte_input.AwayTeam,
    'Full time Home Goals': test_preds_h_g,
    'Full time Away Goals': test_preds_a_g,
})
result.head()
|   | League                 | Match DateTime             | Home Team      | Away Team        | Full time Home Goals | Full time Away Goals |
|---|------------------------|----------------------------|----------------|------------------|----------------------|----------------------|
| 0 | English Premier League | 2021-03-12 23:00:00.000001 | Leeds          | Chelsea          | 1.23                 | 1.84                 |
| 1 | English Premier League | 2021-03-12 23:00:00.000001 | Crystal Palace | West Brom        | 1.65                 | 1.13                 |
| 2 | English Premier League | 2021-03-12 23:00:00.000001 | Everton        | Burnley          | 1.4                  | 0.73                 |
| 3 | English Premier League | 2021-03-12 23:00:00.000001 | Fulham         | Man City         | 0.59                 | 2.35                 |
| 4 | English Premier League | 2021-03-13 23:00:00.000001 | Southampton    | Brighton         | 1.34                 | 1.36                 |
| 5 | English Premier League | 2021-03-13 23:00:00.000001 | Leicester      | Sheffield United | 1.75                 | 0.92                 |
| 6 | English Premier League | 2021-03-13 23:00:00.000001 | Arsenal        | Tottenham        | 1.37                 | 1.19                 |
My questions are predominantly:
- Am I applying the regression correctly? I am getting output which (within my understanding of soccer matches and their outcome boundaries) is plausible compared to real-world results. I am only partially aware of how it comes together.
- In layman's terms, I am asking two questions: 1. Predict the Home team's goals when two teams play, based on historical data. 2. Predict the Away team's goals in the same manner. But these are discrete questions answered separately, while a soccer match is not discrete but dynamic. (I can probably word this better.) Am I approaching this problem correctly?
- Is there any way I can use .predict in a manner that I can get the HomeTeam and AwayTeam predictions at the same time? (AFAIK, .predict can be used for only one variable, not multiple variables.)
If I apply GridSearchCV, the results don't improve (with respect to real-world comparison), hence there is no value in the processing resources/result tradeoff. Can I improve on this algorithm or apply better sklearn .predict options?
- Can I see the full code? Some of the dataframes are not defined... (Aruaman Ase, Jul 16, 2021 at 20:53)
- I have since changed the code and thus use updated dataframes. However, the question remains the same and unanswered, lol. Also, to answer your question and in keeping with CR principles, I will dig into the archives and edit the question to help answer it. (PyNoob, Jul 17, 2021 at 2:08)
1 Answer
- Am I applying the regression correctly?
Yes, within the limits of the problem you are posing.
In particular, I didn't notice any data leakage.
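One thing the posted code does not do is score a model on the validation split it creates. The following is a minimal sketch of how that check could look, assuming made-up data in place of the real matches (the synthetic frame, the `model_mae` and `baseline_mae` names, and the choice of mean absolute error are illustrative assumptions, not part of the original code).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real match data (all values are made up).
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "HomeTeam": rng.integers(0, 20, n),
    "AwayTeam": rng.integers(0, 20, n),
    "B365H": rng.uniform(1.2, 10.0, n),
    "B365D": rng.uniform(3.0, 6.0, n),
    "B365A": rng.uniform(1.2, 10.0, n),
})
y = rng.poisson(1.5, n)  # stand-in for FTHG

train_X, val_X, train_y, val_y = train_test_split(
    X, y, test_size=0.2, random_state=1)

# Fit on the training part only, then score on the held-out part.
model = RandomForestRegressor(random_state=1)
model.fit(train_X, train_y)
model_mae = mean_absolute_error(val_y, model.predict(val_X))

# Baseline: always predict the mean of the training targets.
baseline_mae = mean_absolute_error(
    val_y, np.full(len(val_y), train_y.mean()))

print(f"model MAE: {model_mae:.3f}, baseline MAE: {baseline_mae:.3f}")
```

If the model cannot beat the constant baseline on held-out matches, the "plausible looking" predictions are probably close to the league average rather than informed by the features.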
- Predict {Home, Away} team goals ... Am I approaching this problem correctly?
It's not clear to me that that's the most interesting problem to pose. The London bookies typically have some point spread favoring one of the teams. You might prefer to predict the goals delta, with {Home, Away} being boolean features on the teams.
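A rough sketch of the goal-delta idea, assuming synthetic data and using only the three odds columns as features (the `goal_diff` column name and the made-up numbers are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic historical matches (all values are made up).
rng = np.random.default_rng(1)
n = 300
matches = pd.DataFrame({
    "B365H": rng.uniform(1.2, 10.0, n),
    "B365D": rng.uniform(3.0, 6.0, n),
    "B365A": rng.uniform(1.2, 10.0, n),
    "FTHG": rng.poisson(1.5, n),
    "FTAG": rng.poisson(1.2, n),
})

# Single target: home goals minus away goals.
matches["goal_diff"] = matches["FTHG"] - matches["FTAG"]

features = ["B365H", "B365D", "B365A"]
delta_model = RandomForestRegressor(random_state=1)
delta_model.fit(matches[features], matches["goal_diff"])

# Two upcoming fixtures, described only by their odds.
upcoming = pd.DataFrame({
    "B365H": [4.8, 1.95],
    "B365D": [3.9, 3.2],
    "B365A": [1.68, 4.2],
})
pred_diff = delta_model.predict(upcoming)
print(pred_diff)  # positive favors the home side, negative the away side
```

One model with one target replaces the two separate home and away models, and the sign of the prediction reads directly as which side is favored.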
The ideas behind Elo numeric ratings might prove helpful. Knowing the Elo rating of the Home team, and of the Away team, would be very useful to a learner, and at inference time.
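To make the Elo suggestion concrete, here is a minimal sketch of the standard Elo update applied to match results. The starting rating of 1500, the K-factor of 20, and the helper names `expected_score` and `update_elo` are assumptions chosen for illustration; it also ignores home advantage and margin of victory, which a real rating system would likely want to model.

```python
from collections import defaultdict


def expected_score(rating_a, rating_b):
    """Standard Elo expected score of A against B (between 0 and 1)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings, home, away, home_goals, away_goals, k=20.0):
    """Return the pre-match ratings, then update `ratings` in place."""
    pre_home, pre_away = ratings[home], ratings[away]
    if home_goals > away_goals:
        score_home = 1.0
    elif home_goals == away_goals:
        score_home = 0.5
    else:
        score_home = 0.0
    exp_home = expected_score(pre_home, pre_away)
    ratings[home] = pre_home + k * (score_home - exp_home)
    ratings[away] = pre_away + k * ((1.0 - score_home) - (1.0 - exp_home))
    return pre_home, pre_away


# Every team starts at 1500 (an arbitrary but common choice).
ratings = defaultdict(lambda: 1500.0)

# The first three rows of dtr.head(), in chronological order.
results = [
    ("Fulham", "Arsenal", 0, 3),
    ("Crystal Palace", "Southampton", 1, 0),
    ("Liverpool", "Leeds", 4, 3),
]

rows = []
for home, away, hg, ag in results:
    pre_home, pre_away = update_elo(ratings, home, away, hg, ag)
    # Store the ratings as they stood BEFORE kickoff as features.
    rows.append({"HomeTeam": home, "AwayTeam": away,
                 "HomeElo": pre_home, "AwayElo": pre_away})

print(rows)
print(dict(ratings))
```

Using the pre-match ratings as features matters: the post-match rating already contains the result being predicted, so feeding it to the learner would leak the target.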
- any way I can use .predict in a manner that I can get the HomeTeam and AwayTeam predictions at the same time?
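A RandomForestRegressor is a perfectly sensible model to apply here. As you have written it, each model predicts just one target, so you need separate regressors for separate target variables. (scikit-learn's RandomForestRegressor can also be fit on a two-column y, in which case a single .predict call returns both targets.) As I mentioned, perhaps what you really want to predict is a difference.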
Neural nets and other model types may output a vector of target variables with each inference.
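For what it's worth, scikit-learn's tree ensembles also accept a two-column target, so a single RandomForestRegressor can return home and away goals from one .predict call. A minimal sketch on made-up data (the synthetic frames and the `multi_model` name are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic historical matches (all values are made up).
rng = np.random.default_rng(2)
n = 300
X = pd.DataFrame({
    "B365H": rng.uniform(1.2, 10.0, n),
    "B365D": rng.uniform(3.0, 6.0, n),
    "B365A": rng.uniform(1.2, 10.0, n),
})
# Two targets side by side: y has shape (n_samples, 2).
Y = pd.DataFrame({
    "FTHG": rng.poisson(1.5, n),
    "FTAG": rng.poisson(1.2, n),
})

multi_model = RandomForestRegressor(random_state=1)
multi_model.fit(X, Y)

upcoming = pd.DataFrame({
    "B365H": [4.8, 1.95],
    "B365D": [3.9, 3.2],
    "B365A": [1.68, 4.2],
})
preds = multi_model.predict(upcoming)  # one row per match, one column per target

result = pd.DataFrame(
    preds, columns=["Full time Home Goals", "Full time Away Goals"])
print(result)
```

Whether this beats two separate models is an empirical question; the trees now choose splits that reduce error on both targets jointly, which can help when the targets are correlated and hurt when they are not.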