I am trying to solve a multi-class classification problem involving predicting the outcome of a football match (target variable = Win, Lose or Draw), with a dataset of 2280 rows covering six seasons of football data.
I have features with both numerical and categorical values (which I have encoded using one-hot encoding). The data is split into a train and test set, so that the test set is only the most recent season of data.
As this is my first machine learning project, I wanted to understand whether this overall process looks correct and whether there is anything I should be doing better or more optimally.
Splitting the data into train and test sets
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score
from sklearn.naive_bayes import MultinomialNB
# Assign our target variable to label
label = match_df['FTR']
# Flatten the label array
y = np.ravel(label)
# Assign all columns expect the FTR column to the features variable
X = match_df.loc[:, match_df.columns != 'FTR']
# Split our data into training and testing sets
# We set shuffle to false as we want to keep the order of the matches in the data frame so we can use the 2022/2023 season as our test set
# Use a test size of 0.1665 as this will give us 380 test samples which is the same as the number of matches in a season
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1665, random_state=0, shuffle=False)
Testing our base models, then performing hyper parameter tuning
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import GridSearchCV, StratifiedKFold
# Try without normalization also try min max scaler
# Normalize the data set as we have several features with different data scales
scaler = MinMaxScaler()
# Fit the scaler to the training set and transform the training set
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Create SVM model
svm_model = SVC(random_state=0)
# Create KNN model
knn_model = KNeighborsClassifier()
# Create Naive Bayes model
nb_model = GaussianNB()
# Create a dictionary of the models
models = {'KNN':knn_model, 'SVM':svm_model, 'Naive_Bayes':nb_model}
# Create the StratifiedKFold object
skf = StratifiedKFold(n_splits=10)
# Train the base models and evaluate them using cross validation
for model_name, model in models.items():
    model.fit(X_train, y_train)
    scores = cross_val_score(model, X_train, y_train, cv=skf)
    print(f"Accuracy during cross validation for BASE {model_name}: {scores.mean()}")
# Perform hyper parameter tuning on each model using grid search
# Create a dictionary of hyper parameters for each model we want to tune
svm_parameters = {'kernel':['poly', 'rbf', 'linear'], 'C':[0.1, 1, 10, 100], 'gamma':['scale', 'auto', 0.1, 1], 'degree':list(range(1, 10))}
# For knn neighbor param, we make sure it is odd to prevent ties
knn_parameters = {'n_neighbors':[i for i in range(2, 31) if i % 2 != 0], 'weights':['uniform', 'distance'], 'algorithm':['auto', 'ball_tree', 'kd_tree', 'brute'],
                  'leaf_size':[i for i in range(1, 40)], 'p':[1, 2], 'metric':['minkowski', 'euclidean', 'manhattan']}
nb_parameters = {'var_smoothing':[1e-09, 1e-08, 1e-07, 1e-06, 1e-05]}
# Create a dictionary of the parameters
parameters = {'SVM':svm_parameters, 'KNN':knn_parameters, 'Naive_Bayes':nb_parameters}
# import scoring metrics
from sklearn.metrics import accuracy_score, balanced_accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.metrics import make_scorer
scoring = {
    'accuracy': make_scorer(accuracy_score),
    'balanced_accuracy': make_scorer(balanced_accuracy_score),
    'precision': make_scorer(precision_score, average='macro'),
    'recall': make_scorer(recall_score, average='macro'),
    'f1': make_scorer(f1_score, average='macro')
}
# Loop through each model and perform hyper parameter tuning
for model_name, model in models.items():
    print(f"Performing hyper parameter tuning on {model_name}...")
    # Create a grid search object and fit it to the data to perform hyper parameter tuning
    search = GridSearchCV(estimator=model, param_grid=parameters[model_name], scoring=scoring, refit='accuracy', cv=skf, n_jobs=-1)
    # Fit the grid search object to the train data
    searchResults = search.fit(X_train, y_train)
    # Get the optimal hyper parameters and corresponding accuracy score
    print(f"Best parameters: {search.best_params_}, Best Score: {search.best_score_}")
    print("Evaluating the model on the test data...")
    bestModel = searchResults.best_estimator_
    print(bestModel)
    print(f"Test Score: {bestModel.score(X_test, y_test)}\n\n")
    # Fit the best parameters to the model
    models[model_name] = bestModel
Final test of our hyper parameter tuned models, displaying their confusion matrices
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
for model_name, model in models.items():
    # Produce a confusion matrix for the final model
    conf_matrix = confusion_matrix(y_test, model.predict(X_test))
    # Plot the confusion matrix
    sns.heatmap(conf_matrix, annot=True, cmap='Blues')
    # Set our x, y labels and title
    plt.xlabel('Predicted labels')
    plt.ylabel('True labels')
    plt.title(f'Confusion Matrix for {model_name}')
    # Display the plot
    plt.show()
1 Answer
missing review context
label = match_df['FTR']
This line makes no sense, as it will produce "NameError: name 'match_df' is not defined"; we didn't define it in previous code such as the imports.
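For next time, a single line of setup would give reviewers that context. A minimal sketch, assuming the matches live in a CSV file (the actual data source and filename are not shown in the question):
import pandas as pd

# Hypothetical setup -- the question never shows where match_df comes from.
match_df = pd.read_csv('matches.csv')   # six seasons of matches, with an 'FTR' result column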
extra temp var
Maybe do it all in one go, since we do not later refer to label?
y = np.ravel(match_df['FTR'])
I do thank you for the helpful reminder that ravel() means "flatten".
(Some other comments, like "fit scaler ... transform", just say what the code says and could be elided.)
nit, typo: "expect" --> "except"
comment could be code
# Use a test size of 0.1665 as this will give us 380 test samples which is the same as the number of matches in a season
This is a helpful comment and I thank you for it.
(Oddly, the final digit is 5 rather than 7.)
It makes an assertion about how our data relates to the real world.
Assertions are more believable when they are code instead of prose. Usually comments start out being true, but then they bit-rot as the code changes and the comments don't keep up. Consider rephrasing this as
matches_per_season = 380
test_size = matches_per_season / len(y)
assert round(test_size, 4) == 0.1667
But wait! Perhaps confusingly, perhaps conveniently, train_test_split behaves differently according to whether the parameter is in the unit interval or is a large integer.
We could more clearly convey Author's Intent by simply saying
matches_per_season = 380
assert len(y) == 6 * matches_per_season # dataset covers six seasons
..., ..., ..., ... = train_test_split(X, y, test_size=matches_per_season, ... )
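As a quick sanity check of that claim, here is a sketch on synthetic data standing in for the match table (not part of your pipeline itself):
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(2280).reshape(-1, 1)
y_demo = np.zeros(2280)

# Fractional test_size: sklearn computes ceil(0.1665 * 2280) == 380 rows for us.
_, X_test_frac, _, _ = train_test_split(X_demo, y_demo, test_size=0.1665, shuffle=False)
# Integer test_size: "exactly this many rows", no arithmetic on our part.
_, X_test_int, _, _ = train_test_split(X_demo, y_demo, test_size=380, shuffle=False)

assert len(X_test_frac) == len(X_test_int) == 380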
magic number
skf = StratifiedKFold(n_splits=10)
Number of splits would have defaulted to 5, a natural fit for the five seasons of training data you're cross-validating on. Splitting into the {first, second} half of each season seems arbitrary, and worth a # comment.
On the bright side, at least we're using a multiple of five.
My maintenance concern is that someone may change this to, say, 8, and then observe a bigger effect than anticipated, puzzling them.
Had you shuffled the time series of scores up top, none of these concerns would be relevant here. But having decided to preserve the time series, that colors how we look at these subsequent pipeline stages.
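In the spirit of the "comment could be code" remark above, the choice could document itself. A sketch, assuming the intent really is one fold per training season (note that StratifiedKFold balances classes rather than respecting season boundaries, so folds will only be season-sized, not season-aligned):
from sklearn.model_selection import StratifiedKFold

seasons_in_train = 5                       # five seasons remain after holding out 2022/2023
skf = StratifiedKFold(n_splits=seasons_in_train)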
step parameter
# For knn neighbor param, we make sure it is odd to prevent ties
knn_parameters = {'n_neighbors':[i for i in range(2, 31) if i % 2 != 0]
The "tie breaking" comment is helpful.
Ending the range at odd 31 is unusual, given that the final i tested will be even 30, which is rejected. Starting at 2 is similarly unusual, and does not aid human cognition.
range takes a 3rd parameter. Prefer:
[i for i in range(3, 30, 2)]
which of course is simply
list(range(3, 30, 2))
The grid searching could be better motivated. As stated it looks like "throw stuff at the wall to see what sticks".
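For example, a smaller grid whose comments explain why each range was chosen reads less like wall-throwing. A sketch with hypothetical value choices, not a recommendation for these particular data:
# Separate sub-grids so 'degree' is only searched where it matters ('poly'),
# and C / gamma are swept on a coarse logarithmic scale first.
svm_parameters = [
    {'kernel': ['rbf'], 'C': [0.1, 1, 10, 100], 'gamma': ['scale', 0.01, 0.1, 1]},
    {'kernel': ['linear'], 'C': [0.1, 1, 10, 100]},
    {'kernel': ['poly'], 'C': [1, 10], 'degree': [2, 3], 'gamma': ['scale']},
]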
extract helpers
Each time you write a helpful comment like this:
# Loop through each model and perform hyper parameter tuning
it suggests that you might have written something like def tune_hyperparameters(). The biggest advantage of extracting such helpers is that all their local variables go out-of-scope when they return, thereby reducing coupling and the cognitive load that comes from juggling all those global variables.
Clearly showing what is fed into a function, and what it produces,
is helpful for future maintenance engineers.
It will also give you a place to add a """docstring""". And when a function morphs into doing something a little different, a function name that "lies" is more likely to be updated to tell the truth, compared with prose trying to tell the same story.
It also admits of unit testing.
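A sketch of what such a helper might look like, reusing names from the question (the exact signature and docstring are of course up to you):
from sklearn.model_selection import GridSearchCV

def tune_hyperparameters(model, param_grid, X_train, y_train, cv, scoring):
    """Grid-search `model` over `param_grid` and return the refit best estimator."""
    search = GridSearchCV(estimator=model, param_grid=param_grid,
                          scoring=scoring, refit='accuracy', cv=cv, n_jobs=-1)
    search.fit(X_train, y_train)
    print(f"Best parameters: {search.best_params_}, Best Score: {search.best_score_}")
    return search.best_estimator_

# The tuning loop then shrinks to:
# for model_name, model in models.items():
#     models[model_name] = tune_hyperparameters(model, parameters[model_name],
#                                               X_train, y_train, skf, scoring)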
choosing identifiers
searchResults = ...
bestModel = ...
PEP 8 asks that you spell these search_results and best_model.
We preserved causal (chronological) order among example rows, but it's not clear that any of the models benefit from that.
Kudos for labeling your axes!
This ML exercise hews pretty closely to standard textbook formats, and achieves its design goals.
I would be willing to delegate or assign maintenance tasks on this code.
- Thank you for the help, I will go ahead and adjust what you recommend. In terms of the overall process, does it seem correct? e.g. the way I am training and testing my code? There seem to be a million different answers and methods for structuring a machine learning project, and as I don't have much previous experience I'm not sure if I'm making a stupid mistake. Also, regarding the "magic number" section, are you suggesting a cross fold of 5 would be better in my scenario than 10? – pastybake2002, Feb 5, 2024 at 17:10
- I was saying the non-default setting wasn't motivated by anything in the Review Context nor by any comments in the code, so it had me scratching my head why 10 is somehow "better" than 5. // Does it mostly seem "correct", modulo various critiques? Yes, it does; it seems close to standard textbooks and web tutorials. In particular I didn't notice any data leakage of "test" rows into the "train" dataset, something I was nervous about as I reviewed the hyper-parameter tuning and scoring code. We could have better test/train separation if we had more helper functions --> fewer globals. – J_H, Feb 5, 2024 at 17:27