I am trying to optimize the following code, as you may see I am quite new to python but I am trying to improve, any kind of tip or code changes is welcome. I am trying to optimise this module by keeping it as short as possible but still understandable for everyone.
from copy import deepcopy
import pandas as pd
from sklearn import ensemble
from sklearn.externals import joblib
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from draw_benchmark_graphs import draw_benchmark_graphs
def preparate_dataset():
# Load the dataset.
dataframe = pd.read_csv('datasets/eggs_data_set.csv')
features_df = deepcopy(dataframe)
# Remove the wanted dataset output from the features-list.
del features_df['days_remaning']
# Create the Features(X) & Output(Y) arrays.
features = features_df.as_matrix()
output = dataframe['days_remaning'].as_matrix()
# Split the data in a 70% to 30% ratio,the 30% will be kept hidden for testing porpuses.
x_train, x_test, y_train, y_test = train_test_split(
features, output, test_size=0.3)
return [x_train, y_train, x_test, y_test]
def generate_new_model():
dataset = preparate_dataset()
x_train = dataset[0]
y_train = dataset[1]
x_test = dataset[2]
y_test = dataset[3]
# Create & fit a GB model using GridSearch optimised parameters.
params = {
'learning_rate': 0.01,
'loss': 'ls',
'max_depth': 4,
'max_features': 0.3,
'min_samples_leaf': 9,
'n_estimators': 500
}
model = ensemble.GradientBoostingRegressor(**params)
model.fit(x_train, y_train)
# Find the error rate on the training set.
mae_train = mean_absolute_error(y_train, model.predict(x_train))
print('Training Set Mean Absolute Error: %.4f' % mae_train)
# Find the error rate on the test set.
mae_test = mean_absolute_error(y_test, model.predict(x_test))
print('Test Set Mean Absolute Error: %.4f' % mae_test)
# Save the trained model to a file for later use,
# otherwise we will need to train a new model everytime we want to use it.
joblib.dump(model, 'models/trained_eggs_classifier_model.pkl')
draw_benchmark_graphs(model, params, x_test, y_test, mae_train, mae_test)
return (model, (mae_train, mae_test))
generate_new_model()
1 Answer 1
I can't give you any advice on performance but here are some notes on coding style:
- There is no need to add comments to obvious things. You can safely delete all of them and it won't hurt readability. But you could make a docstring from
# Create & fit a GB model using GridSearch optimised parameters.
Instead of
generate_new_model()
in the end of your script write this:if __name__ == '__main__': generate_new_model()
You can read about it for example here.
You have some hard-coded paths and magic numbers. How about taking them to function's signatures? You also could use type hints.
It is a common practice to name dataframes simply as
df
instead ofdataframe
. But it's not really critical.Instead of
deepcopy
you could use acopy
method, and adrop
method instead ofdel
.Why using
as_matrix()
? As docs say,train_test_split
accepts pandas dataframes.Instead of writing:
x_train, x_test, y_train, y_test = train_test_split( features, output, test_size=0.3) return [x_train, y_train, x_test, y_test]
You can write just:
return train_test_split(features, output, test_size=0.3)
and change the order in the place where you call this function.
Instead of:
dataset = prepare_dataset() x_train = dataset[0] x_test = dataset[1] y_train = dataset[2] y_test = dataset[3]
you can write:
x_train, x_test, y_train, y_test = prepare_dataset()
Personally, I prefer dict constructors:
params = dict(learning_rate=0.01, loss='ls', max_depth=4, max_features=0.3, min_samples_leaf=9, n_estimators=500)
It took me some time to understand the meaning of
mae
. Can we just name those variables astrain_error
andtest_error
?I suggest adding keyword arguments to your call of
draw_benchmark_graphs
, so it would look something like this:draw_benchmark_graphs(model=model, parameters=params, x_test=x_test, y_test=y_test, train_error=train_error, test_error=test_error)
Also
from draw_benchmark_graphs import draw_benchmark_graphs
looks strange. Choose better names. How aboutimport draw
and call the function likedraw.benchmark_graphs(...)
?You have redundant parentheses in
return
. Also, why do you pack errors in a tuple? Flat is better than nested, so make yourreturn
like this:return model, train_error, test_error
Think of splitting your script to more functions. How you do it depends on your workflow. It's just your function is called
generate_new_model
but what you actually do is: reading csv file to dataframe, splitting it to samples(or how you call it), instantiating and fitting a model, calculating and printing errors, saving model in a file and drawing a graph. Also, you return data consists of different types which I think is not a good thing.
Here is how it can look like in the end. Except the fact that I didn't include docstrings and didn't try to split it to smaller functions.
from typing import (Tuple,
List)
import pandas as pd
from sklearn import ensemble
from sklearn.externals import joblib
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
import draw
def generate_new_model(
path: str = 'datasets/eggs_data_set.csv',
*,
save_path: str = 'models/trained_eggs_classifier_model.pkl',
output_column: str = 'days_remaining',
test_size: float = 0.3,
learning_rate: float = 0.01,
loss: str = 'ls',
max_depth: int = 4,
max_features: float = 0.3,
min_samples_leaf: int = 9,
n_estimators: int = 500,
**kwargs) -> Tuple[ensemble.GradientBoostingRegressor, float, float]:
(x_train, x_test,
y_train, y_test) = prepare_data(path,
output_column=output_column,
test_size=test_size)
model_parameters = dict(learning_rate=learning_rate,
loss=loss,
max_depth=max_depth,
max_features=max_features,
min_samples_leaf=min_samples_leaf,
n_estimators=n_estimators,
**kwargs)
model = ensemble.GradientBoostingRegressor(**model_parameters)
model.fit(x_train, y_train)
train_error = mean_absolute_error(y_train, model.predict(x_train))
test_error = mean_absolute_error(y_test, model.predict(x_test))
print(f'Training Set Mean Absolute Error: {train_error:.4f}')
print(f'Test Set Mean Absolute Error: {test_error:.4f}')
joblib.dump(model, save_path)
draw.benchmark_graphs(model=model,
model_parameters=model_parameters,
x_test=x_test,
y_test=y_test,
train_error=train_error,
test_error=test_error)
return model, train_error, test_error
def prepare_data(path: str,
*,
output_column: str,
test_size: float) -> List[pd.DataFrame]:
df = pd.read_csv(path)
features = df.drop(output_column, axis=1)
output = df[output_column]
return train_test_split(features,
output,
test_size=test_size)
if __name__ == '__main__':
generate_new_model()
-
\$\begingroup\$ Hey, first of all, I should like to thank you for the time and effort you put in your answer, it gave me a lot of new insights. I am however wondering why you declare the var datatype? Is it only for readability? @Georgy \$\endgroup\$Vera Perrone– Vera Perrone2017年12月05日 15:44:48 +00:00Commented Dec 5, 2017 at 15:44
-
\$\begingroup\$ Also, what does the single * means? \$\endgroup\$Vera Perrone– Vera Perrone2017年12月05日 16:04:40 +00:00Commented Dec 5, 2017 at 16:04
-
\$\begingroup\$ @VeraPerrone I'm happy to help! If you mean type specifications in function's signatures then yes, this is just to be explicit about what type of data you want to pass to a function and return from it. So, it won't change your algorithm or anything :) Also, some IDEs can hint you if you pass wrong type. E.g.:
generate_new_model(path=2)
would show you:Expected type 'str', got 'int' instead
\$\endgroup\$Georgy– Georgy2017年12月05日 16:17:29 +00:00Commented Dec 5, 2017 at 16:17 -
\$\begingroup\$ @VeraPerrone Single
*
in function's signature means that all the arguments that go after it are keyword arguments. So, for example without asterisk you could callprepare_data('some_path', 'days_remaining', 0.3)
but in my example you have to call it likeprepare_data('some_path', output_column='days_remaining', test_size=0.3)
\$\endgroup\$Georgy– Georgy2017年12月05日 16:25:28 +00:00Commented Dec 5, 2017 at 16:25 -