Python/Sklearn code optimization (New at Python)

Question 1

I am trying to optimize the following code, as you may see I am quite new to python but I am trying to improve, any kind of tip or code changes is welcome. I am trying to optimise this module by keeping it as short as possible but still understandable for everyone.

from copy import deepcopy
import pandas as pd
from sklearn import ensemble
from sklearn.externals import joblib
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from draw_benchmark_graphs import draw_benchmark_graphs
def preparate_dataset():
 # Load the dataset.
 dataframe = pd.read_csv('datasets/eggs_data_set.csv')
 features_df = deepcopy(dataframe)
 # Remove the wanted dataset output from the features-list.
 del features_df['days_remaning']
 # Create the Features(X) & Output(Y) arrays.
 features = features_df.as_matrix()
 output = dataframe['days_remaning'].as_matrix()
 # Split the data in a 70% to 30% ratio,the 30% will be kept hidden for testing porpuses.
 x_train, x_test, y_train, y_test = train_test_split(
 features, output, test_size=0.3)
 return [x_train, y_train, x_test, y_test]
def generate_new_model():
 dataset = preparate_dataset()
 x_train = dataset[0]
 y_train = dataset[1]
 x_test = dataset[2]
 y_test = dataset[3]
 # Create & fit a GB model using GridSearch optimised parameters.
 params = {
 'learning_rate': 0.01,
 'loss': 'ls',
 'max_depth': 4,
 'max_features': 0.3,
 'min_samples_leaf': 9,
 'n_estimators': 500
 }
 model = ensemble.GradientBoostingRegressor(**params)
 model.fit(x_train, y_train)
 # Find the error rate on the training set.
 mae_train = mean_absolute_error(y_train, model.predict(x_train))
 print('Training Set Mean Absolute Error: %.4f' % mae_train)
 # Find the error rate on the test set.
 mae_test = mean_absolute_error(y_test, model.predict(x_test))
 print('Test Set Mean Absolute Error: %.4f' % mae_test)
 # Save the trained model to a file for later use,
 # otherwise we will need to train a new model everytime we want to use it.
 joblib.dump(model, 'models/trained_eggs_classifier_model.pkl')
 draw_benchmark_graphs(model, params, x_test, y_test, mae_train, mae_test)
 return (model, (mae_train, mae_test))
generate_new_model()

Question 2

I can't give you any advice on performance but here are some notes on coding style:

There is no need to add comments to obvious things. You can safely delete all of them and it won't hurt readability. But you could make a docstring from # Create & fit a GB model using GridSearch optimised parameters.
Instead of generate_new_model() in the end of your script write this:
```
if __name__ == '__main__':
 generate_new_model()
```
You can read about it for example here.
You have some hard-coded paths and magic numbers. How about taking them to function's signatures? You also could use type hints.
It is a common practice to name dataframes simply as df instead of dataframe. But it's not really critical.
Instead of deepcopy you could use a copy method, and a drop method instead of del.
Why using as_matrix()? As docs say, train_test_split accepts pandas dataframes.

Instead of writing:

x_train, x_test, y_train, y_test = train_test_split(
 features, output, test_size=0.3)
return [x_train, y_train, x_test, y_test]

You can write just:

return train_test_split(features, output, test_size=0.3)

and change the order in the place where you call this function.

Instead of:

dataset = prepare_dataset()
x_train = dataset[0]
x_test = dataset[1]
y_train = dataset[2]
y_test = dataset[3]

you can write:

x_train, x_test, y_train, y_test = prepare_dataset()

Personally, I prefer dict constructors:

params = dict(learning_rate=0.01, 
 loss='ls',
 max_depth=4,
 max_features=0.3,
 min_samples_leaf=9,
 n_estimators=500)

It took me some time to understand the meaning of mae. Can we just name those variables as train_error and test_error?
I suggest adding keyword arguments to your call of draw_benchmark_graphs, so it would look something like this:
```
draw_benchmark_graphs(model=model,
 parameters=params, 
 x_test=x_test,
 y_test=y_test,
 train_error=train_error,
 test_error=test_error)
```
Also from draw_benchmark_graphs import draw_benchmark_graphs looks strange. Choose better names. How about import draw and call the function like draw.benchmark_graphs(...)?
You have redundant parentheses in return. Also, why do you pack errors in a tuple? Flat is better than nested, so make your return like this:
```
return model, train_error, test_error
```
Think of splitting your script to more functions. How you do it depends on your workflow. It's just your function is called generate_new_model but what you actually do is: reading csv file to dataframe, splitting it to samples(or how you call it), instantiating and fitting a model, calculating and printing errors, saving model in a file and drawing a graph. Also, you return data consists of different types which I think is not a good thing.

Here is how it can look like in the end. Except the fact that I didn't include docstrings and didn't try to split it to smaller functions.

from typing import (Tuple,
 List)
import pandas as pd
from sklearn import ensemble
from sklearn.externals import joblib
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
import draw
def generate_new_model(
 path: str = 'datasets/eggs_data_set.csv',
 *,
 save_path: str = 'models/trained_eggs_classifier_model.pkl',
 output_column: str = 'days_remaining',
 test_size: float = 0.3,
 learning_rate: float = 0.01,
 loss: str = 'ls',
 max_depth: int = 4,
 max_features: float = 0.3,
 min_samples_leaf: int = 9,
 n_estimators: int = 500,
 **kwargs) -> Tuple[ensemble.GradientBoostingRegressor, float, float]:
 (x_train, x_test,
 y_train, y_test) = prepare_data(path,
 output_column=output_column,
 test_size=test_size)
 model_parameters = dict(learning_rate=learning_rate,
 loss=loss,
 max_depth=max_depth,
 max_features=max_features,
 min_samples_leaf=min_samples_leaf,
 n_estimators=n_estimators,
 **kwargs)
 model = ensemble.GradientBoostingRegressor(**model_parameters)
 model.fit(x_train, y_train)
 train_error = mean_absolute_error(y_train, model.predict(x_train))
 test_error = mean_absolute_error(y_test, model.predict(x_test))
 print(f'Training Set Mean Absolute Error: {train_error:.4f}')
 print(f'Test Set Mean Absolute Error: {test_error:.4f}')
 joblib.dump(model, save_path)
 draw.benchmark_graphs(model=model,
 model_parameters=model_parameters,
 x_test=x_test,
 y_test=y_test,
 train_error=train_error,
 test_error=test_error)
 return model, train_error, test_error
def prepare_data(path: str,
 *,
 output_column: str,
 test_size: float) -> List[pd.DataFrame]:
 df = pd.read_csv(path)
 features = df.drop(output_column, axis=1)
 output = df[output_column]
 return train_test_split(features,
 output,
 test_size=test_size)
if __name__ == '__main__':
 generate_new_model()

Question 3

Hey, first of all, I should like to thank you for the time and effort you put in your answer, it gave me a lot of new insights. I am however wondering why you declare the var datatype? Is it only for readability? @Georgy

Question 4

Also, what does the single * means?

Question 5

@VeraPerrone I'm happy to help! If you mean type specifications in function's signatures then yes, this is just to be explicit about what type of data you want to pass to a function and return from it. So, it won't change your algorithm or anything :) Also, some IDEs can hint you if you pass wrong type. E.g.: generate_new_model(path=2) would show you: Expected type 'str', got 'int' instead

Question 6

@VeraPerrone Single * in function's signature means that all the arguments that go after it are keyword arguments. So, for example without asterisk you could call prepare_data('some_path', 'days_remaining', 0.3) but in my example you have to call it like prepare_data('some_path', output_column='days_remaining', test_size=0.3)

Question 7

@VeraPerrone In the last comment it should be 'keyword-only' instead of 'keyword'. Example

Georgy Georgy 1,9972 gold badges15 silver badges27 bronze badges · Accepted Answer · 2017-12-05 15:06:35Z

I can't give you any advice on performance but here are some notes on coding style:

There is no need to add comments to obvious things. You can safely delete all of them and it won't hurt readability. But you could make a docstring from # Create & fit a GB model using GridSearch optimised parameters.
Instead of generate_new_model() in the end of your script write this:
```
if __name__ == '__main__':
 generate_new_model()
```
You can read about it for example here.
You have some hard-coded paths and magic numbers. How about taking them to function's signatures? You also could use type hints.
It is a common practice to name dataframes simply as df instead of dataframe. But it's not really critical.
Instead of deepcopy you could use a copy method, and a drop method instead of del.
Why using as_matrix()? As docs say, train_test_split accepts pandas dataframes.

Instead of writing:

x_train, x_test, y_train, y_test = train_test_split(
 features, output, test_size=0.3)
return [x_train, y_train, x_test, y_test]

You can write just:

return train_test_split(features, output, test_size=0.3)

and change the order in the place where you call this function.

Instead of:

dataset = prepare_dataset()
x_train = dataset[0]
x_test = dataset[1]
y_train = dataset[2]
y_test = dataset[3]

you can write:

x_train, x_test, y_train, y_test = prepare_dataset()

Personally, I prefer dict constructors:

params = dict(learning_rate=0.01, 
 loss='ls',
 max_depth=4,
 max_features=0.3,
 min_samples_leaf=9,
 n_estimators=500)

It took me some time to understand the meaning of mae. Can we just name those variables as train_error and test_error?
I suggest adding keyword arguments to your call of draw_benchmark_graphs, so it would look something like this:
```
draw_benchmark_graphs(model=model,
 parameters=params, 
 x_test=x_test,
 y_test=y_test,
 train_error=train_error,
 test_error=test_error)
```
Also from draw_benchmark_graphs import draw_benchmark_graphs looks strange. Choose better names. How about import draw and call the function like draw.benchmark_graphs(...)?
You have redundant parentheses in return. Also, why do you pack errors in a tuple? Flat is better than nested, so make your return like this:
```
return model, train_error, test_error
```
Think of splitting your script to more functions. How you do it depends on your workflow. It's just your function is called generate_new_model but what you actually do is: reading csv file to dataframe, splitting it to samples(or how you call it), instantiating and fitting a model, calculating and printing errors, saving model in a file and drawing a graph. Also, you return data consists of different types which I think is not a good thing.

Here is how it can look like in the end. Except the fact that I didn't include docstrings and didn't try to split it to smaller functions.

from typing import (Tuple,
 List)
import pandas as pd
from sklearn import ensemble
from sklearn.externals import joblib
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
import draw
def generate_new_model(
 path: str = 'datasets/eggs_data_set.csv',
 *,
 save_path: str = 'models/trained_eggs_classifier_model.pkl',
 output_column: str = 'days_remaining',
 test_size: float = 0.3,
 learning_rate: float = 0.01,
 loss: str = 'ls',
 max_depth: int = 4,
 max_features: float = 0.3,
 min_samples_leaf: int = 9,
 n_estimators: int = 500,
 **kwargs) -> Tuple[ensemble.GradientBoostingRegressor, float, float]:
 (x_train, x_test,
 y_train, y_test) = prepare_data(path,
 output_column=output_column,
 test_size=test_size)
 model_parameters = dict(learning_rate=learning_rate,
 loss=loss,
 max_depth=max_depth,
 max_features=max_features,
 min_samples_leaf=min_samples_leaf,
 n_estimators=n_estimators,
 **kwargs)
 model = ensemble.GradientBoostingRegressor(**model_parameters)
 model.fit(x_train, y_train)
 train_error = mean_absolute_error(y_train, model.predict(x_train))
 test_error = mean_absolute_error(y_test, model.predict(x_test))
 print(f'Training Set Mean Absolute Error: {train_error:.4f}')
 print(f'Test Set Mean Absolute Error: {test_error:.4f}')
 joblib.dump(model, save_path)
 draw.benchmark_graphs(model=model,
 model_parameters=model_parameters,
 x_test=x_test,
 y_test=y_test,
 train_error=train_error,
 test_error=test_error)
 return model, train_error, test_error
def prepare_data(path: str,
 *,
 output_column: str,
 test_size: float) -> List[pd.DataFrame]:
 df = pd.read_csv(path)
 features = df.drop(output_column, axis=1)
 output = df[output_column]
 return train_test_split(features,
 output,
 test_size=test_size)
if __name__ == '__main__':
 generate_new_model()

Hey, first of all, I should like to thank you for the time and effort you put in your answer, it gave me a lot of new insights. I am however wondering why you declare the var datatype? Is it only for readability? @Georgy
@VeraPerrone I'm happy to help! If you mean type specifications in function's signatures then yes, this is just to be explicit about what type of data you want to pass to a function and return from it. So, it won't change your algorithm or anything :) Also, some IDEs can hint you if you pass wrong type. E.g.: generate_new_model(path=2) would show you: Expected type 'str', got 'int' instead
@VeraPerrone Single * in function's signature means that all the arguments that go after it are keyword arguments. So, for example without asterisk you could call prepare_data('some_path', 'days_remaining', 0.3) but in my example you have to call it like prepare_data('some_path', output_column='days_remaining', test_size=0.3)
@VeraPerrone In the last comment it should be 'keyword-only' instead of 'keyword'. Example

Stack Exchange Network

Python/Sklearn code optimization (New at Python)

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Python/Sklearn code optimization (New at Python)

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions