This is a follow-up question to My last question. From then on, I think that I have improved quite a lot thanks to a large amount of feedback I received from my last post. This will be the last time I ask for feedback based on this code. I am looking for any feedback on best practices or coding style but to be honest any kind of tip, information or feedback is greatly accepted. This is my first real python module so I can imagine there is quite a bit of strange or simply wrong stuff.
import os.path
import math
from typing import (Tuple, List)
import pandas as pd
import numpy as np
from sklearn import ensemble
from sklearn.externals import joblib
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
def prepare_dataset(path: str) -> List[pd.DataFrame]:
df = pd.read_csv(path)
features = df.drop('days_remaning', axis=1)
output = df['days_remaning']
return train_test_split(features, output, test_size=0.3)
class Model():
def __init__(self, product: str) -> None:
self.product = product
self.save_path = f'models/trained_{product}_classifier_model.pkl'
(self.x_train, self.x_test, self.y_train, self.y_test) = prepare_dataset(
f'datasets/{product}_dataset.csv')
if os.path.isfile(self.save_path):
print(f'Loading Local Model, product: {product}.')
self.model = joblib.load(self.save_path)
else:
print(f'Creating New Model, product: {product}.')
self.model = self.create_model()
def create_model(self) -> ensemble.GradientBoostingRegressor:
'''Create & fit a GB model using GridSearch optimised parameters.'''
params = dict(
learning_rate=0.01,
loss='ls',
max_depth=4,
max_features=0.3,
min_samples_leaf=9,
n_estimators=500
)
model = ensemble.GradientBoostingRegressor(**params)
model.fit(self.x_train, self.y_train)
return model
def use_model(self, data: List[List[int]]) -> np.ndarray:
return self.model.predict(data)
def print_model_output(self, data: List[List[int]]) -> None:
for output in self.use_model(data):
print('In about {} to {} days you will need to buy new {}.'.format(
math.floor(output), math.ceil(output), self.product))
def save_model(self) -> None:
joblib.dump(self.model, self.save_path)
def calculate_error_rates(self) -> Tuple[float]:
train_error = mean_absolute_error(
self.y_train, self.model.predict(self.x_train))
test_error = mean_absolute_error(
self.y_test, self.model.predict(self.x_test))
return train_error, test_error
def print_error_rates(self) -> None:
(train_error, test_error) = self.calculate_error_rates()
print(f'Training Set Mean Absolute Error: {train_error:.4f}')
print(f'Test Set Mean Absolute Error: {test_error:.4f}')
def print_features_importance(self):
features_labels = np.array(['days_after_restock', 'left_content'])
importance = self.model.feature_importances_
features_indexes_by_importance = importance.argsort()
for feature in features_indexes_by_importance:
print(
f'{features_labels[feature]} - {importance[feature] * 100.0:.2f}%')
if __name__ == '__main__':
EGGS = Model('eggs')
EGGS.print_error_rates()
EGGS.print_features_importance()
EGGS.print_model_output([[2, 7], [7, 2]])
1 Answer 1
General: docstrings should explain the parameters.
features = df.drop('days_remaning', axis=1)
output = df['days_remaning']
You really ought to put 'days_remaning'
into its own variable, declared either via a constant or read in from a configuration file, so that if the field name in your csv changes, you don't have to make the change in multiple places. Similar feeling for the file paths.
print(f'Loading Local Model, product: {product}.')
This is not output of the program, but just logging its progress. So you should not be using print()
but instead the logging
module. That way you can configure where the logging goes (and if it's shown at all).
if os.path.isfile(self.save_path):
print(f'Loading Local Model, product: {product}.')
self.model = joblib.load(self.save_path)
else:
print(f'Creating New Model, product: {product}.')
self.model = self.create_model()
This seems odd to me. If the file already exists you load it, but if it doesn't exist you save it? This is actually a race condition since whoever is running this code doesn't know what behavior will happen, based on whether the file exists or is deleted. However I'm not sure what should be done here instead.
params = dict(
learning_rate=0.01,
loss='ls',
max_depth=4,
max_features=0.3,
min_samples_leaf=9,
n_estimators=500
)
The dict
initializer syntax is more pythonic, and may save a function call.
params = {
learning_rate : 0.01,
loss : 'ls',
max_depth : 4,
max_features : 0.3,
min_samples_leaf : 9,
n_estimators : 500
}
but why are you even creating the params
variable at all? Just call the function with named parameters!
def print_model_output(self, data: List[List[int]]) -> None:
for output in self.use_model(data):
print('In about {} to {} days you will need to buy new {}.'.format(
math.floor(output), math.ceil(output), self.product))
This doesn't make sense. self.product
won't change through this loop. I suspect there shouldn't be a loop at all, as there's never more than one product / output.
def calculate_error_rates(self) -> Tuple[float]:
train_error = mean_absolute_error(
self.y_train, self.model.predict(self.x_train))
test_error = mean_absolute_error(
self.y_test, self.model.predict(self.x_test))
return train_error, test_error
This code strongly suggests that instead of two variables y_whatever
and x_whatever
you should instead have a 2-element tuple called whatever
. Then you could get rid of some of this duplicated code.
features_labels = np.array(['days_after_restock', 'left_content'])
Why an np.array
here? What's wrong with a regular list?
-
\$\begingroup\$ The if/else statement uses a local model if available otherwise it will create a new one. And the loop in print_model_output is needed because I can pass multiple Lists to it for example: EGGS.print_model_output([[2, 7], [7, 2]]). @Snowbody \$\endgroup\$Vera Perrone– Vera Perrone2017年12月08日 14:29:34 +00:00Commented Dec 8, 2017 at 14:29
-
\$\begingroup\$ But @VeraPerrone it you pass multiple lists the output will not make sense! You'll need eggs between 5 and 6 days, as well as between 2 and 3 days! And there's no way to connect these back to the inputs! \$\endgroup\$Snowbody– Snowbody2017年12月08日 19:12:54 +00:00Commented Dec 8, 2017 at 19:12
-
\$\begingroup\$ You are right that it should not make sense... I guess you can't have a different amount of items at the same time. lol @Snowbody \$\endgroup\$Vera Perrone– Vera Perrone2017年12月08日 19:18:47 +00:00Commented Dec 8, 2017 at 19:18
-
\$\begingroup\$ The only way it would make sense is if you passed a list of tuples where the first element is the product and the second is the usage history. Then extract the appropriate element from each tuple. \$\endgroup\$Snowbody– Snowbody2017年12月08日 19:39:47 +00:00Commented Dec 8, 2017 at 19:39
Explore related questions
See similar questions with these tags.