Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

bamine/stacker

Folders and files

NameName
Last commit message
Last commit date

Latest commit

History

23 Commits

Repository files navigation

stacker

Build Status Coverage Status

Stacker is a library for automating repetitive tasks in a Machine Learning process. It is specially thought for competitive Machine Learning.

Getting Started

Installing package

To install the package (as well as the requirements)

pip install git+https://github.com/bamine/stacker.git

Optimization

Let's start with a toy classification task

In [1]: from sklearn import datasets
In [2]: X, y = datasets.make_classification()

Next we need to define outTask instance, that describes the problem at hand and the evaluation method

In [3]: from stacker.optimization.task import Task
In [4]: task = Task(name="my_test_problem", X=X, y=y, task="classification", test_size=0.1, random_state=42)

We also need to define a scorer to evaluate the performance of our models, it will represent the function to minimize in our optimization

In [5]: from stacker.optimization.scorer import Scorer
In [6]: from sklearn.metrics import roc_auc_score
In [7]: scorer = Scorer(name="auc_error", scoring_function=lambda y_pred, y_true: 1 - roc_auc_score(y_pred, y_true))

That's it, that's all we need to start our optimization

In [8]: from stacker.optimization.optimizers import XGBoostOptimizer
In [9]: optimizer = XGBoostOptimizer(task=task, scorer=scorer)
In [10]: best = optimizer.start_optimization(max_evals=10)

We see optimization logs printing to the screen:

2016年10月30日 20:06:58,908 optimization INFO Started optimization for task Task=my_test_problem - type=classification - validation_method=train_test_split - len(X)=100
2016年10月30日 20:06:58,940 hyperopt.tpe INFO tpe_transform took 0.010630 seconds
2016年10月30日 20:06:58,940 hyperopt.tpe INFO TPE using 0 trials
2016年10月30日 20:06:58,944 optimization INFO Evaluating with test size 0.1 with parameters {'base_score': 0.15131750274218725, 'n_estimators': 1549, 'max_delta_step': 5.456151637154244, 'max_depth': 32, 'scale_pos_weight': 17.01641396909963, 'subsample': 0.5602243542512042, 'min_child_weight': 7.357276709287346, 'reg_lambda': 0.04892385490528424, 'colsample_bylevel': 0.9270255573832653, 'reg_alpha': 9.241575982835831, 'learning_rate': 0.334, 'gamma': 9.204287327296093, 'colsample_bytree': 0.7401493524610046}
2016年10月30日 20:06:58,944 optimization INFO Training model ...
2016年10月30日 20:06:59,445 optimization INFO Training model done !
2016年10月30日 20:06:59,449 optimization INFO Score = 0.5

After max_evals round of optimization, the variable best holds the best parameters for our model so far:

In [11]: best
Out[11]:
{'base_score': 0.11896730519023568,
 'colsample_bylevel': 0.7784020826729505,
 'colsample_bytree': 0.5406725444519914,
 'gamma': 2.8852788849467914,
 'learning_rate': 0.0458,
 'max_delta_step': 7.94440873152742,
 'max_depth': 40.54677208316948,
 'min_child_weight': 1.4917791224018282,
 'n_estimators': 4098.139372543198,
 'reg_alpha': 5.145060179230331,
 'reg_lambda': 4.275665709200242,
 'scale_pos_weight': 92.51593088136909,
 'subsample': 0.30804272389187076}

To see the performance of our best parameters:

In [16]: optimizer.score(best)
2016-10-30 20:09:58,918 optimization INFO Evaluating with test size 0.1 with parameters {'base_score': 0.11896730519023568, 'n_estimators': 4098, 'max_delta_step': 7.94440873152742, 'max_depth': 40, 'scale_pos_weight': 92.51593088136909, 'subsample': 0.30804272389187076, 'min_child_weight': 1.4917791224018282, 'reg_lambda': 4.275665709200242, 'colsample_bylevel': 0.7784020826729505, 'reg_alpha': 5.145060179230331, 'learning_rate': 0.0458, 'gamma': 2.8852788849467914, 'colsample_bytree': 0.5406725444519914}
2016-10-30 20:09:58,918 optimization INFO Training model ...
2016-10-30 20:10:00,227 optimization INFO Training model done !
2016-10-30 20:10:00,228 optimization INFO Score = 0.0
Out[16]: {'loss': 0.0, 'status': 'ok'}

Saving progress

File Logger

You can save the progress of the optimization process to flat files in json format, by creating a FileLogger object:

In [17]: from stacker.optimization.loggers import FileLogger
In [18]: file_logger = FileLogger(task=task)
In [19]: optimizer = XGBoostOptimizer(task=task, scorer=scorer, logger=file_logger)
In [20]: best = optimizer.start_optimization(max_evals=10)

The optimization progress logs are in a file called "tast.name".log which in out case is my_test_problem.log. Each line contains a json object which stores useful information.

{"scorer_name": "auc_error", 
"score": 0.0, 
"validation_method": "train_test_split", 
"predictions": [0.9586731195449829, 0.9586731195449829, ...], 
"model": "XGBModel(base_score=0.3547331680585909, ...)", 
"parameters": {"base_score": 0.3547331680585909, "colsample_bylevel": ...}, 
"task": "my_test_problem", 
"random_state": 42}

Predictions (on the holdout set for train-test splits and on each folds for cv validation) are also stored.

DB Logger

You can also store the optimization progress into a database. First we create an SQLAlchemy engine object.

In [21]: from sqlalchemy import create_engine
In [22]: engine = create_engine('sqlite:///test.db')

Then we create our DBLogger instance:

In [23]: from stacker.optimization.loggers import DBLogger
In [24]: db_logger = DBLogger(task=task, engine=engine)
In [25]: optimizer = XGBoostOptimizer(task=task, scorer=scorer, opt_logger=db_logger)
In [26]: best = optimizer.start_optimization(max_evals=10)

We inspect the contents of our database:

sqlite3 test.db
sqlite> .tables
my_test_problem_opt_table
sqlite> select * from my_test_problem_opt_table limit 1;
task|date_time|model|parameters|score|scorer_name|validation_method|predictions|random_state
my_test_problem|2016年10月30日 21:15:21.338861|XGBModel(base_score=0.4987515735616427, colsample_bylevel=0.9537358268329221,
 colsample_bytree=0.5540367970123756, gamma=3.9642149376691425,
 learning_rate=0.060700000000000004, max_delta_step=6.416649245434087,
 max_depth=39, min_child_weight=8.653862671384827, missing=None,
 n_estimators=2370, nthread=-1, objective='reg:linear',
 reg_alpha=17.94451568455059, reg_lambda=4.289513035748907,
 scale_pos_weight=48.14895822783563, seed=0, silent=True,
 subsample=0.9218555541314193)|{"base_score": 0.4987515735616427, ...

About

Utilities for competitive Machine Learning

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

AltStyle によって変換されたページ (->オリジナル) /