Commit f6af46f

ravinkohli and nabenabe0928 authored

[ADD] Documentation for data validation and preprocessing (#323)

* Address silly issues in documentation and add data validation and preprocessing
* Fix flake
* Apply suggestions from code review
* Apply suggestions from code review
* default value doc change
* unify documentation throughout library
* Update autoPyTorch/pipeline/components/training/metrics/base.py
* Update base_task.py
* Update tabular_classification.py
* Update tabular_classification.py
* Update tabular_regression.py
* fix flake

Co-authored-by: nabenabe0928 <47781922+nabenabe0928@users.noreply.github.com>

1 parent 96de622 commit f6af46f

File tree

13 files changed: +436 -282 lines changed

‎autoPyTorch/api/base_task.py

Lines changed: 120 additions & 78 deletions
Large diffs are not rendered by default.

‎autoPyTorch/api/tabular_classification.py

Lines changed: 86 additions & 54 deletions
@@ -25,37 +25,42 @@
 class TabularClassificationTask(BaseTask):
     """
     Tabular Classification API to the pipelines.
+
     Args:
-        seed (int), (default=1): seed to be used for reproducibility.
-        n_jobs (int), (default=1): number of consecutive processes to spawn.
-        n_threads (int), (default=1):
+        seed (int: default=1):
+            seed to be used for reproducibility.
+        n_jobs (int: default=1):
+            number of consecutive processes to spawn.
+        n_threads (int: default=1):
             number of threads to use for each process.
         logging_config (Optional[Dict]):
-            specifies configuration for logging, if None, it is loaded from the logging.yaml
-        ensemble_size (int), (default=50):
+            Specifies configuration for logging, if None, it is loaded from the logging.yaml
+        ensemble_size (int: default=50):
             Number of models added to the ensemble built by
             Ensemble selection from libraries of models.
             Models are drawn with replacement.
-        ensemble_nbest (int), (default=50):
-            only consider the ensemble_nbest
+        ensemble_nbest (int: default=50):
+            Only consider the ensemble_nbest
             models to build the ensemble
-        max_models_on_disc (int), (default=50):
-            maximum number of models saved to disc.
-            Also, controls the size of the ensemble as any additional models will be deleted.
+        max_models_on_disc (int: default=50):
+            Maximum number of models saved to disc.
+            Also, controls the size of the ensemble
+            as any additional models will be deleted.
             Must be greater than or equal to 1.
         temporary_directory (str):
-            folder to store configuration output and log file
+            Folder to store configuration output and log file
         output_directory (str):
-            folder to store predictions for optional test set
+            Folder to store predictions for optional test set
         delete_tmp_folder_after_terminate (bool):
-            determines whether to delete the temporary directory, when finished
+            Determines whether to delete the temporary directory,
+            when finished
         include_components (Optional[Dict]):
-            If None, all possible components are used. Otherwise
-            specifies set of components to use.
+            If None, all possible components are used.
+            Otherwise specifies set of components to use.
         exclude_components (Optional[Dict]):
-            If None, all possible components are used. Otherwise
-            specifies set of components not to use. Incompatible
-            with include components
+            If None, all possible components are used.
+            Otherwise specifies set of components not to use.
+            Incompatible with include components.
         search_space_updates (Optional[HyperparameterSearchSpaceUpdates]):
             search space updates that can be used to modify the search
             space of particular components or choice modules of the pipeline
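The ensemble_size and ensemble_nbest entries above describe "Ensemble selection from libraries of models" where "models are drawn with replacement". As a rough illustration of that greedy procedure (a minimal sketch, not Auto-PyTorch's actual implementation; the accuracy helper and toy probabilities are assumptions):

```python
def accuracy(probs, y_true):
    """Fraction of correct 0/1 predictions after thresholding at 0.5."""
    return sum((p >= 0.5) == bool(y) for p, y in zip(probs, y_true)) / len(y_true)

def greedy_ensemble_selection(predictions, y_true, ensemble_size):
    """Greedy ensemble selection: repeatedly add, with replacement, the model
    whose inclusion most improves the averaged ensemble's accuracy."""
    chosen, current_sum = [], [0.0] * len(y_true)
    for _ in range(ensemble_size):
        best_idx, best_score = None, -1.0
        for idx, preds in enumerate(predictions):
            # Average of the current members plus this candidate.
            trial = [(s + p) / (len(chosen) + 1) for s, p in zip(current_sum, preds)]
            score = accuracy(trial, y_true)
            if score > best_score:
                best_idx, best_score = idx, score
        chosen.append(best_idx)
        current_sum = [s + p for s, p in zip(current_sum, predictions[best_idx])]
    return chosen

# Toy library: two models' predicted probabilities for four samples.
models = [[0.9, 0.9, 0.1, 0.1], [0.6, 0.4, 0.6, 0.4]]
selected = greedy_ensemble_selection(models, [1, 1, 0, 0], ensemble_size=3)
```

Because drawing is with replacement, a strong model can be selected repeatedly, which is how ensemble_size can exceed the number of distinct models.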
@@ -102,6 +107,16 @@ def __init__(
         )
 
     def build_pipeline(self, dataset_properties: Dict[str, Any]) -> TabularClassificationPipeline:
+        """
+        Build pipeline according to current task and for the passed dataset properties
+
+        Args:
+            dataset_properties (Dict[str, Any])
+
+        Returns:
+            TabularClassificationPipeline:
+                Pipeline compatible with the given dataset properties.
+        """
         return TabularClassificationPipeline(dataset_properties=dataset_properties)
 
     def search(
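The docstring added above follows the Google style (an Args section, a Returns section with the type on its own line and the description indented beneath it) that this commit unifies across the library. A hypothetical toy function in the same style (the function and its fields are invented for illustration):

```python
def build_report(dataset_properties):
    """Build a short report for the passed dataset properties.

    Args:
        dataset_properties (dict):
            Properties describing the dataset.

    Returns:
        str:
            Summary line listing the given dataset properties.
    """
    return ", ".join(f"{k}={v}" for k, v in sorted(dataset_properties.items()))
```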
@@ -143,38 +158,38 @@ def search(
             budget_type (str):
                 Type of budget to be used when fitting the pipeline.
                 It can be one of:
-                + 'epochs': The training of each pipeline will be terminated after
-                    a number of epochs have passed. This number of epochs is determined by the
-                    budget argument of this method.
-                + 'runtime': The training of each pipeline will be terminated after
-                    a number of seconds have passed. This number of seconds is determined by the
-                    budget argument of this method. The overall fitting time of a pipeline is
-                    controlled by func_eval_time_limit_secs. 'runtime' only controls the allocated
-                    time to train a pipeline, but it does not consider the overall time it takes
-                    to create a pipeline (data loading and preprocessing, other i/o operations, etc.).
-                budget_type will determine the units of min_budget/max_budget. If budget_type=='epochs'
-                is used, min_budget will refer to epochs whereas if budget_type=='runtime' then
-                min_budget will refer to seconds.
+                + `epochs`: The training of each pipeline will be terminated after
+                    a number of epochs have passed. This number of epochs is determined by the
+                    budget argument of this method.
+                + `runtime`: The training of each pipeline will be terminated after
+                    a number of seconds have passed. This number of seconds is determined by the
+                    budget argument of this method. The overall fitting time of a pipeline is
+                    controlled by func_eval_time_limit_secs. 'runtime' only controls the allocated
+                    time to train a pipeline, but it does not consider the overall time it takes
+                    to create a pipeline (data loading and preprocessing, other i/o operations, etc.).
+                budget_type will determine the units of min_budget/max_budget. If budget_type=='epochs'
+                is used, min_budget will refer to epochs whereas if budget_type=='runtime' then
+                min_budget will refer to seconds.
             min_budget (int):
-                Auto-PyTorch uses `Hyperband <https://arxiv.org/abs/1603.06560>_` to
+                Auto-PyTorch uses `Hyperband <https://arxiv.org/abs/1603.06560>`_ to
                 trade-off resources between running many pipelines at min_budget and
                 running the top performing pipelines on max_budget.
                 min_budget states the minimum resource allocation a pipeline should have
                 so that we can compare and quickly discard bad performing models.
                 For example, if the budget_type is epochs, and min_budget=5, then we will
                 run every pipeline to a minimum of 5 epochs before performance comparison.
             max_budget (int):
-                Auto-PyTorch uses `Hyperband <https://arxiv.org/abs/1603.06560>_` to
+                Auto-PyTorch uses `Hyperband <https://arxiv.org/abs/1603.06560>`_ to
                 trade-off resources between running many pipelines at min_budget and
                 running the top performing pipelines on max_budget.
                 max_budget states the maximum resource allocation a pipeline is going to
                 be run. For example, if the budget_type is epochs, and max_budget=50,
                 then the pipeline training will be terminated after 50 epochs.
-            total_walltime_limit (int), (default=100): Time limit
-                in seconds for the search of appropriate models.
+            total_walltime_limit (int: default=100):
+                Time limit in seconds for the search of appropriate models.
                 By increasing this value, autopytorch has a higher
                 chance of finding better models.
-            func_eval_time_limit_secs (int), (default=None):
+            func_eval_time_limit_secs (Optional[int]):
                 Time limit for a single call to the machine learning model.
                 Model fitting will be terminated if the machine
                 learning algorithm runs over the time limit. Set
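The min_budget/max_budget entries above are the endpoints of Hyperband's resource schedule: many pipelines start at min_budget, and only the top performers are promoted toward max_budget. A minimal sketch of one successive-halving bracket (the eta=3 halving rate and the rung arithmetic are assumptions for illustration; the real scheduling is done by the optimizer, not by this file):

```python
def successive_halving(n_configs, min_budget, max_budget, eta=3):
    """One bracket: start n_configs pipelines at min_budget, keep roughly the
    top 1/eta at each rung while multiplying the budget by eta."""
    rungs, n, b = [], n_configs, min_budget
    while b < max_budget and n >= 1:
        rungs.append((n, b))
        n //= eta   # discard all but the best 1/eta of the survivors
        b *= eta    # give the survivors eta times more budget
    rungs.append((max(n, 1), max_budget))
    return rungs

# With budget_type='epochs', min_budget=5 and max_budget=50 (the docstring's
# example values): 27 configs train 5 epochs, then 9 train 15, 3 train 45,
# and the final survivor gets the full 50 epochs.
schedule = successive_halving(27, 5, 50)
```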
@@ -185,47 +200,54 @@ def search(
                 total_walltime_limit // 2 to allow enough time to fit
                 at least 2 individual machine learning algorithms.
                 Set to np.inf in case no time limit is desired.
-            enable_traditional_pipeline (bool), (default=True):
+            enable_traditional_pipeline (bool: default=True):
                 We fit traditional machine learning algorithms
                 (LightGBM, CatBoost, RandomForest, ExtraTrees, KNN, SVM)
-                before building PyTorch Neural Networks. You can disable this
+                prior to building PyTorch Neural Networks. You can disable this
                 feature by turning this flag to False. All machine learning
                 algorithms that are fitted during search() are considered for
                 ensemble building.
-            memory_limit (Optional[int]), (default=4096):
-                Memory limit in MB for the machine learning algorithm. autopytorch
-                will stop fitting the machine learning algorithm if it tries
-                to allocate more than memory_limit MB. If None is provided,
-                no memory limit is set. In case of multi-processing, memory_limit
-                will be per job. This memory limit also applies to the ensemble
-                creation process.
+            memory_limit (Optional[int]: default=4096):
+                Memory limit in MB for the machine learning algorithm.
+                Autopytorch will stop fitting the machine learning algorithm
+                if it tries to allocate more than memory_limit MB. If None
+                is provided, no memory limit is set. In case of multi-processing,
+                memory_limit will be per job. This memory limit also applies to
+                the ensemble creation process.
             smac_scenario_args (Optional[Dict]):
                 Additional arguments inserted into the scenario of SMAC. See the
-                [SMAC documentation] (https://automl.github.io/SMAC3/master/options.html?highlight=scenario#scenario)
+                `SMAC documentation <https://automl.github.io/SMAC3/master/options.html?highlight=scenario#scenario>`_
+                for a list of available arguments.
             get_smac_object_callback (Optional[Callable]):
                 Callback function to create an object of class
-                [smac.optimizer.smbo.SMBO](https://automl.github.io/SMAC3/master/apidoc/smac.optimizer.smbo.html).
+                `smac.optimizer.smbo.SMBO <https://automl.github.io/SMAC3/master/apidoc/smac.optimizer.smbo.html>`_.
                 The function must accept the arguments scenario_dict,
                 instances, num_params, runhistory, seed and ta. This is
                 an advanced feature. Use only if you are familiar with
-                [SMAC](https://automl.github.io/SMAC3/master/index.html).
-            all_supported_metrics (bool), (default=True):
-                if True, all metrics supporting current task will be calculated
+                `SMAC <https://automl.github.io/SMAC3/master/index.html>`_.
+            tae_func (Optional[Callable]):
+                TargetAlgorithm to be optimised. If None, `eval_function`
+                available in autoPyTorch/evaluation/train_evaluator is used.
+                Must be child class of AbstractEvaluator.
+            all_supported_metrics (bool: default=True):
+                If True, all metrics supporting current task will be calculated
                 for each pipeline and results will be available via cv_results
-            precision (int), (default=32): Numeric precision used when loading
-                ensemble data. Can be either '16', '32' or '64'.
+            precision (int: default=32):
+                Numeric precision used when loading ensemble data.
+                Can be either '16', '32' or '64'.
             disable_file_output (Union[bool, List]):
-            load_models (bool), (default=True):
+            load_models (bool: default=True):
                 Whether to load the models after fitting AutoPyTorch.
-            portfolio_selection (str), (default=None):
+            portfolio_selection (Optional[str]):
                 This argument controls the initial configurations that
                 AutoPyTorch uses to warm start SMAC for hyperparameter
                 optimization. By default, no warm-starting happens.
                 The user can provide a path to a json file containing
                 configurations, similar to (...herepathtogreedy...).
                 Additionally, the keyword 'greedy' is supported,
                 which would use the default portfolio from
-                `AutoPyTorch Tabular <https://arxiv.org/abs/2006.13799>`
+                `AutoPyTorch Tabular <https://arxiv.org/abs/2006.13799>`_.
+
         Returns:
             self
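The portfolio_selection entry above accepts a path to a JSON file of warm-start configurations. A minimal sketch of writing and reading such a file (the configuration keys shown are invented for illustration; Auto-PyTorch's actual portfolio format may differ):

```python
import json
import os
import tempfile

def load_portfolio(path):
    """Read warm-start configurations from a JSON file on disk."""
    with open(path) as f:
        return json.load(f)

# Write a tiny hypothetical portfolio, then read it back as a warm start might.
portfolio = [{"optimizer": "AdamWOptimizer", "lr": 0.01}]
fd, path = tempfile.mkstemp(suffix=".json")
with os.fdopen(fd, "w") as f:
    json.dump(portfolio, f)
loaded = load_portfolio(path)
os.remove(path)
```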
@@ -281,6 +303,16 @@ def predict(
         batch_size: Optional[int] = None,
         n_jobs: int = 1
     ) -> np.ndarray:
+        """Generate the estimator predictions.
+        Generate the predictions based on the given examples from the test set.
+
+        Args:
+            X_test (np.ndarray):
+                The test set examples.
+
+        Returns:
+            Array with estimator predictions.
+        """
         if self.InputValidator is None or not self.InputValidator._is_fitted:
             raise ValueError("predict() is only supported after calling search. Kindly call first "
                              "the estimator fit() method.")
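The guard at the end of the hunk above (raising ValueError when predict() is called before search()) follows a common scikit-learn-style fitted-check pattern. A standalone sketch of that pattern (the class and its placeholder predictions are illustrative, not Auto-PyTorch code):

```python
class EstimatorSketch:
    """Illustrates the 'search before predict' guard shown in the diff."""

    def __init__(self):
        self._is_fitted = False

    def search(self, X, y):
        # Real model fitting would happen here; we only record that it ran.
        self._is_fitted = True
        return self

    def predict(self, X_test):
        if not self._is_fitted:
            raise ValueError("predict() is only supported after calling search.")
        return [0 for _ in X_test]  # placeholder predictions

est = EstimatorSketch()
try:
    est.predict([[1.0]])
    early_call_rejected = False
except ValueError:
    early_call_rejected = True
preds = est.search([[1.0]], [0]).predict([[1.0], [2.0]])
```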
