Commit f6af46f

ravinkohli and nabenabe0928 authored

[ADD] Documentation for data validation and preprocessing (#323)

* Address silly issues in documentation and add data validation and preprocessing
* Fix flake
* Apply suggestions from code review
* Apply suggestions from code review
* default value doc change
* unify documentation throughout library
* Update autoPyTorch/pipeline/components/training/metrics/base.py
* Update base_task.py
* Update tabular_classification.py
* Update tabular_classification.py
* Update tabular_regression.py
* fix flake

Co-authored-by: nabenabe0928 <47781922+nabenabe0928@users.noreply.github.com>

1 parent 96de622 commit f6af46f

File tree

13 files changed: +436 -282 lines changed

‎autoPyTorch/api/base_task.py

Lines changed: 120 additions & 78 deletions
Large diffs are not rendered by default.

‎autoPyTorch/api/tabular_classification.py

Lines changed: 86 additions & 54 deletions
@@ -25,37 +25,42 @@
 class TabularClassificationTask(BaseTask):
     """
     Tabular Classification API to the pipelines.
+
     Args:
-        seed (int), (default=1): seed to be used for reproducibility.
-        n_jobs (int), (default=1): number of consecutive processes to spawn.
-        n_threads (int), (default=1):
+        seed (int: default=1):
+            seed to be used for reproducibility.
+        n_jobs (int: default=1):
+            number of consecutive processes to spawn.
+        n_threads (int: default=1):
             number of threads to use for each process.
         logging_config (Optional[Dict]):
-            specifies configuration for logging, if None, it is loaded from the logging.yaml
-        ensemble_size (int), (default=50):
+            Specifies configuration for logging, if None, it is loaded from the logging.yaml
+        ensemble_size (int: default=50):
             Number of models added to the ensemble built by
             Ensemble selection from libraries of models.
             Models are drawn with replacement.
-        ensemble_nbest (int), (default=50):
-            only consider the ensemble_nbest
+        ensemble_nbest (int: default=50):
+            Only consider the ensemble_nbest
             models to build the ensemble
-        max_models_on_disc (int), (default=50):
-            maximum number of models saved to disc.
-            Also, controls the size of the ensemble as any additional models will be deleted.
+        max_models_on_disc (int: default=50):
+            Maximum number of models saved to disc.
+            Also, controls the size of the ensemble
+            as any additional models will be deleted.
             Must be greater than or equal to 1.
         temporary_directory (str):
-            folder to store configuration output and log file
+            Folder to store configuration output and log file
         output_directory (str):
-            folder to store predictions for optional test set
+            Folder to store predictions for optional test set
         delete_tmp_folder_after_terminate (bool):
-            determines whether to delete the temporary directory, when finished
+            Determines whether to delete the temporary directory,
+            when finished
         include_components (Optional[Dict]):
-            If None, all possible components are used. Otherwise
-            specifies set of components to use.
+            If None, all possible components are used.
+            Otherwise specifies set of components to use.
         exclude_components (Optional[Dict]):
-            If None, all possible components are used. Otherwise
-            specifies set of components not to use. Incompatible
-            with include components
+            If None, all possible components are used.
+            Otherwise specifies set of components not to use.
+            Incompatible with include components.
         search_space_updates (Optional[HyperparameterSearchSpaceUpdates]):
             search space updates that can be used to modify the search
             space of particular components or choice modules of the pipeline
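The ensemble_size and ensemble_nbest entries above describe "Ensemble selection from libraries of models" where "models are drawn with replacement". As a rough illustration of that greedy procedure (a minimal sketch, not Auto-PyTorch's actual implementation; the accuracy helper and toy probabilities are assumptions):

```python
def accuracy(probs, y_true):
    """Fraction of correct 0/1 predictions after thresholding at 0.5."""
    return sum((p >= 0.5) == bool(y) for p, y in zip(probs, y_true)) / len(y_true)

def greedy_ensemble_selection(predictions, y_true, ensemble_size):
    """Greedy ensemble selection: repeatedly add, with replacement, the model
    whose inclusion most improves the averaged ensemble's accuracy."""
    chosen, current_sum = [], [0.0] * len(y_true)
    for _ in range(ensemble_size):
        best_idx, best_score = None, -1.0
        for idx, preds in enumerate(predictions):
            # Average of the current members plus this candidate.
            trial = [(s + p) / (len(chosen) + 1) for s, p in zip(current_sum, preds)]
            score = accuracy(trial, y_true)
            if score > best_score:
                best_idx, best_score = idx, score
        chosen.append(best_idx)
        current_sum = [s + p for s, p in zip(current_sum, predictions[best_idx])]
    return chosen

# Toy library: two models' predicted probabilities for four samples.
models = [[0.9, 0.9, 0.1, 0.1], [0.6, 0.4, 0.6, 0.4]]
selected = greedy_ensemble_selection(models, [1, 1, 0, 0], ensemble_size=3)
```

Because drawing is with replacement, a strong model can be selected repeatedly, which is how ensemble_size can exceed the number of distinct models.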
@@ -102,6 +107,16 @@ def __init__(
         )
 
     def build_pipeline(self, dataset_properties: Dict[str, Any]) -> TabularClassificationPipeline:
+        """
+        Build pipeline according to current task and for the passed dataset properties
+
+        Args:
+            dataset_properties (Dict[str, Any])
+
+        Returns:
+            TabularClassificationPipeline:
+                Pipeline compatible with the given dataset properties.
+        """
         return TabularClassificationPipeline(dataset_properties=dataset_properties)
 
     def search(
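The docstring added above follows the Google style (an Args section, a Returns section with the type on its own line and the description indented beneath it) that this commit unifies across the library. A hypothetical toy function in the same style (the function and its fields are invented for illustration):

```python
def build_report(dataset_properties):
    """Build a short report for the passed dataset properties.

    Args:
        dataset_properties (dict):
            Properties describing the dataset.

    Returns:
        str:
            Summary line listing the given dataset properties.
    """
    return ", ".join(f"{k}={v}" for k, v in sorted(dataset_properties.items()))
```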
@@ -143,38 +158,38 @@ def search(
             budget_type (str):
                 Type of budget to be used when fitting the pipeline.
                 It can be one of:
-                + 'epochs': The training of each pipeline will be terminated after
-                    a number of epochs have passed. This number of epochs is determined by the
-                    budget argument of this method.
-                + 'runtime': The training of each pipeline will be terminated after
-                    a number of seconds have passed. This number of seconds is determined by the
-                    budget argument of this method. The overall fitting time of a pipeline is
-                    controlled by func_eval_time_limit_secs. 'runtime' only controls the allocated
-                    time to train a pipeline, but it does not consider the overall time it takes
-                    to create a pipeline (data loading and preprocessing, other i/o operations, etc.).
-                budget_type will determine the units of min_budget/max_budget. If budget_type=='epochs'
-                is used, min_budget will refer to epochs whereas if budget_type=='runtime' then
-                min_budget will refer to seconds.
+                + `epochs`: The training of each pipeline will be terminated after
+                    a number of epochs have passed. This number of epochs is determined by the
+                    budget argument of this method.
+                + `runtime`: The training of each pipeline will be terminated after
+                    a number of seconds have passed. This number of seconds is determined by the
+                    budget argument of this method. The overall fitting time of a pipeline is
+                    controlled by func_eval_time_limit_secs. 'runtime' only controls the allocated
+                    time to train a pipeline, but it does not consider the overall time it takes
+                    to create a pipeline (data loading and preprocessing, other i/o operations, etc.).
+                budget_type will determine the units of min_budget/max_budget. If budget_type=='epochs'
+                is used, min_budget will refer to epochs whereas if budget_type=='runtime' then
+                min_budget will refer to seconds.
             min_budget (int):
-                Auto-PyTorch uses `Hyperband <https://arxiv.org/abs/1603.06560>_` to
+                Auto-PyTorch uses `Hyperband <https://arxiv.org/abs/1603.06560>`_ to
                 trade-off resources between running many pipelines at min_budget and
                 running the top performing pipelines on max_budget.
                 min_budget states the minimum resource allocation a pipeline should have
                 so that we can compare and quickly discard bad performing models.
                 For example, if the budget_type is epochs, and min_budget=5, then we will
                 run every pipeline to a minimum of 5 epochs before performance comparison.
             max_budget (int):
-                Auto-PyTorch uses `Hyperband <https://arxiv.org/abs/1603.06560>_` to
+                Auto-PyTorch uses `Hyperband <https://arxiv.org/abs/1603.06560>`_ to
                 trade-off resources between running many pipelines at min_budget and
                 running the top performing pipelines on max_budget.
                 max_budget states the maximum resource allocation a pipeline is going to
                 be run. For example, if the budget_type is epochs, and max_budget=50,
                 then the pipeline training will be terminated after 50 epochs.
-            total_walltime_limit (int), (default=100): Time limit
-                in seconds for the search of appropriate models.
+            total_walltime_limit (int: default=100):
+                Time limit in seconds for the search of appropriate models.
                 By increasing this value, autopytorch has a higher
                 chance of finding better models.
-            func_eval_time_limit_secs (int), (default=None):
+            func_eval_time_limit_secs (Optional[int]):
                 Time limit for a single call to the machine learning model.
                 Model fitting will be terminated if the machine
                 learning algorithm runs over the time limit. Set
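The min_budget/max_budget entries above are the endpoints of Hyperband's resource schedule: many pipelines start at min_budget, and only the top performers are promoted toward max_budget. A minimal sketch of one successive-halving bracket (the eta=3 halving rate and the rung arithmetic are assumptions for illustration; the real scheduling is done by the optimizer, not by this file):

```python
def successive_halving(n_configs, min_budget, max_budget, eta=3):
    """One bracket: start n_configs pipelines at min_budget, keep roughly the
    top 1/eta at each rung while multiplying the budget by eta."""
    rungs, n, b = [], n_configs, min_budget
    while b < max_budget and n >= 1:
        rungs.append((n, b))
        n //= eta   # discard all but the best 1/eta of the survivors
        b *= eta    # give the survivors eta times more budget
    rungs.append((max(n, 1), max_budget))
    return rungs

# With budget_type='epochs', min_budget=5 and max_budget=50 (the docstring's
# example values): 27 configs train 5 epochs, then 9 train 15, 3 train 45,
# and the final survivor gets the full 50 epochs.
schedule = successive_halving(27, 5, 50)
```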
@@ -185,47 +200,54 @@ def search(
                 total_walltime_limit // 2 to allow enough time to fit
                 at least 2 individual machine learning algorithms.
                 Set to np.inf in case no time limit is desired.
-            enable_traditional_pipeline (bool), (default=True):
+            enable_traditional_pipeline (bool: default=True):
                 We fit traditional machine learning algorithms
                 (LightGBM, CatBoost, RandomForest, ExtraTrees, KNN, SVM)
-                before building PyTorch Neural Networks. You can disable this
+                prior to building PyTorch Neural Networks. You can disable this
                 feature by turning this flag to False. All machine learning
                 algorithms that are fitted during search() are considered for
                 ensemble building.
-            memory_limit (Optional[int]), (default=4096):
-                Memory limit in MB for the machine learning algorithm. autopytorch
-                will stop fitting the machine learning algorithm if it tries
-                to allocate more than memory_limit MB. If None is provided,
-                no memory limit is set. In case of multi-processing, memory_limit
-                will be per job. This memory limit also applies to the ensemble
-                creation process.
+            memory_limit (Optional[int]: default=4096):
+                Memory limit in MB for the machine learning algorithm.
+                Autopytorch will stop fitting the machine learning algorithm
+                if it tries to allocate more than memory_limit MB. If None
+                is provided, no memory limit is set. In case of multi-processing,
+                memory_limit will be per job. This memory limit also applies to
+                the ensemble creation process.
             smac_scenario_args (Optional[Dict]):
                 Additional arguments inserted into the scenario of SMAC. See the
-                [SMAC documentation] (https://automl.github.io/SMAC3/master/options.html?highlight=scenario#scenario)
+                `SMAC documentation <https://automl.github.io/SMAC3/master/options.html?highlight=scenario#scenario>`_
+                for a list of available arguments.
             get_smac_object_callback (Optional[Callable]):
                 Callback function to create an object of class
-                [smac.optimizer.smbo.SMBO](https://automl.github.io/SMAC3/master/apidoc/smac.optimizer.smbo.html).
+                `smac.optimizer.smbo.SMBO <https://automl.github.io/SMAC3/master/apidoc/smac.optimizer.smbo.html>`_.
                 The function must accept the arguments scenario_dict,
                 instances, num_params, runhistory, seed and ta. This is
                 an advanced feature. Use only if you are familiar with
-                [SMAC](https://automl.github.io/SMAC3/master/index.html).
-            all_supported_metrics (bool), (default=True):
-                if True, all metrics supporting current task will be calculated
+                `SMAC <https://automl.github.io/SMAC3/master/index.html>`_.
+            tae_func (Optional[Callable]):
+                TargetAlgorithm to be optimised. If None, `eval_function`
+                available in autoPyTorch/evaluation/train_evaluator is used.
+                Must be child class of AbstractEvaluator.
+            all_supported_metrics (bool: default=True):
+                If True, all metrics supporting current task will be calculated
                 for each pipeline and results will be available via cv_results
-            precision (int), (default=32): Numeric precision used when loading
-                ensemble data. Can be either '16', '32' or '64'.
+            precision (int: default=32):
+                Numeric precision used when loading ensemble data.
+                Can be either '16', '32' or '64'.
             disable_file_output (Union[bool, List]):
-            load_models (bool), (default=True):
+            load_models (bool: default=True):
                 Whether to load the models after fitting AutoPyTorch.
-            portfolio_selection (str), (default=None):
+            portfolio_selection (Optional[str]):
                 This argument controls the initial configurations that
                 AutoPyTorch uses to warm start SMAC for hyperparameter
                 optimization. By default, no warm-starting happens.
                 The user can provide a path to a json file containing
                 configurations, similar to (...herepathtogreedy...).
                 Additionally, the keyword 'greedy' is supported,
                 which would use the default portfolio from
-                `AutoPyTorch Tabular <https://arxiv.org/abs/2006.13799>`
+                `AutoPyTorch Tabular <https://arxiv.org/abs/2006.13799>`_.
+
         Returns:
             self
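The portfolio_selection entry above accepts a path to a JSON file of warm-start configurations. A minimal sketch of writing and reading such a file (the configuration keys shown are invented for illustration; Auto-PyTorch's actual portfolio format may differ):

```python
import json
import os
import tempfile

def load_portfolio(path):
    """Read warm-start configurations from a JSON file on disk."""
    with open(path) as f:
        return json.load(f)

# Write a tiny hypothetical portfolio, then read it back as a warm start might.
portfolio = [{"optimizer": "AdamWOptimizer", "lr": 0.01}]
fd, path = tempfile.mkstemp(suffix=".json")
with os.fdopen(fd, "w") as f:
    json.dump(portfolio, f)
loaded = load_portfolio(path)
os.remove(path)
```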
@@ -281,6 +303,16 @@ def predict(
         batch_size: Optional[int] = None,
         n_jobs: int = 1
     ) -> np.ndarray:
+        """Generate the estimator predictions.
+        Generate the predictions based on the given examples from the test set.
+
+        Args:
+            X_test (np.ndarray):
+                The test set examples.
+
+        Returns:
+            Array with estimator predictions.
+        """
         if self.InputValidator is None or not self.InputValidator._is_fitted:
             raise ValueError("predict() is only supported after calling search. Kindly call first "
                              "the estimator fit() method.")
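The guard at the end of the hunk above (raising ValueError when predict() is called before search()) follows a common scikit-learn-style fitted-check pattern. A standalone sketch of that pattern (the class and its placeholder predictions are illustrative, not Auto-PyTorch code):

```python
class EstimatorSketch:
    """Illustrates the 'search before predict' guard shown in the diff."""

    def __init__(self):
        self._is_fitted = False

    def search(self, X, y):
        # Real model fitting would happen here; we only record that it ran.
        self._is_fitted = True
        return self

    def predict(self, X_test):
        if not self._is_fitted:
            raise ValueError("predict() is only supported after calling search.")
        return [0 for _ in X_test]  # placeholder predictions

est = EstimatorSketch()
try:
    est.predict([[1.0]])
    early_call_rejected = False
except ValueError:
    early_call_rejected = True
preds = est.search([[1.0]], [0]).predict([[1.0], [2.0]])
```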
