
TFDS CLI

TFDS CLI is a command-line tool that provides various commands to easily work with TensorFlow Datasets.

Disable TF logs on import
%%capture
%env TF_CPP_MIN_LOG_LEVEL=1 # Disable logs on TF import

Installation

The CLI tool is installed with tensorflow-datasets (or tfds-nightly).

pip install -q tfds-nightly apache-beam
tfds --version

For the list of all CLI commands:

tfds --help
usage: tfds [-h] [--helpfull] [--version] {build,new} ...
Tensorflow Datasets CLI tool
optional arguments:
 -h, --help show this help message and exit
 --helpfull show full help message and exit
 --version show program's version number and exit
command:
 {build,new}
 build Commands for downloading and preparing datasets.
 new Creates a new dataset directory from the template.

tfds new: Implementing a new Dataset

This command will help you kickstart writing your new Python dataset by creating a <dataset_name>/ directory containing default implementation files.

Usage:

tfds new my_dataset
Dataset generated at /tmpfs/src/temp/docs/my_dataset
You can start searching `TODO(my_dataset)` to complete the implementation.
Please check https://www.tensorflow.org/datasets/add_dataset for additional details.

tfds new my_dataset will create:

ls -1 my_dataset/
CITATIONS.bib
README.md
TAGS.txt
__init__.py
checksums.tsv
dummy_data/
my_dataset_dataset_builder.py
my_dataset_dataset_builder_test.py
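
Once the `TODO(my_dataset)` placeholders are filled in, the generated test can typically be run directly against the `dummy_data/` (a minimal sketch, assuming the template's test entry point is unchanged and the command is run from the parent directory):

python my_dataset/my_dataset_dataset_builder_test.py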

An optional flag --data_format can be used to generate format-specific dataset builders (e.g., conll). If no data format is given, it will generate a template for a standard tfds.core.GeneratorBasedBuilder. Refer to the documentation for details on the available format-specific dataset builders.
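
For example, to generate a CoNLL-specific template (one of the formats listed in `tfds new --help` below):

tfds new my_dataset --data_format=conll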

See our writing dataset guide for more info.

Available options:

tfds new --help
usage: tfds new [-h] [--helpfull] [--data_format {standard,conll,conllu}]
 [--dir DIR]
 dataset_name
positional arguments:
 dataset_name Name of the dataset to be created (in snake_case)
optional arguments:
 -h, --help show this help message and exit
 --helpfull show full help message and exit
 --data_format {standard,conll,conllu}
 Optional format of the input data, which is used to
 generate a format-specific template.
 --dir DIR Path where the dataset directory will be created.
 Defaults to current directory.

tfds build: Download and prepare a dataset

Use tfds build <my_dataset> to generate a new dataset. <my_dataset> can be:

  • A path to a dataset/ folder or a dataset.py file (empty for the current directory):

    • tfds build datasets/my_dataset/
    • cd datasets/my_dataset/ && tfds build
    • cd datasets/my_dataset/ && tfds build my_dataset
    • cd datasets/my_dataset/ && tfds build my_dataset.py
  • A registered dataset:

    • tfds build mnist
    • tfds build my_dataset --imports my_project.datasets
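
For example, a registered dataset can be built into a custom location, overwriting any previous run (both flags are documented in the help output below):

tfds build mnist --data_dir=/tmp/tensorflow_datasets --overwrite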

Available options:

tfds build --help
usage: tfds build [-h] [--helpfull]
 [--datasets DATASETS_KEYWORD [DATASETS_KEYWORD ...]]
 [--overwrite] [--fail_if_exists]
 [--max_examples_per_split [MAX_EXAMPLES_PER_SPLIT]]
 [--data_dir DATA_DIR] [--download_dir DOWNLOAD_DIR]
 [--extract_dir EXTRACT_DIR] [--manual_dir MANUAL_DIR]
 [--add_name_to_manual_dir] [--download_only]
 [--config CONFIG] [--config_idx CONFIG_IDX]
 [--update_metadata_only] [--download_config DOWNLOAD_CONFIG]
 [--imports IMPORTS] [--register_checksums]
 [--force_checksums_validation]
 [--noforce_checksums_validation]
 [--beam_pipeline_options BEAM_PIPELINE_OPTIONS]
 [--file_format FILE_FORMAT]
 [--max_shard_size_mb MAX_SHARD_SIZE_MB]
 [--num-processes NUM_PROCESSES] [--publish_dir PUBLISH_DIR]
 [--skip_if_published] [--exclude_datasets EXCLUDE_DATASETS]
 [--experimental_latest_version]
 [datasets ...]
positional arguments:
 datasets Name(s) of the dataset(s) to build. Defaults to the
 current dir. See https://www.tensorflow.org/datasets/cli
 for accepted values.
optional arguments:
 -h, --help show this help message and exit
 --helpfull show full help message and exit
 --datasets DATASETS_KEYWORD [DATASETS_KEYWORD ...]
 Datasets can also be provided as a keyword argument.
Debug & tests:
 --pdb Enter post-mortem debugging mode if an exception is raised.
 --overwrite Delete pre-existing dataset if it exists.
 --fail_if_exists Fails the program if there is a pre-existing dataset.
 --max_examples_per_split [MAX_EXAMPLES_PER_SPLIT]
 When set, only generate the first X examples (defaults
 to 1), rather than the full dataset. If set to 0, only
 execute the `_split_generators` (which downloads the
 original data), but skip `_generate_examples`.
Paths:
 --data_dir DATA_DIR Where to place datasets. Defaults to
 `~/tensorflow_datasets/` or the `TFDS_DATA_DIR`
 environment variable.
 --download_dir DOWNLOAD_DIR
 Where to place downloads. Defaults to
 `<data_dir>/downloads/`.
 --extract_dir EXTRACT_DIR
 Where to extract files. Defaults to
 `<download_dir>/extracted/`.
 --manual_dir MANUAL_DIR
 Where to manually download data (required for some
 datasets). Defaults to `<download_dir>/manual/`.
 --add_name_to_manual_dir
 If true, append the dataset name to the `manual_dir`
 (e.g. `<download_dir>/manual/<dataset_name>/`). Useful
 to avoid collisions if many datasets are generated.
Generation:
 --download_only If True, download all files but do not prepare the
 dataset. Uses the checksums.tsv to find out what to
 download. Therefore, this does not work in combination
 with --register_checksums.
 --config CONFIG, -c CONFIG
 Config name to build. Build all configs if not set.
 Can also be a json of the kwargs forwarded to the
 config `__init__` (for custom configs).
 --config_idx CONFIG_IDX
 Config id to build
 (`builder_cls.BUILDER_CONFIGS[config_idx]`). Mutually
 exclusive with `--config`.
 --update_metadata_only
 If True, existing dataset_info.json is updated with
 metadata defined in Builder class(es). Datasets must
 already have been prepared.
 --download_config DOWNLOAD_CONFIG
 A json of the kwargs forwarded to the config
 `__init__` (for custom DownloadConfigs).
 --imports IMPORTS, -i IMPORTS
 Comma-separated list of modules to import to register
 datasets.
 --register_checksums If True, store size and checksum of downloaded files.
 --force_checksums_validation
 If True, raise an error if the checksums are not
 found.
 --noforce_checksums_validation
 If specified, bypass the checks on the checksums.
 --beam_pipeline_options BEAM_PIPELINE_OPTIONS
 A (comma-separated) list of flags to pass to
 `PipelineOptions` when preparing with Apache Beam.
 (see:
 https://www.tensorflow.org/datasets/beam_datasets).
 Example: `--beam_pipeline_options=job_name=my-
 job,project=my-project`
 --file_format FILE_FORMAT
 File format in which to generate the tf-examples.
 Available values: ['tfrecord', 'riegeli',
 'array_record'] (see `tfds.core.FileFormat`).
 --max_shard_size_mb MAX_SHARD_SIZE_MB
 The max shard size in megabytes.
 --num-processes NUM_PROCESSES
 Number of parallel build processes.
Publishing:
 Options for publishing successfully created datasets.
 --publish_dir PUBLISH_DIR
 Where to optionally publish the dataset after it has
 been generated successfully. Should be the root data
 dir under which datasets are stored. If unspecified,
 the dataset will not be published.
 --skip_if_published If the dataset with the same version and config is
 already published, then it will not be regenerated.
Automation:
 Used by automated scripts.
 --exclude_datasets EXCLUDE_DATASETS
 If set, generate all datasets except the ones defined
 here. Comma-separated list of datasets to exclude.
 --experimental_latest_version
 Build the latest Version(experiments=...) available
 rather than the default version.
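
For example, a quick debugging run that combines several of the options above: build a single config, generate only 100 examples per split, and write `array_record` files (a sketch; `my_dataset` and `my_config` are placeholders):

tfds build my_dataset --config=my_config --max_examples_per_split=100 --file_format=array_record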
