The ML.TFDV_DESCRIBE function

This document describes the ML.TFDV_DESCRIBE function, which you can use to generate fine-grained statistics for the columns in a table. For example, you might want to know statistics for a table of training or serving data statistics that you plan to use with a machine learning (ML) model. Calling this function provides the same behavior as calling the TensorFlow TensorFlow tfdv.generate_statistics_from_csv API. You can use the data output by this function for such purposes as feature preprocessing or model monitoring.

Syntax

ML.TFDV_DESCRIBE(
{TABLE`PROJECT_ID.DATASET.TABLE_NAME`|(QUERY_STATEMENT)},
STRUCT(
[NUM_HISTOGRAM_BUCKETSASnum_histogram_buckets]
[,NUM_QUANTILES_HISTOGRAM_BUCKETSASnum_quantiles_histogram_buckets]
[,NUM_VALUES_HISTOGRAM_BUCKETSASnum_values_histogram_buckets]
[,NUM_RANK_HISTOGRAM_BUCKETSASnum_rank_histogram_buckets])
)

Arguments

ML.TFDV_DESCRIBE takes the following arguments:

  • PROJECT_ID: your project ID.
  • DATASET: the BigQuery dataset that contains the table.
  • TABLE_NAME: the name of the input table that contains the training or serving data to calculate statistics for.
  • QUERY_STATEMENT: a query that generates the training or serving data to calculate statistics for. For the supported SQL syntax of the QUERY_STATEMENT clause, see GoogleSQL query syntax.
  • NUM_HISTOGRAM_BUCKETS: an INT64 value that specifies the number of buckets to use for a histogram with equal-width buckets. Only applies to numerical, ARRAY<numerical>, and ARRAY<STRUCT<INT64, numerical>> columns. The num_histogram_buckets value must be in the range [1, 1,000]. The default value is 10.
  • NUM_QUANTILES_HISTOGRAM_BUCKETS: an INT64 value that specifies the number of buckets to use for a quantiles histogram. Only applies to numerical, ARRAY<numerical>, and ARRAY<STRUCT<INT64, numerical>> columns. The num_quantiles_histogram_buckets value must be in the range [1, 1,000]. The default value is 10.
  • NUM_VALUES_HISTOGRAM_BUCKETS: an INT64 value that specifies the number of buckets to use for a quantiles histogram. Only applies to ARRAY columns. The num_values_histogram_buckets value must be in the range [1, 1,000]. The default value is 10.
  • NUM_RANK_HISTOGRAM_BUCKETS: an INT64 value that specifies the number of buckets to use for a rank histogram. Only applies to categorical and ARRAY<categorical> columns. The num_rank_histogram_buckets value must be in the range [1, 10,000]. The default value is 50.

Output

ML.TFDV_DESCRIBE returns a column named dataset_feature_statistics_list that contains a TensorFlow DatasetFeatureStatisticsList protocol buffer in JSON format.

Example

The following example returns statistics for the penguins public dataset and uses 20 buckets for rank histograms for string values:

SELECT*FROMML.TFDV_DESCRIBE(
TABLE`bigquery-public-data.ml_datasets.penguins`,
STRUCT(20ASnum_rank_histogram_buckets)
);

Limitations

Input data for the ML.TFDV_DESCRIBE function can only contain columns of the following data types:

  • Numeric types
  • STRING
  • BOOL
  • BYTE
  • DATE
  • DATETIME
  • TIME
  • TIMESTAMP
  • ARRAY<STRUCT<INT64, FLOAT64>> (a sparse tensor)
  • STRUCT columns that contain any of the following types:
    • Numeric types
    • STRING
    • BOOL
    • BYTE
    • DATE
    • DATETIME
    • TIME
    • TIMESTAMP
  • ARRAY columns that contain any of the following types:
    • Numeric types
    • STRING
    • BOOL
    • BYTE
    • DATE
    • DATETIME
    • TIME
    • TIMESTAMP

What's next

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2025年10月24日 UTC.