Cloud Storage to Cloud Storage template

Use the Serverless for Apache Spark Cloud Storage to Cloud Storage template to extract data from Cloud Storage, optionally transform it with Spark SQL, and load it into Cloud Storage.

Use the template

Run the template using the gcloud CLI or Dataproc API.

gcloud

Before using any of the command data below, make the following replacements:

  • PROJECT_ID: Required. Your Google Cloud project ID listed in the IAM Settings.
  • REGION: Required. Compute Engine region.
  • SUBNET: Optional. If a subnet is not specified, the subnet in the specified REGION in the default network is selected.

    Example: projects/PROJECT_ID/regions/REGION/subnetworks/SUBNET_NAME

  • TEMPLATE_VERSION: Required. Specify latest for the latest template version, or the date of a specific version, for example, 2023.03.17_v0.1.0-beta (visit gs://dataproc-templates-binaries or run gcloud storage ls gs://dataproc-templates-binaries to list available template versions).
  • CLOUD_STORAGE_INPUT_PATH: Required. Cloud Storage path to read input data from.

    Example: gs://example-bucket/example-folder/

  • INPUT_FILE_FORMAT: Required. Input data format. Options: avro, parquet, or orc. Note: If avro, you must add "file:///usr/lib/spark/connector/spark-avro.jar" to the jars gcloud CLI flag or API field.

    Example (the file:// prefix references a Serverless for Apache Spark jar file):

    --jars=file:///usr/lib/spark/connector/spark-avro.jar, [ ... other jars]
  • CLOUD_STORAGE_OUTPUT_PATH: Required. Cloud Storage path where output will be stored.

    Example: gs://example-bucket/example-folder/

  • OUTPUT_FILE_FORMAT: Required. Output data format. Options: avro, csv, parquet, json, or orc. Note: If avro, you must add "file:///usr/lib/spark/connector/spark-avro.jar" to the jars gcloud CLI flag or API field.

    Example (the file:// prefix references a Serverless for Apache Spark jar file):

    --jars=file:///usr/lib/spark/connector/spark-avro.jar, [ ... other jars]
  • MODE: Required. Write mode for Cloud Storage output. Options: Append, Overwrite, Ignore, or ErrorIfExists.
  • TEMP_TABLE and TEMP_QUERY: Optional. Use these two parameters together to apply a Spark SQL transformation while loading data into Cloud Storage. TEMP_TABLE is the name of the temporary view, and TEMP_QUERY is the query statement. TEMP_TABLE and the table name in TEMP_QUERY must match (see the example after this list).
  • SERVICE_ACCOUNT: Optional. If not provided, the default Compute Engine service account is used.
  • PROPERTY and PROPERTY_VALUE: Optional. Comma-separated list of Spark property=value pairs.
  • LABEL and LABEL_VALUE: Optional. Comma-separated list of label=value pairs.
  • LOG_LEVEL: Optional. Level of logging. Can be one of ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, or WARN. Default: INFO.
  • KMS_KEY: Optional. The Cloud Key Management Service key to use for encryption. If a key is not specified, data is encrypted at rest using a Google-owned and Google-managed encryption key.

    Example: projects/PROJECT_ID/regions/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME
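
    For example, to filter rows while copying, you can register the input data as a temporary view and query it with TEMP_TABLE and TEMP_QUERY. This is a hypothetical illustration; the view name temp_view and the id column are placeholders to replace with your own values:

    --templateProperty gcs.gcs.temp.table="temp_view" \
    --templateProperty gcs.gcs.temp.query="SELECT * FROM temp_view WHERE id IS NOT NULL"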

Execute the following command:

Linux, macOS, or Cloud Shell

gcloud dataproc batches submit spark \
--class=com.google.cloud.dataproc.templates.main.DataProcTemplate \
--version="1.2" \
--project="PROJECT_ID" \
--region="REGION" \
--jars="gs://dataproc-templates-binaries/TEMPLATE_VERSION/java/dataproc-templates.jar,file:///usr/lib/spark/connector/spark-avro.jar" \
--subnet="SUBNET" \
--kms-key="KMS_KEY" \
--service-account="SERVICE_ACCOUNT" \
--properties="PROPERTY=PROPERTY_VALUE" \
--labels="LABEL=LABEL_VALUE" \
-- --template=GCSTOGCS \
--templateProperty log.level="LOG_LEVEL" \
--templateProperty project.id="PROJECT_ID" \
--templateProperty gcs.gcs.input.location="CLOUD_STORAGE_INPUT_PATH" \
--templateProperty gcs.gcs.input.format="INPUT_FILE_FORMAT" \
--templateProperty gcs.gcs.output.location="CLOUD_STORAGE_OUTPUT_PATH" \
--templateProperty gcs.gcs.output.format="OUTPUT_FILE_FORMAT" \
--templateProperty gcs.gcs.write.mode="MODE" \
--templateProperty gcs.gcs.temp.table="TEMP_TABLE" \
--templateProperty gcs.gcs.temp.query="TEMP_QUERY"
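
For illustration, the following hypothetical invocation converts Parquet files to Avro; the project ID, region, and bucket paths are placeholder values to replace with your own:

gcloud dataproc batches submit spark \
--class=com.google.cloud.dataproc.templates.main.DataProcTemplate \
--version="1.2" \
--project="my-project" \
--region="us-central1" \
--jars="gs://dataproc-templates-binaries/latest/java/dataproc-templates.jar,file:///usr/lib/spark/connector/spark-avro.jar" \
-- --template=GCSTOGCS \
--templateProperty project.id="my-project" \
--templateProperty gcs.gcs.input.location="gs://example-bucket/input/" \
--templateProperty gcs.gcs.input.format="parquet" \
--templateProperty gcs.gcs.output.location="gs://example-bucket/output/" \
--templateProperty gcs.gcs.output.format="avro" \
--templateProperty gcs.gcs.write.mode="Overwrite"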

Windows (PowerShell)

gcloud dataproc batches submit spark `
--class=com.google.cloud.dataproc.templates.main.DataProcTemplate `
--version="1.2" `
--project="PROJECT_ID" `
--region="REGION" `
--jars="gs://dataproc-templates-binaries/TEMPLATE_VERSION/java/dataproc-templates.jar,file:///usr/lib/spark/connector/spark-avro.jar" `
--subnet="SUBNET" `
--kms-key="KMS_KEY" `
--service-account="SERVICE_ACCOUNT" `
--properties="PROPERTY=PROPERTY_VALUE" `
--labels="LABEL=LABEL_VALUE" `
-- --template=GCSTOGCS `
--templateProperty log.level="LOG_LEVEL" `
--templateProperty project.id="PROJECT_ID" `
--templateProperty gcs.gcs.input.location="CLOUD_STORAGE_INPUT_PATH" `
--templateProperty gcs.gcs.input.format="INPUT_FILE_FORMAT" `
--templateProperty gcs.gcs.output.location="CLOUD_STORAGE_OUTPUT_PATH" `
--templateProperty gcs.gcs.output.format="OUTPUT_FILE_FORMAT" `
--templateProperty gcs.gcs.write.mode="MODE" `
--templateProperty gcs.gcs.temp.table="TEMP_TABLE" `
--templateProperty gcs.gcs.temp.query="TEMP_QUERY"

Windows (cmd.exe)

gcloud dataproc batches submit spark ^
--class=com.google.cloud.dataproc.templates.main.DataProcTemplate ^
--version="1.2" ^
--project="PROJECT_ID" ^
--region="REGION" ^
--jars="gs://dataproc-templates-binaries/TEMPLATE_VERSION/java/dataproc-templates.jar,file:///usr/lib/spark/connector/spark-avro.jar" ^
--subnet="SUBNET" ^
--kms-key="KMS_KEY" ^
--service-account="SERVICE_ACCOUNT" ^
--properties="PROPERTY=PROPERTY_VALUE" ^
--labels="LABEL=LABEL_VALUE" ^
-- --template=GCSTOGCS ^
--templateProperty log.level="LOG_LEVEL" ^
--templateProperty project.id="PROJECT_ID" ^
--templateProperty gcs.gcs.input.location="CLOUD_STORAGE_INPUT_PATH" ^
--templateProperty gcs.gcs.input.format="INPUT_FILE_FORMAT" ^
--templateProperty gcs.gcs.output.location="CLOUD_STORAGE_OUTPUT_PATH" ^
--templateProperty gcs.gcs.output.format="OUTPUT_FILE_FORMAT" ^
--templateProperty gcs.gcs.write.mode="MODE" ^
--templateProperty gcs.gcs.temp.table="TEMP_TABLE" ^
--templateProperty gcs.gcs.temp.query="TEMP_QUERY"

REST

Before using any of the request data, make the following replacements:

  • PROJECT_ID: Required. Your Google Cloud project ID listed in the IAM Settings.
  • REGION: Required. Compute Engine region.
  • SUBNET: Optional. If a subnet is not specified, the subnet in the specified REGION in the default network is selected.

    Example: projects/PROJECT_ID/regions/REGION/subnetworks/SUBNET_NAME

  • TEMPLATE_VERSION: Required. Specify latest for the latest template version, or the date of a specific version, for example, 2023.03.17_v0.1.0-beta (visit gs://dataproc-templates-binaries or run gcloud storage ls gs://dataproc-templates-binaries to list available template versions).
  • CLOUD_STORAGE_INPUT_PATH: Required. Cloud Storage path to read input data from.

    Example: gs://example-bucket/example-folder/

  • INPUT_FILE_FORMAT: Required. Input data format. Options: avro, parquet, or orc. Note: If avro, you must add "file:///usr/lib/spark/connector/spark-avro.jar" to the jars gcloud CLI flag or API field.

    Example (the file:// prefix references a Serverless for Apache Spark jar file):

    --jars=file:///usr/lib/spark/connector/spark-avro.jar, [ ... other jars]
  • CLOUD_STORAGE_OUTPUT_PATH: Required. Cloud Storage path where output will be stored.

    Example: gs://example-bucket/example-folder/

  • OUTPUT_FILE_FORMAT: Required. Output data format. Options: avro, csv, parquet, json, or orc. Note: If avro, you must add "file:///usr/lib/spark/connector/spark-avro.jar" to the jars gcloud CLI flag or API field.

    Example (the file:// prefix references a Serverless for Apache Spark jar file):

    --jars=file:///usr/lib/spark/connector/spark-avro.jar, [ ... other jars]
  • MODE: Required. Write mode for Cloud Storage output. Options: Append, Overwrite, Ignore, or ErrorIfExists.
  • TEMP_TABLE and TEMP_QUERY: Optional. Use these two parameters together to apply a Spark SQL transformation while loading data into Cloud Storage. TEMP_TABLE is the name of the temporary view, and TEMP_QUERY is the query statement. TEMP_TABLE and the table name in TEMP_QUERY must match.
  • SERVICE_ACCOUNT: Optional. If not provided, the default Compute Engine service account is used.
  • PROPERTY and PROPERTY_VALUE: Optional. Comma-separated list of Spark property=value pairs.
  • LABEL and LABEL_VALUE: Optional. Comma-separated list of label=value pairs.
  • LOG_LEVEL: Optional. Level of logging. Can be one of ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, or WARN. Default: INFO.
  • KMS_KEY: Optional. The Cloud Key Management Service key to use for encryption. If a key is not specified, data is encrypted at rest using a Google-owned and Google-managed encryption key.

    Example: projects/PROJECT_ID/regions/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME

HTTP method and URL:

POST https://dataproc.googleapis.com/v1/projects/PROJECT_ID/locations/REGION/batches

Request JSON body:

{
  "environmentConfig": {
    "executionConfig": {
      "subnetworkUri": "SUBNET",
      "kmsKey": "KMS_KEY",
      "serviceAccount": "SERVICE_ACCOUNT"
    }
  },
  "labels": {
    "LABEL": "LABEL_VALUE"
  },
  "runtimeConfig": {
    "version": "1.2",
    "properties": {
      "PROPERTY": "PROPERTY_VALUE"
    }
  },
  "sparkBatch": {
    "mainClass": "com.google.cloud.dataproc.templates.main.DataProcTemplate",
    "args": [
      "--template", "GCSTOGCS",
      "--templateProperty", "project.id=PROJECT_ID",
      "--templateProperty", "log.level=LOG_LEVEL",
      "--templateProperty", "gcs.gcs.input.location=CLOUD_STORAGE_INPUT_PATH",
      "--templateProperty", "gcs.gcs.input.format=INPUT_FILE_FORMAT",
      "--templateProperty", "gcs.gcs.output.location=CLOUD_STORAGE_OUTPUT_PATH",
      "--templateProperty", "gcs.gcs.output.format=OUTPUT_FILE_FORMAT",
      "--templateProperty", "gcs.gcs.write.mode=MODE",
      "--templateProperty", "gcs.gcs.temp.table=TEMP_TABLE",
      "--templateProperty", "gcs.gcs.temp.query=TEMP_QUERY"
    ],
    "jarFileUris": [
      "gs://dataproc-templates-binaries/TEMPLATE_VERSION/java/dataproc-templates.jar",
      "file:///usr/lib/spark/connector/spark-avro.jar"
    ]
  }
}

To send your request, use one of the following options:

curl (Linux, macOS, or Cloud Shell)

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://dataproc.googleapis.com/v1/projects/PROJECT_ID/locations/REGION/batches"

PowerShell (Windows)

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://dataproc.googleapis.com/v1/projects/PROJECT_ID/locations/REGION/batches" | Select-Object -Expand Content

You should receive a JSON response similar to the following:

{
  "name": "projects/PROJECT_ID/regions/REGION/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.dataproc.v1.BatchOperationMetadata",
    "batch": "projects/PROJECT_ID/locations/REGION/batches/BATCH_ID",
    "batchUuid": "de8af8d4-3599-4a7c-915c-798201ed1583",
    "createTime": "2023-02-24T03:31:03.440329Z",
    "operationType": "BATCH",
    "description": "Batch"
  }
}
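
To monitor the submitted batch afterward, one option is the gcloud CLI, where BATCH_ID is the trailing segment of the batch field in the response:

gcloud dataproc batches describe BATCH_ID \
--project=PROJECT_ID \
--region=REGION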
