Dataproc documentation

Dataproc is a managed Apache Spark and Apache Hadoop service that lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning. Dataproc automation helps you create clusters quickly, manage them easily, and save money by turning clusters off when you don't need them. With less time and money spent on administration, you can focus on your jobs and your data.

For more information, see the Dataproc product page.

Start your proof of concept with $300 in free credit

  • Develop with our latest Generative AI models and tools.
  • Get free usage of 20+ popular products, including Compute Engine and AI APIs.
  • No automatic charges, no commitment.

Keep exploring with 20+ always-free products.

Access 20+ free products for common use cases, including AI APIs, VMs, data warehouses, and more.

Explore self-paced training, use cases, reference architectures, and code samples with examples of how to use and connect Google Cloud services.
Training and tutorials

Run a Spark job on Google Kubernetes Engine

Submit Spark jobs to a running Google Kubernetes Engine cluster from the Dataproc Jobs API.

Introduction to Cloud Dataproc: Hadoop and Spark on Google Cloud

This course features a combination of lectures, demos, and hands-on labs to create a Dataproc cluster, submit a Spark job, and then shut down the cluster.

Machine Learning with Spark on Dataproc

This course features a combination of lectures, demos, and hands-on labs to implement logistic regression using a machine learning library for Apache Spark running on a Dataproc cluster to develop a model for data from a multivariable dataset.

Use cases

Workflow scheduling solutions

Schedule workflows on Google Cloud.

Migrate HDFS Data from On-Premises to Google Cloud

How to move data from on-premises Hadoop Distributed File System (HDFS) to Google Cloud.

Manage Java and Scala dependencies for Apache Spark

Recommended approaches to including dependencies when you submit a Spark job to a Dataproc cluster.
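
As a rough illustration of what including dependencies can look like, the sketch below builds a Spark job payload in the shape used by the Dataproc Jobs API. It shows two common options: shipping JAR files with the job through `jar_file_uris`, and letting Spark resolve Maven coordinates through the `spark.jars.packages` property. The helper name `build_spark_job`, the bucket path, the class name, and the Maven coordinate are all hypothetical, and this page does not say which approaches the linked guide actually recommends.

```python
from typing import Iterable


def build_spark_job(
    cluster_name: str,
    main_class: str,
    jar_uris: Iterable[str],
    maven_packages: Iterable[str] = (),
) -> dict:
    """Build a Spark job payload (hypothetical helper, not part of any library).

    jar_uris: JARs shipped with the job, for example an uber JAR in Cloud Storage.
    maven_packages: Maven coordinates that Spark resolves when the job starts.
    """
    spark_job = {
        "main_class": main_class,
        "jar_file_uris": list(jar_uris),
    }
    packages = list(maven_packages)
    if packages:
        # spark.jars.packages takes a comma-separated list of Maven coordinates.
        spark_job["properties"] = {"spark.jars.packages": ",".join(packages)}
    return {
        "placement": {"cluster_name": cluster_name},
        "spark_job": spark_job,
    }


# Illustrative values only: none of these names come from the documentation.
job = build_spark_job(
    cluster_name="example-cluster",
    main_class="com.example.WordCount",
    jar_uris=["gs://example-bucket/jars/wordcount.jar"],
    maven_packages=["org.apache.spark:spark-avro_2.12:3.3.0"],
)
```

The resulting dictionary could be passed as the `job` field of a job submission request. Whether to bundle dependencies into one JAR or resolve them at submit time depends on the job, so treat this as a starting point rather than a recommendation.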

Code samples

Python API samples

Call Dataproc APIs from Python.
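
To give a sense of what calling Dataproc from Python can look like, here is a minimal sketch of the create, submit, and delete lifecycle. It assumes the `google-cloud-dataproc` client library is installed and that application default credentials are configured. The helper names, machine types, and worker count are illustrative choices, not values taken from the linked samples.

```python
def build_cluster(project_id: str, cluster_name: str, workers: int = 2) -> dict:
    """Return a cluster payload with one master and `workers` worker nodes."""
    machine = "n1-standard-2"  # illustrative machine type, adjust as needed
    return {
        "project_id": project_id,
        "cluster_name": cluster_name,
        "config": {
            "master_config": {"num_instances": 1, "machine_type_uri": machine},
            "worker_config": {"num_instances": workers, "machine_type_uri": machine},
        },
    }


def build_pyspark_job(cluster_name: str, main_file_uri: str) -> dict:
    """Return a PySpark job payload that targets the named cluster."""
    return {
        "placement": {"cluster_name": cluster_name},
        "pyspark_job": {"main_python_file_uri": main_file_uri},
    }


def run_job_on_temporary_cluster(project_id, region, cluster_name, main_file_uri):
    """Create a cluster, run one PySpark job on it, then delete the cluster."""
    # Imported here so the payload builders above work without the library.
    from google.cloud import dataproc_v1

    # Dataproc clients are pointed at a regional endpoint.
    options = {"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    clusters = dataproc_v1.ClusterControllerClient(client_options=options)
    jobs = dataproc_v1.JobControllerClient(client_options=options)

    # create_cluster returns a long-running operation; result() waits for it.
    clusters.create_cluster(
        request={
            "project_id": project_id,
            "region": region,
            "cluster": build_cluster(project_id, cluster_name),
        }
    ).result()
    try:
        return jobs.submit_job_as_operation(
            request={
                "project_id": project_id,
                "region": region,
                "job": build_pyspark_job(cluster_name, main_file_uri),
            }
        ).result()
    finally:
        # Delete the cluster even if the job fails, so it stops accruing cost.
        clusters.delete_cluster(
            request={
                "project_id": project_id,
                "region": region,
                "cluster_name": cluster_name,
            }
        ).result()
```

Deleting the cluster in a `finally` block reflects the cost-saving idea of turning clusters off when they are not needed. A long-lived shared cluster would skip that step.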

Java API samples

Call Dataproc APIs from Java.

Node.js API samples

Call Dataproc APIs from Node.js.

Go API samples

Call Dataproc APIs from Go.

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2025-11-21 UTC.