Run Spark jobs with DataprocFileOutputCommitter

The DataprocFileOutputCommitter feature is an enhanced version of the open source FileOutputCommitter. It enables concurrent writes by Apache Spark jobs to an output location.

Limitations

The DataprocFileOutputCommitter feature supports Spark jobs run on Dataproc on Compute Engine clusters created with the following image versions:

  • 2.1 image versions 2.1.10 and higher

  • 2.0 image versions 2.0.62 and higher

Use DataprocFileOutputCommitter

To use this feature:

  1. Create a Dataproc on Compute Engine cluster with image version 2.1.10 or higher, or image version 2.0.62 or higher.

  2. Set spark.hadoop.mapreduce.outputcommitter.factory.class=org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory and spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false as job properties when you submit a Spark job to the cluster.

    • Google Cloud CLI example:
    gcloud dataproc jobs submit spark \
      --properties=spark.hadoop.mapreduce.outputcommitter.factory.class=org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory,spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false \
      --region=REGION \
      other args ...
    
    • Code example (Scala):
    // When setting keys directly on the Hadoop configuration, omit the
    // spark.hadoop. prefix; Spark strips that prefix only when forwarding
    // job properties to the Hadoop configuration.
    sc.hadoopConfiguration.set("mapreduce.outputcommitter.factory.class","org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory")
    sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs","false")
    
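If you script job submission, the two job properties from step 2 can be assembled into the comma-separated value that the gcloud --properties flag expects. A minimal Python sketch (the helper name is illustrative, not part of any Google Cloud SDK):

```python
# Sketch: build the --properties value for `gcloud dataproc jobs submit spark`.
# The property keys and values come from the steps above; the function name
# dataproc_committer_properties is a hypothetical helper for illustration.

def dataproc_committer_properties() -> str:
    """Return the comma-joined key=value pairs for the --properties flag."""
    props = {
        "spark.hadoop.mapreduce.outputcommitter.factory.class":
            "org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory",
        "spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs": "false",
    }
    # gcloud expects PROPERTY=VALUE pairs separated by commas.
    return ",".join(f"{key}={value}" for key, value in props.items())

print(dataproc_committer_properties())
```

Pass the resulting string to --properties, followed by the rest of your submit arguments, as in the gcloud example above.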

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated October 15, 2025 (UTC).