62

Hadoop 2.6 doesn't support s3a out of the box, so I've tried a series of solutions and fixes, including:

  • deploy with hadoop-aws and aws-java-sdk => cannot read environment variables for credentials
  • add hadoop-aws into Maven => various transitive dependency conflicts

Has anyone successfully made both work?

mrsrinivas
35.6k13 gold badges133 silver badges132 bronze badges
asked May 21, 2015 at 23:24
3
  • Which version of Apache Spark are you using? Commented May 21, 2015 at 23:45
  • Related: SPARK-7442 Commented May 22, 2015 at 21:34
  • 1.3.1, Scala 2.10.4, Hadoop 2.6. I just found that s3:// and s3n:// also don't work out of the box (they only work on Hadoop 2.4) Commented May 23, 2015 at 21:45

11 Answers

63

Having experienced first hand the difference between s3a and s3n (7.9 GB of data transferred via s3a took around 7 minutes, while the same 7.9 GB via s3n took 73 minutes; us-east-1 to us-west-1 in both cases unfortunately, with Redshift and Lambda being us-east-1 at this time), this is a very important piece of the stack to get correct and it's worth the frustration.

Here are the key parts, as of December 2015:

  1. Your Spark cluster will need a Hadoop version 2.x or greater. If you use the Spark EC2 setup scripts and maybe missed it, the switch for using something other than 1.0 is to specify --hadoop-major-version 2 (which uses CDH 4.2 as of this writing).

  2. You'll need to include what may at first seem to be an out of date AWS SDK library (built in 2014 as version 1.7.4) for versions of Hadoop as late as 2.7.1 (stable): aws-java-sdk 1.7.4. As far as I can tell using this along with the specific AWS SDK JARs for 1.10.8 hasn't broken anything.

  3. You'll also need the hadoop-aws 2.7.1 JAR on the classpath. This JAR contains the class org.apache.hadoop.fs.s3a.S3AFileSystem.

  4. In spark.properties you probably want some settings that look like this:

    spark.hadoop.fs.s3a.access.key=ACCESSKEY
    spark.hadoop.fs.s3a.secret.key=SECRETKEY

  5. If you are using Hadoop 2.7 with Spark, the AWS client uses V2 as the default auth signature, but all the newer AWS regions support only the V4 protocol. To use V4, pass these confs in spark-submit, and the endpoint (format: s3.<region>.amazonaws.com) must also be specified.

--conf "spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true

--conf "spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true

I've gone through this list in more detail in a post I wrote as I worked my way through this process. In addition, I've covered all the exception cases I hit along the way, what I believe to be the cause of each, and how to fix them.
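
Pulling the pieces above together, a minimal spark-submit sketch might look like the following; ACCESSKEY, SECRETKEY, the endpoint region, and the application JAR name are placeholders rather than values verified against this exact setup:

spark-submit \
 --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.1 \
 --conf spark.hadoop.fs.s3a.access.key=ACCESSKEY \
 --conf spark.hadoop.fs.s3a.secret.key=SECRETKEY \
 --conf "spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true" \
 --conf "spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true" \
 --conf spark.hadoop.fs.s3a.endpoint=s3.eu-central-1.amazonaws.com \
 your-app.jar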

Knight71
2,9595 gold badges39 silver badges66 bronze badges
answered Jan 1, 2016 at 3:37

3 Comments

This was helpful for me. The only dependency I ended up adding was "org.apache.hadoop" % "hadoop-aws" % "3.0.0-alpha2" at mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/…
Having the hadoop-aws 2.7.1 (or higher) JAR on the classpath solved the issue for me, but when running on Amazon EMR I didn't need this, so I made it a provided dependency; my sbt looks like "org.apache.hadoop" % "hadoop-aws" % "2.8.1" % Provided
Sorry, I hadn't seen the big update as it's a very old question; marked as accepted despite not having verified it myself
25

I'm writing this answer for accessing files with S3A from Spark 2.0.1 on Hadoop 2.7.3.

Copy the AWS JARs (hadoop-aws-2.7.3.jar and aws-java-sdk-1.7.4.jar) that ship with Hadoop by default

  • Hint: if you are unsure of the JAR locations, running the find command as a privileged user can be helpful:

     find / -name hadoop-aws*.jar
     find / -name aws-java-sdk*.jar
    

into the Spark classpath, which holds all the Spark JARs.

  • Hint: I can't point to the exact location directly (it must go in a property file), since I want to keep this answer generic across distributions and Linux flavors. The Spark classpath can be identified with the find command below:

     find / -name spark-core*.jar
    

Then set the following in spark-defaults.conf

Hint: it will usually be located at /etc/spark/conf/spark-defaults.conf

#make sure jars are added to CLASSPATH
spark.yarn.jars=file://{spark/home/dir}/jars/*.jar,file://{hadoop/install/dir}/share/hadoop/tools/lib/*.jar
spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem 
spark.hadoop.fs.s3a.access.key={s3a.access.key} 
spark.hadoop.fs.s3a.secret.key={s3a.secret.key} 
#you can set the above three s3a properties at the Hadoop level in `core-site.xml` as well, by removing the spark.hadoop prefix.

In spark-submit, include the JARs (aws-java-sdk and hadoop-aws) in --driver-class-path if needed.

spark-submit --master yarn \
 --driver-class-path {spark/jars/home/dir}/aws-java-sdk-1.7.4.jar \
 --driver-class-path {spark/jars/home/dir}/hadoop-aws-2.7.3.jar \
 other options

Note:

Make sure the Linux user has read privileges before running the find command, to prevent Permission denied errors.
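
Once the JARs and properties are in place, a quick sanity check is a short read against the bucket from spark-shell; the bucket and object path below are placeholders:

 // bucket and path are placeholders; replace with a real object in your bucket
 val sample = spark.read.text("s3a://your-bucket/path/to/some-file.txt")
 sample.show(5)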

answered Nov 8, 2016 at 4:22

2 Comments

With Zeppelin and Spark 2.2, I am able to connect to the S3 ca-central-1 region, which supports the V4 signature only. Set the below properties along with the jars (artifacts) suggested by mrsrinivas. ``` System.setProperty("com.amazonaws.services.s3.enableV4", "true") val hadoopConf = sc.hadoopConfiguration hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") hadoopConf.set("fs.s3a.endpoint", "s3.ca-central-1.amazonaws.com") hadoopConf.set("fs.s3a.access.key", accessKey) hadoopConf.set("fs.s3a.secret.key", secretKey) ```
13

I got it working using the Spark 1.4.1 prebuilt binary with Hadoop 2.6. Make sure you set both spark.driver.extraClassPath and spark.executor.extraClassPath to point to the two JARs (hadoop-aws and aws-java-sdk). If you run on a cluster, make sure your executors have access to the JAR files on the cluster.

Prasad Khode
6,77912 gold badges47 silver badges62 bronze badges
answered Aug 3, 2015 at 3:12

2 Comments

same problem: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 1.0 failed 4 times, most recent failure: Lost task 3.3 in stage 1.0 (TID 27, 10.122.113.63): java.io.IOException: No FileSystem for scheme: s3n
If it is the default for all S3 access, add the two variables in $SPARK_HOME/conf/spark-defaults.conf. deploymentzone.com/2015/12/20/s3a-on-spark-on-aws-ec2 is a good reference.
8

We're using Spark 1.6.1 with Mesos and we were getting lots of issues writing to S3 from Spark. I give credit to cfeduke for the answer. The slight change I made was adding Maven coordinates to the spark.jars.packages config in the spark-defaults.conf file. I tried with hadoop-aws:2.7.2 but was still getting lots of errors, so we went back to 2.7.1. Below are the changes in spark-defaults.conf that are working for us:

spark.jars.packages net.java.dev.jets3t:jets3t:0.9.0,com.google.guava:guava:16.0.1,com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.1
spark.hadoop.fs.s3a.access.key <MY ACCESS KEY>
spark.hadoop.fs.s3a.secret.key <MY SECRET KEY>
spark.hadoop.fs.s3a.fast.upload true
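
If you would rather not touch spark-defaults.conf, the same coordinates can be passed per job with spark-submit's --packages flag (the application JAR name here is a placeholder):

spark-submit --packages \
 net.java.dev.jets3t:jets3t:0.9.0,com.google.guava:guava:16.0.1,com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.1 \
 your-app.jar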

Thank you cfeduke for taking the time to write up your post. It was very helpful.

stevel
13.6k1 gold badge41 silver badges54 bronze badges
answered May 25, 2016 at 20:06


7

Here are the details as of October 2016, as presented at Spark Summit EU: Apache Spark and Object Stores.

Key points

  • The direct output committer is gone from Spark 2.0 due to risk/experience of data corruption.
  • There are some settings on the FileOutputCommitter to reduce renames, but not eliminate them
  • I'm working with some colleagues to do an O(1) committer, relying on Apache Dynamo to give us that consistency we need.
  • To use S3a, get your classpath right.
  • And be on Hadoop 2.7.z; 2.6.x had some problems, which were addressed by HADOOP-11571.
  • There's a PR under SPARK-7481 to pull everything into a Spark distro you build yourself. Otherwise, ask whoever supplies the binaries to do the work.
  • Hadoop 2.8 is going to add major perf improvements: HADOOP-11694.

Product placement: the read-performance side of HADOOP-11694 is included in HDP 2.5; the Spark and S3 documentation there might be of interest, especially the tuning options.

answered Nov 4, 2016 at 13:06


3

As you said, Hadoop 2.6 doesn't support s3a, and the latest Spark release, 1.6.1, doesn't support Hadoop 2.7, but Spark 2.0 definitely has no problem with Hadoop 2.7 and s3a.

For Spark 1.6.x, we made a dirty hack with the S3 driver from EMR... you can take a look at this doc: https://github.com/zalando/spark-appliance#emrfs-support

If you still want to try to use s3a in Spark 1.6.x, refer to the answer here: https://stackoverflow.com/a/37487407/5630352

answered May 30, 2016 at 21:40


2

Using Spark 1.4.1 pre-built with Hadoop 2.6, I am able to get s3a:// to work when deploying to a Spark Standalone cluster. I add the hadoop-aws and aws-java-sdk JAR files from the Hadoop 2.7.1 distro (found under $HADOOP_HOME/share/hadoop/tools/lib) to my SPARK_CLASSPATH environment variable in my $SPARK_HOME/conf/spark-env.sh file.
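
A minimal sketch of that spark-env.sh addition, assuming the Hadoop 2.7.1 distro is unpacked at /opt/hadoop-2.7.1 (the path is illustrative); note that a comment below points out SPARK_CLASSPATH is deprecated in favor of --driver-class-path and spark.executor.extraClassPath:

# $SPARK_HOME/conf/spark-env.sh (paths are illustrative)
HADOOP_TOOLS_LIB=/opt/hadoop-2.7.1/share/hadoop/tools/lib
export SPARK_CLASSPATH="$HADOOP_TOOLS_LIB/hadoop-aws-2.7.1.jar:$HADOOP_TOOLS_LIB/aws-java-sdk-1.7.4.jar:$SPARK_CLASSPATH"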

answered Aug 3, 2015 at 1:56

4 Comments

Really? Let me try your solution again on 1.4.1. I wasn't committed to s3a as issues.apache.org/jira/browse/SPARK-7442 is still marked as 'unresolved'
I've tried; it seems like something else is missing. I keep getting this error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 2.0 failed 4 times, most recent failure: Lost task 3.3 in stage 2.0 (TID 47, 10.122.113.63): java.io.IOException: No FileSystem for scheme: s3n
Oh, and here is a deprecation conflict in your solution: |SPARK_CLASSPATH was detected (set to '$value'). |This is deprecated in Spark 1.0+. | |Please instead use: | - ./spark-submit with --driver-class-path to augment the driver classpath | - spark.executor.extraClassPath to augment the executor classpath
Specifically, even as late as Dec 31 2015 you need to use an AWS SDK library compiled in 2014: aws-java-sdk 1.7.4; this answer above is the most accurate answer on this question.
2

You can also add the S3A dependencies to the classpath using spark-defaults.conf.

Example:

spark.driver.extraClassPath /usr/local/spark/jars/hadoop-aws-2.7.5.jar:/usr/local/spark/jars/aws-java-sdk-1.7.4.jar
spark.executor.extraClassPath /usr/local/spark/jars/hadoop-aws-2.7.5.jar:/usr/local/spark/jars/aws-java-sdk-1.7.4.jar

Or just:

spark.jars /usr/local/spark/jars/hadoop-aws-2.7.5.jar,/usr/local/spark/jars/aws-java-sdk-1.7.4.jar

Just make sure to match your AWS SDK version to the version of Hadoop. For more information about this, look at this answer: Unable to access S3 data using Spark 2.2
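
For reference, a hedged sbt sketch of that pairing on the Hadoop 2.7.x line (the hadoop-aws 2.7.x artifacts were built against aws-java-sdk 1.7.4; the versions simply mirror the JARs above):

libraryDependencies ++= Seq(
 "org.apache.hadoop" % "hadoop-aws" % "2.7.5",
 "com.amazonaws" % "aws-java-sdk" % "1.7.4"
)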

answered Mar 30, 2018 at 0:07


2

Here's a solution for PySpark (optionally behind a proxy):

import os

# `props` is an ordinary Python dict holding proxy and endpoint settings,
# e.g. props = {"proxy": {"host": ..., "port": ...}, "s3endpoint": {"irland": ...}}
def _configure_s3_protocol(spark,
                           proxy=props["proxy"]["host"],
                           port=props["proxy"]["port"],
                           endpoint=props["s3endpoint"]["irland"]):
    """
    Configure access to the s3 protocol
    https://sparkour.urizone.net/recipes/using-s3/
    AWS Regions and Endpoints
    https://docs.aws.amazon.com/general/latest/gr/rande.html
    """
    sc = spark.sparkContext
    # credentials come from the environment; proxy and endpoint come from `props`
    sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", os.environ.get("AWS_ACCESS_KEY_ID"))
    sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", os.environ.get("AWS_SECRET_ACCESS_KEY"))
    sc._jsc.hadoopConfiguration().set("fs.s3a.proxy.host", proxy)
    sc._jsc.hadoopConfiguration().set("fs.s3a.proxy.port", port)
    sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", endpoint)
    return spark
stevel
13.6k1 gold badge41 silver badges54 bronze badges
answered Apr 12, 2018 at 11:27

2 Comments

what is the props variable?
a normal python dictionary
1

Here is a Scala version that works fine with Spark 3.2.1 (pre-built) with Hadoop 3.3.1, accessing an S3 bucket from a non-AWS machine [typically a local setup on a developer machine].

sbt

 libraryDependencies ++= Seq(
 "org.apache.spark" %% "spark-core" % "3.2.1" % "provided",
 "org.apache.spark" %% "spark-streaming" % "3.2.1" % "provided",
 "org.apache.spark" %% "spark-sql" % "3.2.1" % "provided",
 "org.apache.hadoop" % "hadoop-aws" % "3.3.1",
 "org.apache.hadoop" % "hadoop-common" % "3.3.1" % "provided"
 )

spark program

 import org.apache.spark.sql.SparkSession

 val spark = SparkSession
   .builder()
   .master("local")
   .appName("Process parquet file")
   .config("spark.hadoop.fs.s3a.path.style.access", true)
   .config("spark.hadoop.fs.s3a.access.key", ACCESS_KEY)
   .config("spark.hadoop.fs.s3a.secret.key", SECRET_KEY)
   .config("spark.hadoop.fs.s3a.endpoint", ENDPOINT)
   .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
   // The enableV4 setting does not seem necessary for the eu-west-3 region,
   // see @stevel's comment below
   // .config("com.amazonaws.services.s3.enableV4", true)
   // .config("spark.driver.extraJavaOptions", "-Dcom.amazonaws.services.s3.enableV4=true")
   .config("spark.executor.instances", "4")
   .getOrCreate()

 spark.sparkContext.setLogLevel("ERROR")

 val df = spark.read.parquet("s3a://[BUCKET NAME]/.../???.parquet")
 df.show()

Note: the endpoint is of the form s3.[REGION].amazonaws.com, e.g. s3.eu-west-3.amazonaws.com

s3 configuration

To make the bucket available from outside of AWS, add a Bucket Policy of the form:

{
 "Version": "2012年10月17日",
 "Statement": [
 {
 "Sid": "Statement1",
 "Effect": "Allow",
 "Principal": {
 "AWS": "arn:aws:iam::[ACCOUNT ID]:user/[IAM USERNAME]"
 },
 "Action": [
 "s3:Delete*",
 "s3:Get*",
 "s3:List*",
 "s3:PutObject"
 ],
 "Resource": "arn:aws:s3:::[BUCKET NAME]/*"
 }
 ]
}

The ACCESS_KEY and SECRET_KEY supplied to the Spark configuration must be those of the IAM user referenced in the bucket policy.

answered Feb 24, 2022 at 20:16

2 Comments

Looks good, but you don't need (1) the fs.s3a.impl declaration, which is a superstition only found in Stack Overflow examples, or (2) the com.amazonaws.services.s3.enableV4=true sysprop, which was only needed for older AWS SDK versions.
@stevel Thanks - will edit the answer then
-2

I am using Spark version 2.3, and when I save a dataset using Spark like this:

dataset.write().format("hive").option("fileFormat", "orc").mode(SaveMode.Overwrite)
 .option("path", "s3://reporting/default/temp/job_application")
 .saveAsTable("job_application");

It works perfectly and saves my data into s3.

Stephen Rauch
50.1k32 gold badges118 silver badges143 bronze badges
answered Apr 22, 2018 at 15:11

1 Comment

If you are using "s3" then you are using Amazon EMR, so this is unrelated. And it worked for you in the absence of failures and observable inconsistencies. You cannot rely on that working in production, hence the S3A committers of Hadoop 3.1.
