How to access s3a:// files from Apache Spark?

hadoop apache-spark amazon-s3

Having experienced the difference between s3a and s3n first hand, I can say this is a very important piece of the stack to get right, and it's worth the frustration: 7.9 GB of data transferred over s3a took around 7 minutes, while the same 7.9 GB over s3n took 73 minutes (us-east-1 to us-west-1 in both cases, unfortunately; Redshift and Lambda being in us-east-1 at this time).

Here are the key parts, as of December 2015:

Your Spark cluster will need a Hadoop version 2.x or greater. If you use the Spark EC2 setup scripts and maybe missed it, the switch for using something other than 1.0 is to specify --hadoop-major-version 2 (which uses CDH 4.2 as of this writing).
You'll need to include what may at first seem to be an out-of-date AWS SDK library (built in 2014 as version 1.7.4) for versions of Hadoop as late as 2.7.1 (stable): aws-java-sdk 1.7.4. As far as I can tell, using this alongside the specific AWS SDK JARs for 1.10.8 hasn't broken anything.
You'll also need the hadoop-aws 2.7.1 JAR on the classpath. This JAR contains the class org.apache.hadoop.fs.s3a.S3AFileSystem.
In spark.properties you probably want some settings that look like this:
    spark.hadoop.fs.s3a.access.key=ACCESSKEY
    spark.hadoop.fs.s3a.secret.key=SECRETKEY
If you are using Hadoop 2.7 with Spark, the AWS client uses V2 as the default auth signature, while all the new AWS regions support only the V4 protocol. To use V4, pass the following settings to spark-submit; the endpoint (format: s3.<region>.amazonaws.com) must also be specified.

--conf "spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true"

--conf "spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true"
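
As a rough sketch of how the V4 flags and the endpoint fit together in one call, the script below builds the endpoint host from a region name and prints the resulting spark-submit command rather than running it. It assumes the endpoint is passed through the `fs.s3a.endpoint` Hadoop property (with the `spark.hadoop.` prefix); the helper `s3a_endpoint`, the example region, and `your-app.jar` are placeholders for illustration, not part of the original answer.

```shell
#!/bin/sh
# Hypothetical helper: build the S3 endpoint host for a region,
# following the s3.<region>.amazonaws.com format described above.
s3a_endpoint() {
  printf 's3.%s.amazonaws.com' "$1"
}

# Assumption: example region; override by exporting REGION beforehand.
REGION="${REGION:-eu-central-1}"
V4_OPT="-Dcom.amazonaws.services.s3.enableV4=true"

# Print the command instead of running it, so it can be reviewed first.
# your-app.jar is a placeholder for the actual application jar.
echo spark-submit \
  --conf "spark.hadoop.fs.s3a.endpoint=$(s3a_endpoint "$REGION")" \
  --conf "spark.executor.extraJavaOptions=$V4_OPT" \
  --conf "spark.driver.extraJavaOptions=$V4_OPT" \
  your-app.jar
```

Once the printed command looks right, drop the leading `echo` to actually submit the job.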

I've covered this list in more detail in a post I wrote as I worked my way through this process. It also covers all the exception cases I hit along the way, what I believe caused each one, and how to fix them.


I'm writing this answer for accessing files with S3A from Spark 2.0.1 on Hadoop 2.7.3.

Copy the AWS jars (hadoop-aws-2.7.3.jar and aws-java-sdk-1.7.4.jar), which ship with Hadoop by default,

Hint: If you are unsure of the jar locations, running the find command as a privileged user can help; the commands can be:
```
  find / -name "hadoop-aws*.jar"
  find / -name "aws-java-sdk*.jar"
```

into the Spark classpath, which holds all the Spark jars.

Hint: I can't point to the exact location here (it must be set in a property file), because I want this answer to stay generic across distributions and Linux flavors. The Spark classpath can be identified with the find command below:
```
  find / -name "spark-core*.jar"
```
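
The copy step described above could be sketched roughly as follows. The function name `copy_aws_jars` and the directories in the commented usage line are hypothetical; substitute whatever locations the find commands reported on your system.

```shell
#!/bin/sh
# copy_aws_jars SRC_DIR DEST_DIR
# Copies hadoop-aws*.jar and aws-java-sdk*.jar from SRC_DIR into DEST_DIR.
copy_aws_jars() {
  src="$1"
  dest="$2"
  for jar in "$src"/hadoop-aws*.jar "$src"/aws-java-sdk*.jar; do
    # A pattern that matched nothing stays literal, so skip non-existent paths.
    [ -e "$jar" ] || continue
    cp "$jar" "$dest"/
  done
}

# Hypothetical locations; replace with the directories found on your system.
# copy_aws_jars /opt/hadoop/share/hadoop/tools/lib /opt/spark/jars
```

Copying only the two matching jars, rather than the whole Hadoop tools directory, keeps unrelated libraries off the Spark classpath.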

Then set the following in `spark-defaults.conf`:

Hint: (Mostly it will be placed in /etc/spark/conf/spark-defaults.conf)

    # Make sure the jars are added to the CLASSPATH
    spark.yarn.jars=file://{spark/home/dir}/jars/*.jar,file://{hadoop/install/dir}/share/hadoop/tools/lib/*.jar
    spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
    spark.hadoop.fs.s3a.access.key={s3a.access.key}
    spark.hadoop.fs.s3a.secret.key={s3a.secret.key}
    # You can also set the 3 fs.s3a properties above at the Hadoop level, in `core-site.xml`, by removing the `spark.hadoop.` prefix.

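In spark-submit, include the jars (aws-java-sdk and hadoop-aws) in --driver-class-path if needed. Pass the option once, with the jars separated by a colon, because a repeated --driver-class-path replaces the earlier value.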

    spark-submit --master yarn \
      --driver-class-path {spark/jars/home/dir}/aws-java-sdk-1.7.4.jar:{spark/jars/home/dir}/hadoop-aws-2.7.3.jar \
      other options

Note:
Make sure the Linux user has read privileges before running the find command, to prevent Permission denied errors.


I got it working using the Spark 1.4.1 prebuilt binary with Hadoop 2.6. Make sure you set both spark.driver.extraClassPath and spark.executor.extraClassPath to point to the two jars (hadoop-aws and aws-java-sdk). If you run on a cluster, make sure your executors have access to the jar files on the cluster.
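
A minimal sketch of setting both classpath properties from the command line might look like this. The helper `join_classpath`, the jar paths under `/opt/jars`, and `your-app.jar` are all hypothetical; the jars need to exist at the same path on the driver and on every executor node.

```shell
#!/bin/sh
# join_classpath JAR...
# Prints its arguments joined with ':' (runs in a subshell so IFS is not leaked).
join_classpath() (
  IFS=':'
  printf '%s\n' "$*"
)

# Hypothetical jar locations; adjust to where the two jars live on your nodes.
JARS=$(join_classpath /opt/jars/hadoop-aws.jar /opt/jars/aws-java-sdk.jar)

# Print the command instead of running it, so it can be reviewed first.
# your-app.jar is a placeholder for the actual application jar.
echo spark-submit \
  --conf "spark.driver.extraClassPath=$JARS" \
  --conf "spark.executor.extraClassPath=$JARS" \
  your-app.jar
```

The same two properties can instead be placed in spark-defaults.conf if they should apply to every job.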
