How to load RDDs from S3 files from spark-shell?

org.apache.hadoop.fs.StreamCapabilities is in hadoop-common-3.1.jar. You are probably mixing versions of Hadoop JARs, which, as covered in the s3a troubleshooting docs, is doomed to fail.

The Spark shell works fine with the right JARs on the classpath. But the ASF Spark releases don't work with Hadoop 3.x yet, due to some outstanding issues. Stick to Hadoop 2.8.x and you'll get good S3 performance without so much pain.
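
For reference, a quick way to sanity-check the classpath from inside spark-shell, and the usual s3a read pattern once the JARs line up; this is a sketch, not part of the original answer, and the credentials and bucket path below are illustrative placeholders:

    // Check which Hadoop version the shell actually loaded,
    // to catch mixed JARs before blaming S3A itself.
    org.apache.hadoop.util.VersionInfo.getVersion

    // With a consistent set of 2.8.x JARs, the usual s3a read works.
    // Credentials and path here are placeholders, not real values.
    sc.hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
    sc.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
    val rdd = sc.textFile("s3a://some-bucket/some/prefix/data.txt")
    rdd.take(5).foreach(println)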


I found a setup that fixed the issue, though I have no idea why it works.

  1. Create an SBT IntelliJ project
  2. Include the below dependencies and overrides
  3. Run the script (sans the require statement) from the sbt console; a sketch of such a script appears below

    scalaVersion := "2.11.12"

    libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.0"
    libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "3.1.0"
    libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "3.1.0"

    dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-core" % "2.8.7"
    dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-databind" % "2.8.7"
    dependencyOverrides += "com.fasterxml.jackson.module" % "jackson-module-scala_2.11" % "2.8.7"

The key part, naturally, is overriding the Jackson dependencies.
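
For completeness, here is a sketch of the kind of script step 3 refers to, run from the sbt console; the bucket name, object path, and credential lookup are illustrative assumptions, not taken from the original answer:

    import org.apache.spark.{SparkConf, SparkContext}

    // A local SparkContext; with the overridden Jackson JARs on the
    // classpath, the version clash that breaks spark-shell goes away.
    val conf = new SparkConf().setAppName("s3a-read").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Credentials via environment variables; placeholders for illustration.
    sc.hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
    sc.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

    // Hypothetical bucket and prefix.
    val rdd = sc.textFile("s3a://some-bucket/some/prefix/part-*.txt")
    println(rdd.count())
    sc.stop()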