How to load RDDs from S3 files from spark-shell?
org.apache.hadoop.fs.StreamCapabilities is in hadoop-common-3.1.jar. You are probably mixing versions of Hadoop JARs, which, as covered in the s3a troubleshooting docs, is doomed.
Spark shell works fine with the right JARs on the classpath. But ASF Spark releases don't work with Hadoop 3.x yet, due to some outstanding issues. Stick to Hadoop 2.8.x and you'll get good S3 performance without so much pain.
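To see whether Hadoop JARs are being mixed, it can help to print where each relevant class is actually loaded from. The following is a minimal sketch, not an official diagnostic: jarOf and reportHadoopJars are hypothetical helper names, and the class list is simply the classes this thread touches on. It uses only standard JVM reflection, so it can be pasted into spark-shell (with :paste) or sbt console.

```scala
// Return the JAR (or directory) a class was loaded from, if it can be determined.
// Hypothetical helper; uses only standard JVM reflection.
def jarOf(className: String): Option[String] =
  try {
    // initialize = false: locate the class without running its static initializers.
    val cls = Class.forName(className, false, Thread.currentThread.getContextClassLoader)
    // The code source is null for classes loaded by the bootstrap class loader.
    Option(cls.getProtectionDomain.getCodeSource)
      .flatMap(cs => Option(cs.getLocation))
      .map(_.toString)
  } catch {
    case _: ClassNotFoundException | _: LinkageError => None
  }

// Print the origin of the classes discussed in this thread.
def reportHadoopJars(): Unit =
  Seq(
    "org.apache.hadoop.fs.FileSystem",         // hadoop-common
    "org.apache.hadoop.fs.StreamCapabilities", // hadoop-common (3.1 per the answer above)
    "org.apache.hadoop.fs.s3a.S3AFileSystem"   // hadoop-aws
  ).foreach { name =>
    println(s"$name -> ${jarOf(name).getOrElse("not found on classpath")}")
  }
```

Calling reportHadoopJars() should show all three classes coming from JARs with the same Hadoop version number; a mismatch, or a "not found" for StreamCapabilities, points at the mixed-JAR problem described above. org.apache.hadoop.util.VersionInfo.getVersion reports the version of the hadoop-common JAR that was loaded.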
I found a path that fixed the issue, but I have no idea why.
- Create an SBT IntelliJ project
- Include the below dependencies and overrides
- Run the script (sans require statement) from sbt console
scalaVersion := "2.11.12"

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.0"
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "3.1.0"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "3.1.0"

dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-core" % "2.8.7"
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-databind" % "2.8.7"
dependencyOverrides += "com.fasterxml.jackson.module" % "jackson-module-scala_2.11" % "2.8.7"
The key part, naturally, is overriding the jackson dependencies.
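The script itself is not shown in this thread, so the following is only a sketch of what loading an RDD from S3 in sbt console might look like with the dependencies above. The bucket name and object path are hypothetical placeholders, and reading credentials from the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables is an assumption about the setup.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local-mode context for sbt console; spark-shell already provides this as `sc`.
val conf = new SparkConf().setAppName("s3a-rdd-sketch").setMaster("local[*]")
val sc = new SparkContext(conf)

// Assumption: credentials are available as environment variables.
sc.hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

// Hypothetical bucket and key; replace with a real s3a:// path.
val lines = sc.textFile("s3a://my-example-bucket/path/to/file.txt")

// textFile is lazy; count() forces the read and surfaces any classpath or credential errors.
println(lines.count())

sc.stop()
```

Using the s3a:// scheme routes the read through the S3A connector in hadoop-aws, which is where mismatched Hadoop or Jackson versions tend to show up as errors.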