
Handling remote dependencies for spark-submit in Spark 2.3 with Kubernetes


It works as it should with s3a:// URLs. Unfortunately, getting s3a running on the stock spark-hadoop2.7.3 build is problematic (mainly authentication), so I opted for building Spark with Hadoop 2.9.1, since S3A has seen significant development there.
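For reference, a cluster-mode submission that pulls the application jar and an extra dependency from S3 looks roughly like this (the API server address, image, bucket and class names below are placeholders, not my actual setup):

    bin/spark-submit \
        --master k8s://https://<api-server>:6443 \
        --deploy-mode cluster \
        --name my-job \
        --class com.example.MyJob \
        --conf spark.executor.instances=2 \
        --conf spark.kubernetes.container.image=<account>.dkr.ecr.us-east-1.amazonaws.com/spark:2.3.1-hadoop2.9.1 \
        --jars s3a://my-bucket/deps/extra-lib.jar \
        s3a://my-bucket/jars/my-job.jar

The init container in the driver and executor pods downloads the s3a:// URLs into the pod before the JVM starts, which is why the S3A classes and credentials have to be available in the image.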

I have created a gist with the steps needed to (roughly sketched after the list):

  • build spark with new hadoop dependencies
  • build the docker image for k8s
  • push image to ECR
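Roughly, those steps boil down to something like the following (the Hadoop profile/version, region, repository and tag are illustrative, and the ECR repositories are assumed to exist already; see the gist for the exact commands):

    # build a Spark 2.3 distribution against Hadoop 2.9.1 with Kubernetes support
    ./dev/make-distribution.sh --name hadoop-2.9.1 --tgz \
        -Pkubernetes -Phadoop-2.7 -Dhadoop.version=2.9.1

    # from the unpacked distribution, build the k8s image and push it to ECR
    $(aws ecr get-login --no-include-email --region us-east-1)
    ./bin/docker-image-tool.sh -r <account>.dkr.ecr.us-east-1.amazonaws.com -t 2.3.1-hadoop2.9.1 build
    ./bin/docker-image-tool.sh -r <account>.dkr.ecr.us-east-1.amazonaws.com -t 2.3.1-hadoop2.9.1 push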

The script also creates a second Docker image with the S3A dependencies added, plus base conf settings that enable S3A with IAM credentials, so running in AWS doesn't require putting an access/secret key in conf files or arguments.
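The conf for that is essentially along these lines (not a verbatim copy of the gist; the point is that credentials come from the instance profile / IAM role rather than from keys):

    # spark-defaults.conf in the S3A-enabled image
    spark.hadoop.fs.s3a.impl                      org.apache.hadoop.fs.s3a.S3AFileSystem
    spark.hadoop.fs.s3a.aws.credentials.provider  com.amazonaws.auth.InstanceProfileCredentialsProvider

The image additionally needs the hadoop-aws jar and the matching AWS SDK bundle jar in $SPARK_HOME/jars.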

I haven't run any production Spark jobs with the image yet, but I have tested that basic saving and loading to s3a URLs works.
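A minimal way to reproduce that check is a short spark-shell session against the same build, along these lines (the bucket is a placeholder; run locally you need the hadoop-aws/SDK jars on the classpath and AWS credentials available):

    $ ./bin/spark-shell --packages org.apache.hadoop:hadoop-aws:2.9.1
    scala> spark.range(100).write.mode("overwrite").parquet("s3a://my-bucket/s3a-smoke-test")
    scala> spark.read.parquet("s3a://my-bucket/s3a-smoke-test").count()   // expect 100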

I have yet to experiment with S3Guard, which uses DynamoDB to ensure that S3 writes/reads are consistent, similar to EMRFS.
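For anyone who wants to try it, S3Guard is switched on through S3A properties that point it at a DynamoDB table, roughly like this (table name and region are placeholders, and I haven't verified this myself):

    spark.hadoop.fs.s3a.metadatastore.impl        org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore
    spark.hadoop.fs.s3a.s3guard.ddb.table         spark-s3guard
    spark.hadoop.fs.s3a.s3guard.ddb.region        us-east-1
    spark.hadoop.fs.s3a.s3guard.ddb.table.create  true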


The Init container is created automatically for you by Spark.

For example, you can run

kubectl describe pod [name of your driver pod]

and you'll see the init container named spark-init.

You can also access the logs from the init container via a command like:

kubectl logs [name of your driver pod] -c spark-init

Caveat: I'm not running in AWS, but on a custom K8S cluster. My init container successfully runs and downloads dependencies from an HTTP server (but not from S3, strangely).
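If you want to confirm what the init container actually fetched, you can also list the dependency download directories inside the driver pod; the paths below are, as far as I know, the Spark 2.3 defaults for spark.kubernetes.mountDependencies.jarsDownloadDir and filesDownloadDir:

kubectl exec [name of your driver pod] -- ls -l /var/spark-data/spark-jars /var/spark-data/spark-files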