Handling remote dependencies for spark-submit in Spark 2.3 with Kubernetes
It works as it should with s3a:// URLs. Unfortunately, getting S3A running on the stock spark-hadoop2.7.3 build is problematic (mainly around authentication), so I opted to build Spark against Hadoop 2.9.1, where S3A has seen significant development.
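For reference, the submission looks roughly like this (a sketch: the API-server address, image name, class, and bucket/jar path are all placeholders for your own values):

```shell
# Hypothetical example: cluster-mode submission against Kubernetes,
# with the application jar fetched from S3 via the s3a:// scheme.
bin/spark-submit \
  --master k8s://https://<k8s-apiserver>:6443 \
  --deploy-mode cluster \
  --name my-app \
  --class com.example.MyApp \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=<registry>/spark:v2.3.0-hadoop2.9.1 \
  s3a://my-bucket/jars/my-app.jar
```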
I have created a gist with the steps needed to:
- build spark with new hadoop dependencies
- build the docker image for k8s
- push image to ECR
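The steps above boil down to something like the following (a sketch from memory, not the gist verbatim: the Maven profile/flags may need adjusting for your environment, and the ECR registry URL and tag are placeholders):

```shell
# 1. Build a Spark distribution against Hadoop 2.9.1
#    (assumes the hadoop-2.7 profile with an overridden hadoop.version).
./dev/make-distribution.sh --name hadoop-2.9.1 --tgz \
  -Pkubernetes -Phadoop-2.7 -Dhadoop.version=2.9.1

# 2. Build the Docker image for Kubernetes with the tool shipped in Spark 2.3.
bin/docker-image-tool.sh \
  -r <account>.dkr.ecr.<region>.amazonaws.com -t v2.3.0-hadoop2.9.1 build

# 3. Log in to ECR (aws CLI v1 syntax) and push the image.
$(aws ecr get-login --no-include-email --region <region>)
bin/docker-image-tool.sh \
  -r <account>.dkr.ecr.<region>.amazonaws.com -t v2.3.0-hadoop2.9.1 push
```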
The script also creates a second Docker image with the S3A dependencies added, plus base conf settings that enable S3A with IAM credentials, so running in AWS doesn't require putting an access/secret key in conf files or arguments.
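The IAM-based settings amount to roughly this (a sketch: the credentials-provider class is the standard one from the AWS SDK, and it assumes your nodes have an instance profile attached):

```
# spark-defaults.conf (sketch) - use the instance profile instead of keys
spark.hadoop.fs.s3a.impl                      org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.aws.credentials.provider  com.amazonaws.auth.InstanceProfileCredentialsProvider
```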
I haven't run any production Spark jobs with the image yet, but I have verified that basic saving and loading against s3a:// URLs works.
I have yet to experiment with S3Guard, which uses DynamoDB to ensure that S3 reads/writes are consistent, similar to EMRFS.
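If I do get to it, enabling S3Guard should amount to something like the following (Hadoop 2.9 property names; untested on my side, and the table name is a placeholder):

```
# spark-defaults.conf additions (sketch, untested) - enable S3Guard
spark.hadoop.fs.s3a.metadatastore.impl       org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore
spark.hadoop.fs.s3a.s3guard.ddb.table        my-s3guard-table
spark.hadoop.fs.s3a.s3guard.ddb.table.create true
```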
The init container is created automatically for you by Spark.
For example, you can run
kubectl describe pod [name of your driver pod] and you'll see the init container named spark-init.
You can also access the logs from the init container via a command like:
kubectl logs [name of your driver pod] -c spark-init
Caveat: I'm not running in AWS, but on a custom K8s cluster. My init container successfully downloads dependencies from an HTTP server (but, strangely, not from S3).