
Spark: run InputFormat as singleton


I believe the best option here is to connect to your DB from the driver, not from the executors. That part of the system would be the bottleneck anyway.


Have you thought of queueing (buffering) the data, then using Spark Streaming to dequeue it and write it out with your output format?


If the data from your DB fits into the RAM of your Spark driver, you can load it there as a collection and then parallelize it into an RDD: https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#parallelized-collections