
Spark: run InputFormat as singleton


I believe the best option here is to connect to your DB from the driver, not from the executors. That part of the system would be the bottleneck anyway.


Have you thought of queueing (buffering) the data, then using Spark Streaming to dequeue it and write it out with your output format?


If the data from your DB fits into the RAM of your Spark driver, you can load it there as a collection and then parallelize it into an RDD: https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#parallelized-collections