Apache Spark-SQL vs Sqoop benchmarking while transferring data from RDBMS to HDFS


You are using the wrong tool for the job.

Sqoop will launch a slew of processes on the datanodes that will each make a connection to your database (see --num-mappers), and each will extract a part of the dataset. I don't think you can achieve that kind of read parallelism with Spark.

Get the dataset with Sqoop and then process it with Spark.
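
For comparison, a Sqoop import along these lines fans the extraction out over several mappers. This is only a sketch: the connection string, table name, split column (`id`) and target directory are placeholders, not taken from the question:

    sqoop import \
      --connect jdbc:netezza://hostname:port/dbname \
      --username user --password password \
      --driver org.netezza.Driver \
      --table POC_TEST \
      --split-by id \
      --num-mappers 14 \
      --target-dir /user/hdfs/poc_test

Each of the 14 mappers opens its own connection and pulls one slice of the split column's range, which is where Sqoop's throughput comes from.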


You can try the following:

  1. Read the data from Netezza without any partitions and with the fetchSize increased to a million.

    sqlContext.read.format("jdbc")
      .option("url", "jdbc:netezza://hostname:port/dbname")
      .option("dbtable", "POC_TEST")
      .option("user", "user")
      .option("password", "password")
      .option("driver", "org.netezza.Driver")
      .option("fetchSize", "1000000")
      .load()
      .registerTempTable("POC")
  2. Repartition the data before writing it to the final file.

    val df3 = df2.repartition(10) //to reduce the shuffle 
  3. ORC and Parquet formats are more optimized than plain text. Write the final output to Parquet/ORC.

    df3.write.format("ORC").save("hdfs://Hostname/test")


@amitabh Although marked as an answer, I disagree with it.

Once you give the predicates to partition the data while reading over JDBC, Spark will run a separate task for each partition. In your case the number of tasks should be 14 (you can confirm this using the Spark UI).
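
For reference, a predicate-partitioned read looks roughly like this. The MOD(id, 14) split expression and the column name id are made up for illustration; the point is that Spark launches one task per element of the predicates array:

    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", "user")
    props.setProperty("password", "password")
    props.setProperty("driver", "org.netezza.Driver")

    // One predicate per partition; Spark runs one task per predicate (14 here).
    // "id" is a hypothetical split column, not taken from the question.
    val predicates = (0 until 14).map(i => s"MOD(id, 14) = $i").toArray

    val df = sqlContext.read.jdbc(
      "jdbc:netezza://hostname:port/dbname",
      "POC_TEST",
      predicates,
      props)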

I notice that you are using local as the master, which provides only one core for the executors. Hence there is no parallelism, which is what is happening in your case.

Now, to get the same throughput as Sqoop, you need to make sure that these tasks are running in parallel. Theoretically this can be done either by:

  1. Using 14 executors with 1 core each
  2. Using 1 executor with 14 cores (the other end of the spectrum)

Typically, I would go with 4-5 cores per executor. So I would test the performance with 15/5 = 3 executors (I added 1 to 14 to account for 1 core for the driver running in cluster mode). Use spark.executor.cores and spark.executor.instances in sparkConf.set to play with the configs.
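
A minimal sketch of those settings, assuming a YARN cluster (executor sizing is handled differently on other cluster managers) and a placeholder app name:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("netezza-to-hdfs")          // placeholder name
      .set("spark.executor.instances", "3")   // 15 cores total / 5 cores per executor
      .set("spark.executor.cores", "5")

    val sc = new SparkContext(conf)

The same values can also be passed on the command line via --num-executors and --executor-cores instead of hard-coding them.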

If this does not significantly increase performance, the next thing would be to look at the executor memory.

Finally, I would tweak the application logic to look at mapRDD sizes, partition sizes and shuffle sizes.
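
For the partition sizes, a quick way to check the partition counts from the shell, reusing the df3 and POC names from the earlier answer (shuffle sizes are easiest to read off the Spark UI's Stages tab):

    df3.rdd.partitions.length                        // partitions that will be written out
    sqlContext.table("POC").rdd.partitions.length    // partitions of the JDBC read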