Apache Spark-SQL vs Sqoop benchmarking while transferring data from RDBMS to HDFS


You are using the wrong tool for the job.

Sqoop will launch a slew of processes on the datanodes that will each make a connection to your database (see --num-mappers), and each will extract a part of the dataset. I don't think you can achieve that kind of read parallelism with Spark.

Get the dataset with Sqoop and then process it with Spark.
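
For comparison, a Sqoop import along these lines fans the extraction out over several mappers. This is only a sketch: the connection string, table name, split column (`id`) and target directory are placeholders, not taken from the question:

    sqoop import \
      --connect jdbc:netezza://hostname:port/dbname \
      --username user --password password \
      --driver org.netezza.Driver \
      --table POC_TEST \
      --split-by id \
      --num-mappers 14 \
      --target-dir /user/hdfs/poc_test

Each of the 14 mappers opens its own connection and pulls one slice of the split column's range, which is where Sqoop's throughput comes from.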


You can try the following:

  1. Read the data from Netezza without any partitions and with the fetchSize increased to a million.

    sqlContext.read.format("jdbc")
      .option("url", "jdbc:netezza://hostname:port/dbname")
      .option("dbtable", "POC_TEST")
      .option("user", "user")
      .option("password", "password")
      .option("driver", "org.netezza.Driver")
      .option("fetchSize", "1000000")
      .load()
      .registerTempTable("POC")
  2. Repartition the data before writing it to the final file.

    val df3 = df2.repartition(10) //to reduce the shuffle 
  3. ORC and Parquet formats are more optimized than plain text. Write the final output to Parquet/ORC.

    df3.write.format("ORC").save("hdfs://Hostname/test")


@amitabh Although marked as an answer, I disagree with it.

Once you give the predicates to partition the data while reading over JDBC, Spark will run a separate task for each partition. In your case the number of tasks should be 14 (you can confirm this using the Spark UI).
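
For reference, a predicate-partitioned read looks roughly like this. The MOD(id, 14) split expression and the column name id are made up for illustration; the point is that Spark launches one task per element of the predicates array:

    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", "user")
    props.setProperty("password", "password")
    props.setProperty("driver", "org.netezza.Driver")

    // One predicate per partition; Spark runs one task per predicate (14 here).
    // "id" is a hypothetical split column, not taken from the question.
    val predicates = (0 until 14).map(i => s"MOD(id, 14) = $i").toArray

    val df = sqlContext.read.jdbc(
      "jdbc:netezza://hostname:port/dbname",
      "POC_TEST",
      predicates,
      props)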

I notice that you are using local as the master, which provides only one core for the executors. Hence there is no parallelism, which is what is happening in your case.

Now, to get the same throughput as Sqoop, you need to make sure that these tasks are running in parallel. Theoretically this can be done either by:

  1. Using 14 executors with 1 core each
  2. Using 1 executor with 14 cores (the other end of the spectrum)

Typically, I would go with 4-5 cores per executor. So I would test the performance with 15/5 = 3 executors (I added 1 to 14 to account for 1 core for the driver running in cluster mode). Use spark.executor.cores and spark.executor.instances in sparkConf.set to play with the configs.
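
A minimal sketch of those settings, assuming a YARN cluster (executor sizing is handled differently on other cluster managers) and a placeholder app name:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("netezza-to-hdfs")          // placeholder name
      .set("spark.executor.instances", "3")   // 15 cores total / 5 cores per executor
      .set("spark.executor.cores", "5")

    val sc = new SparkContext(conf)

The same values can also be passed on the command line via --num-executors and --executor-cores instead of hard-coding them.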

If this does not significantly increase performance, the next thing would be to look at the executor memory.

Finally, I would tweak the application logic to look at mapRDD sizes, partition sizes and shuffle sizes.
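
For the partition sizes, a quick way to check the partition counts from the shell, reusing the df3 and POC names from the earlier answer (shuffle sizes are easiest to read off the Spark UI's Stages tab):

    df3.rdd.partitions.length                        // partitions that will be written out
    sqlContext.table("POC").rdd.partitions.length    // partitions of the JDBC read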