Import data on HDFS to SQL Server or export data on HDFS to SQL Server

Both of your options use the same mechanism: Apache Sqoop's export tool. Using the licensed Microsoft SQL Server connector/driver jar should generally yield better performance for the task than the generic JDBC connector bundled with Apache Sqoop.
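A minimal sketch of such an export follows. All hostnames, database names, credentials, table names, and paths are hypothetical placeholders, and it assumes the SQL Server connector jar has already been dropped into Sqoop's lib directory:

```shell
# Push HDFS data into a SQL Server table via Sqoop's export tool.
# Everything below (host, db, user, table, paths) is a placeholder.
sqoop export \
  --connect "jdbc:sqlserver://dbhost:1433;databaseName=mydb" \
  --username sqoop_user \
  --password-file /user/sqoop/db.password \
  --table sales_fact \
  --export-dir /data/warehouse/sales_fact
```

The `--export-dir` points at the HDFS directory holding the records to write, and `--password-file` keeps the credential out of the process listing.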

In terms of maintenance, there should be little to none once you have it working. As long as the driver jar supports the SQL Server version in use, the export should continue to work as expected.

In terms of configuration, you may initially have to tune the -m value manually to set the parallelism of the export MapReduce job launched by the export tool. Too high a value can overload the database with concurrent writers, while too low a value leaves throughput on the table. Some trial and error is required to arrive at the right -m value, along with knowledge of your database's load periods, in order to set the parallelism appropriately.
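For illustration, the same hypothetical export with the mapper count set explicitly. The value 4 here is an arbitrary starting point for the trial-and-error tuning described above, not a recommendation:

```shell
# -m controls how many parallel map tasks (and thus DB connections) the
# export uses. Start low, watch DB load, and raise it gradually.
sqoop export \
  --connect "jdbc:sqlserver://dbhost:1433;databaseName=mydb" \
  --username sqoop_user \
  --password-file /user/sqoop/db.password \
  --table sales_fact \
  --export-dir /data/warehouse/sales_fact \
  -m 4
```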

The Apache Sqoop (v1) documentation for the export tool also lists a set of common reasons for export job failures; you may want to review those here.

On the MapReduce side, you may also want to dedicate a scheduler pool or queue to such external-writing jobs, since they may be business critical. Schedulers such as the FairScheduler and the CapacityScheduler let you define SLA-style guarantees per pool or queue, so that these jobs get adequate resources whenever they are launched.
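Once such a queue exists, the export can be directed into it at submission time. A sketch, assuming a CapacityScheduler queue with the hypothetical name etl_exports (the -D generic argument must come immediately after the tool name):

```shell
# Submit the export into a dedicated queue so the scheduler's capacity
# guarantees apply to it. Queue name and all connection details are placeholders.
sqoop export \
  -D mapreduce.job.queuename=etl_exports \
  --connect "jdbc:sqlserver://dbhost:1433;databaseName=mydb" \
  --username sqoop_user \
  --password-file /user/sqoop/db.password \
  --table sales_fact \
  --export-dir /data/warehouse/sales_fact
```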