
Loading data from RDBMS to Hadoop with multiple destinations


I see two options to do this:

  • Set up two separate Sqoop jobs, one per cluster. Both clusters are then updated directly from the source, so you end up with two active data sets rather than a primary and a backup. The cost is extra load on the relational database, since roughly twice as many connections are opened to copy the same data.

  • Use a single Sqoop job to load the data into one cluster, then copy it to the other cluster with distcp -update (optionally adding -append, which DistCp accepts only together with -update). A few advantages of this method:

    • This should reduce the load on the relational database system.

    • distcp runs as a MapReduce job, so the copy between clusters is parallelized across mappers.

    • You can schedule the backup frequency with Oozie.

    • You can work on either the active copy or the backup copy.
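
To make the second option concrete, here is a minimal sketch of the two steps as a small Python wrapper that builds the Sqoop and DistCp command lines. The JDBC URL, table name, and HDFS paths are hypothetical placeholders, not values from the question, and credentials (for example --username and --password-file) are left out. It defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess

# Hypothetical placeholders: replace with the values for your environment.
JDBC_URL = "jdbc:mysql://db.example.com/sales"
TABLE = "orders"
PRIMARY_DIR = "hdfs://primary-nn:8020/data/orders"
BACKUP_DIR = "hdfs://backup-nn:8020/data/orders"


def sqoop_import_cmd(jdbc_url, table, target_dir, mappers=4):
    """Build the Sqoop command that loads one table into the primary cluster."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(mappers),
    ]


def distcp_cmd(src, dst, append=False):
    """Build the DistCp command that syncs the primary copy to the backup."""
    cmd = ["hadoop", "distcp", "-update"]
    if append:
        # -append is only added on top of -update, never on its own.
        cmd.append("-append")
    return cmd + [src, dst]


def run_pipeline(dry_run=True):
    """Run (or just print) the import followed by the cross-cluster copy."""
    steps = [
        sqoop_import_cmd(JDBC_URL, TABLE, PRIMARY_DIR),
        distcp_cmd(PRIMARY_DIR, BACKUP_DIR),
    ]
    for step in steps:
        if dry_run:
            print(shlex.join(step))
        else:
            # check=True stops the pipeline if the import fails,
            # so a failed load is never copied to the backup cluster.
            subprocess.run(step, check=True)
    return steps


if __name__ == "__main__":
    run_pipeline(dry_run=True)
```

The database is contacted only by the first command; the second one moves data purely between the two HDFS clusters. In practice the same two steps could be chained as actions in an Oozie workflow instead of a script.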

Let me know your thoughts, and if you have already settled on a solution, please share it.