Loading data from RDBMS to Hadoop with multiple destinations
I see two options to do this:
Set up two separate Sqoop jobs, one copying into each cluster. This gives you two sets of active data rather than a true backup, since both are updated directly from the source. It also puts extra load on the relational database, because roughly twice as many connections are opened to copy the data.
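As a rough sketch of this first option, the two independent imports might look like the following. All hostnames, table names, paths, and credentials here are placeholders I made up for illustration, not details from an actual setup:

```shell
# Hypothetical example: the same source table is imported twice, once per cluster.
# Each import opens --num-mappers parallel JDBC connections, so running both
# roughly doubles the connection load on the database.
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user --password-file /user/etl/.db_password \
  --table orders \
  --target-dir hdfs://clusterA-nn:8020/data/orders \
  --num-mappers 4

sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user --password-file /user/etl/.db_password \
  --table orders \
  --target-dir hdfs://clusterB-nn:8020/data/orders \
  --num-mappers 4
```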
Use a single Sqoop job to load data into one cluster, then copy it to the other cluster using distcp -update (or distcp -append). A few advantages of this method:
This should reduce the load on the relational database system.
You can leverage the power of MapReduce for faster copying of data between clusters.
You can schedule your backup frequency using Oozie.
You can work with either the active copy or the backup copy.
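A minimal sketch of the second option, again with placeholder hostnames, paths, and credentials (none of these names come from the original question):

```shell
# Hypothetical example: import once into cluster A, then replicate to cluster B.
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user --password-file /user/etl/.db_password \
  --table orders \
  --target-dir hdfs://clusterA-nn:8020/data/orders \
  --num-mappers 4

# distcp runs as a MapReduce job, so the copy is parallelized across the cluster
# and never touches the database. -update copies only files that differ at the
# destination; -append would instead append to existing files where possible.
hadoop distcp -update \
  hdfs://clusterA-nn:8020/data/orders \
  hdfs://clusterB-nn:8020/data/orders
```

This two-step flow can then be wrapped in an Oozie workflow (a sqoop action followed by a distcp action) and triggered on whatever schedule you want via an Oozie coordinator.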
Let me know your thoughts, and if you have already settled on a solution, please share it.