Move data from oracle to HDFS, process and move to Teradata from HDFS Move data from oracle to HDFS, process and move to Teradata from HDFS oracle oracle

Move data from oracle to HDFS, process and move to Teradata from HDFS


Looks like you have several questions so let's try to break it down.

Importing in HDFS

It seems you are looking for Sqoop. Sqoop is a tool that lets you easily transfer data in/out of HDFS, and can connect to various databases including Oracle natively. Sqoop is compatible with the Oracle JDBC thin driver. Here is how you would transfer from Oracle to HDFS:

sqoop import --connect jdbc:oracle:thin@myhost:1521/db --username xxx --password yyy --table tbl --target-dir /path/to/dir

For more information: here and here. Note than you can also import directly into a Hive table with Sqoop which could be convenient to do your analysis.

Processing

As you noted, since your data initially is relational, it is a good idea to use Hive to do your analysis since you might be more familiar with SQL-like syntax. Pig is more pure relational algebra and the syntax is NOT SQL-like, it is more a matter of preference but both approaches should work fine.

Since you can import data into Hive directly with Sqoop, your data should be directly ready to be processed after it is imported.

In Hive you could run your query and tell it to write the results in HDFS:

hive -e "insert overwrite directory '/path/to/output' select * from mytable ..."

Exporting into TeraData

Cloudera released last year a connector for Teradata for Sqoop as described here, so you should take a look as this looks like exactly what you want. Here is how you would do it:

sqoop export --connect jdbc:teradata://localhost/DATABASE=MY_BASE --username sqooptest --password xxxxx --table MY_DATA --export-dir /path/to/hive/output

The whole thing is definitely doable in whatever time period you want, in the end what will matter is the size of your cluster, if you want it quick then scale your cluster up as needed. The good thing with Hive and Sqoop is that processing will be distributed in your cluster, so you have total control over the schedule.


If you have concerns about the overhead or latency of moving the data from Oracle into HDFS, a possible commercial solution might be Dell Software’s SharePlex. They recently released a connector for Hadoop that would allow you to replicate table data from Oracle to Hadoop. More info here.

I’m not sure if you need to reprocess the entire data set each time or can possibly just use the deltas. SharePlex also supports replicating the change data to a JMS queue. It might be possible to create a Spout that reads from that queue. You could probably also build your own trigger based solution but it would be a bit of work.

As a disclosure, I work for Dell Software.