
Sqoop & Hadoop - How to join/merge old data and new data imported by Sqoop in lastmodified mode?


Loading changed data with Sqoop is a two-phase process.

  1. Phase 1 - load the changed data into a temporary (staging) table using the sqoop import utility.
  2. Phase 2 - merge the changed data with the old data using the sqoop-merge utility.
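
As a rough sketch of the two phases, assuming a MySQL source and HDFS directories: the connection string, table, column names, paths, jar file and class name below are all hypothetical placeholders, and the jar/class are assumed to come from an earlier import or from sqoop codegen. The script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
set -eu

# Hypothetical placeholders: replace with your own connection and paths.
JDBC_URL="jdbc:mysql://db.example.com/sales"
DB_USER="etl_user"
PASSWORD_FILE="/user/etl/.db_password"
TABLE="orders"
CHECK_COLUMN="last_modified"      # timestamp column updated on every change
LAST_VALUE="2024-01-01 00:00:00"  # high-water mark from the previous run
BASE_DIR="/data/orders/base"      # old data
STAGE_DIR="/data/orders/stage"    # changed data (phase 1 output)
MERGED_DIR="/data/orders/merged"  # merged result (phase 2 output)
JAR_FILE="orders.jar"             # assumed to exist from a previous import/codegen
CLASS_NAME="orders"
MERGE_KEY="id"                    # primary key used to match old and new rows

# Print the command by default; run it only when DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Phase 1: import only rows modified since LAST_VALUE into the staging dir.
import_delta() {
  run sqoop import \
    --connect "$JDBC_URL" --username "$DB_USER" --password-file "$PASSWORD_FILE" \
    --table "$TABLE" \
    --incremental lastmodified \
    --check-column "$CHECK_COLUMN" --last-value "$LAST_VALUE" \
    --target-dir "$STAGE_DIR"
}

# Phase 2: merge staged rows onto the old data; for each key the newer row wins.
merge_delta() {
  run sqoop merge \
    --new-data "$STAGE_DIR" --onto "$BASE_DIR" \
    --target-dir "$MERGED_DIR" \
    --jar-file "$JAR_FILE" --class-name "$CLASS_NAME" \
    --merge-key "$MERGE_KEY"
}

import_delta
merge_delta
```

After a successful merge, the merged directory would typically replace the old base directory before the next run.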

If the table is small (say, a few million records), do a full load with sqoop import instead.

Sometimes it's possible to load only the latest partition. In that case, use the sqoop import utility with a custom query to load that partition; then, instead of merging, simply insert-overwrite the loaded partition into the target table, or copy the files. This is faster than sqoop merge.
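
A minimal sketch of the partition approach, assuming the source table has a load_date column that maps to a Hive partition and that a staging table (here orders_stage, hypothetical) is defined over the staging directory. All names, columns and paths are illustrative; the commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
set -eu

# Hypothetical placeholders: replace with your own values.
JDBC_URL="jdbc:mysql://db.example.com/sales"
DB_USER="etl_user"
PASSWORD_FILE="/user/etl/.db_password"
TABLE="orders"
SPLIT_COLUMN="id"
PART_COLUMN="load_date"
PART_VALUE="2024-01-01"
STAGE_DIR="/data/orders/stage/load_date=2024-01-01"
STAGE_TABLE="orders_stage"   # assumed external Hive table over the staging data
TARGET_TABLE="orders"        # assumed Hive table partitioned by load_date

# Print the command by default; run it only when DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Free-form queries must contain the literal token $CONDITIONS,
# which Sqoop replaces with each mapper's split predicate.
QUERY="SELECT id, amount FROM ${TABLE} WHERE ${PART_COLUMN} = '${PART_VALUE}' AND \$CONDITIONS"

# Step 1: import only the latest partition using the custom query.
import_partition() {
  run sqoop import \
    --connect "$JDBC_URL" --username "$DB_USER" --password-file "$PASSWORD_FILE" \
    --query "$QUERY" \
    --split-by "$SPLIT_COLUMN" \
    --target-dir "$STAGE_DIR"
}

# Step 2: replace that partition in the target table instead of merging.
overwrite_partition() {
  run hive -e "INSERT OVERWRITE TABLE ${TARGET_TABLE} PARTITION (${PART_COLUMN}='${PART_VALUE}') SELECT id, amount FROM ${STAGE_TABLE}"
}

import_partition
overwrite_partition
```

Only the affected partition is rewritten, so there is no pass over the rest of the old data as there is with a merge.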


You can change the existing Sqoop query (by specifying a new custom query) to fetch ALL the data from the source table instead of only the changed data. Refer to using_sqoop_to_move_data_into_hive. This would be the simplest way to accomplish this, i.e. doing a full data refresh instead of applying deltas.
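
One possible shape of such a full refresh into Hive, as a sketch only: the connection details, table names and paths are hypothetical, and the target is assumed to be a Hive table that can be overwritten wholesale. The command is printed rather than executed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
set -eu

# Hypothetical placeholders: replace with your own values.
JDBC_URL="jdbc:mysql://db.example.com/sales"
DB_USER="etl_user"
PASSWORD_FILE="/user/etl/.db_password"
TABLE="orders"
SPLIT_COLUMN="id"
HIVE_TABLE="sales.orders"
TMP_DIR="/tmp/sqoop/orders_full"   # scratch dir required when using --query

# Print the command by default; run it only when DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# No filter on a modification timestamp: every row is fetched.
# The literal $CONDITIONS token is still required by Sqoop.
QUERY="SELECT * FROM ${TABLE} WHERE \$CONDITIONS"

full_refresh() {
  run sqoop import \
    --connect "$JDBC_URL" --username "$DB_USER" --password-file "$PASSWORD_FILE" \
    --query "$QUERY" \
    --split-by "$SPLIT_COLUMN" \
    --target-dir "$TMP_DIR" --delete-target-dir \
    --hive-import --hive-overwrite --hive-table "$HIVE_TABLE"
}

full_refresh
```

Because the Hive table is overwritten on every run, no merge key or last-value bookkeeping is needed; the cost is re-reading the whole source table each time.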