
how to efficiently move data from Kafka to an Impala table?


If you need to dump your Kafka data as-is to HDFS, the best option is to use Kafka Connect with the Confluent HDFS connector.

You can dump the data to Parquet files on HDFS that you can then load in Impala. I think you'll want to use the TimeBasedPartitioner to roll a new Parquet file every X milliseconds (tuned via the partition.duration.ms configuration parameter).

Adding something like this to your Kafka Connect configuration might do the trick:

# Don't flush less than 1000 messages to HDFS
flush.size=1000

# Dump to Parquet files
format.class=io.confluent.connect.hdfs.parquet.ParquetFormat
partitioner.class=io.confluent.connect.hdfs.partitioner.TimeBasedPartitioner

# One file every hour. If you change this, remember to change the filename format to reflect this change
partition.duration.ms=3600000

# Filename format
path.format='year'=YYYY/'month'=MM/'day'=dd/'hour'=HH/'minute'=mm
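
For completeness, here is a rough sketch of what a full standalone HDFS sink connector properties file could look like. The connector name, topic, HDFS namenode address and directory paths below are placeholders you'd replace with your own values, and the locale/timezone entries are required whenever the TimeBasedPartitioner is used:

name=kafka-to-impala-parquet
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1

# Placeholder topic and HDFS locations -- replace with your own
topics=my_topic
hdfs.url=hdfs://namenode:8020
topics.dir=/user/kafka/topics
logs.dir=/user/kafka/wal

# Parquet output, flushed at least every 1000 messages
format.class=io.confluent.connect.hdfs.parquet.ParquetFormat
flush.size=1000

# Hourly time-based partitioning, matching the path.format
partitioner.class=io.confluent.connect.hdfs.partitioner.TimeBasedPartitioner
partition.duration.ms=3600000
path.format='year'=YYYY/'month'=MM/'day'=dd/'hour'=HH/'minute'=mm
locale=en
timezone=UTC

On the Impala side you would then point an external table (STORED AS PARQUET, partitioned by year/month/day/hour) at the resulting HDFS directory and run REFRESH or ALTER TABLE ... RECOVER PARTITIONS whenever new files land so Impala picks them up.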