How to efficiently move data from Kafka to an Impala table?
If you need to dump your Kafka data as-is to HDFS, the best option is Kafka Connect with the Confluent HDFS connector.
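For reference, a minimal sink-connector properties file might look something like this; the connector name, topic, and HDFS URL are placeholders you'd adapt to your cluster:

name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
# Hypothetical topic name; replace with your own
topics=your_topic
# URL of your HDFS namenode (placeholder)
hdfs.url=hdfs://namenode:8020
# Root directory the connector writes topic data under
topics.dir=/topics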
You can dump the data to Parquet files on HDFS, which you can then load into Impala. I think you'll want to use the TimeBasedPartitioner to roll a new Parquet file every X milliseconds (tuned via the partition.duration.ms configuration parameter).
Adding something like this to your Kafka Connect configuration might do the trick:
# Don't flush fewer than 1000 messages to HDFS
flush.size=1000

# Dump to Parquet files
format.class=io.confluent.connect.hdfs.parquet.ParquetFormat
partitioner.class=io.confluent.connect.hdfs.partitioner.TimeBasedPartitioner

# TimeBasedPartitioner also needs a locale and timezone to resolve the path
locale=en-US
timezone=UTC

# One file every hour. If you change this, remember to change the
# filename format to reflect this change
partition.duration.ms=3600000

# Filename format
path.format='year'=YYYY/'month'=MM/'day'=dd/'hour'=HH/'minute'=mm
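Once the Parquet files start landing, the Impala side can be just an external table over the connector's output directory. A minimal sketch, assuming the default /topics/your_topic output path and a hypothetical two-column schema (adjust both to your data):

-- Hypothetical columns; match these to your actual record schema
CREATE EXTERNAL TABLE kafka_events (
  id BIGINT,
  payload STRING
)
-- Partition columns mirror the path.format directories above
PARTITIONED BY (year STRING, month STRING, day STRING, hour STRING, minute STRING)
STORED AS PARQUET
LOCATION '/topics/your_topic';

-- Pick up directories written since the last load, then refresh metadata
ALTER TABLE kafka_events RECOVER PARTITIONS;
REFRESH kafka_events;

Scheduling those last two statements on roughly the same cadence as partition.duration.ms keeps new files queryable shortly after the connector closes them.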