
how to efficiently move data from Kafka to an Impala table?


If you need to dump your Kafka data as-is to HDFS, the best option is to use Kafka Connect with the Confluent HDFS connector.

You can dump the data to Parquet files on HDFS that you can then load in Impala. I think you'll want to use the TimeBasedPartitioner to roll a new Parquet file every X milliseconds (tuned via the partition.duration.ms configuration parameter).

Adding something like this to your Kafka Connect configuration might do the trick:

# Don't flush less than 1000 messages to HDFS
flush.size=1000

# Dump to Parquet files
format.class=io.confluent.connect.hdfs.parquet.ParquetFormat
partitioner.class=io.confluent.connect.hdfs.partitioner.TimeBasedPartitioner

# One file every hour. If you change this, remember to change the filename format to reflect this change
partition.duration.ms=3600000

# Filename format
path.format='year'=YYYY/'month'=MM/'day'=dd/'hour'=HH/'minute'=mm
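
For completeness, here is a rough sketch of what a full standalone HDFS sink connector properties file could look like. The connector name, topic, HDFS namenode address and directory paths below are placeholders you'd replace with your own values, and the locale/timezone entries are required whenever the TimeBasedPartitioner is used:

name=kafka-to-impala-parquet
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1

# Placeholder topic and HDFS locations -- replace with your own
topics=my_topic
hdfs.url=hdfs://namenode:8020
topics.dir=/user/kafka/topics
logs.dir=/user/kafka/wal

# Parquet output, flushed at least every 1000 messages
format.class=io.confluent.connect.hdfs.parquet.ParquetFormat
flush.size=1000

# Hourly time-based partitioning, matching the path.format
partitioner.class=io.confluent.connect.hdfs.partitioner.TimeBasedPartitioner
partition.duration.ms=3600000
path.format='year'=YYYY/'month'=MM/'day'=dd/'hour'=HH/'minute'=mm
locale=en
timezone=UTC

On the Impala side you would then point an external table (STORED AS PARQUET, partitioned by year/month/day/hour) at the resulting HDFS directory and run REFRESH or ALTER TABLE ... RECOVER PARTITIONS whenever new files land so Impala picks them up.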