
What is the most efficient way to write from Kafka to HDFS, with files partitioned by date?


You should definitely check out Camus from LinkedIn. Camus is LinkedIn's Kafka->HDFS pipeline: a MapReduce job that does distributed data loads out of Kafka. Check out this post I have written for a simple example that fetches from the Twitter stream and writes to HDFS based on tweet timestamps.

The project is available on GitHub at https://github.com/linkedin/camus

Camus needs two main components: one for reading and decoding data from Kafka, and one for writing data to HDFS.

Decoding Messages read from Kafka

Camus has a set of Decoders that help decode messages coming from Kafka. Decoders extend com.linkedin.camus.coders.MessageDecoder, which implements the logic to partition data based on timestamp. A set of predefined Decoders is present in this directory, and you can write your own based on them: camus/camus-kafka-coders/src/main/java/com/linkedin/camus/etl/kafka/coders/
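For illustration, here is a rough, untested sketch of a custom decoder that attaches an event timestamp to each record, modeled loosely on the predefined string decoders. The generic types, the decode signature, and the extractTimestampMillis helper are assumptions on my part; verify them against the classes in the directory above for your Camus version.

    import java.util.Properties;

    import com.linkedin.camus.coders.CamusWrapper;
    import com.linkedin.camus.coders.MessageDecoder;

    // Sketch only: assumes the MessageDecoder<byte[], String> form used by the
    // predefined string decoders (verify against your Camus version).
    public class MyTimestampedMessageDecoder extends MessageDecoder<byte[], String> {

        @Override
        public void init(Properties props, String topicName) {
            // Read any decoder settings you need from props here.
        }

        @Override
        public CamusWrapper<String> decode(byte[] payload) {
            String record = new String(payload);
            // The (record, timestamp) constructor of CamusWrapper is what lets
            // Camus partition the output by the event time you supply.
            long timestamp = extractTimestampMillis(record);
            return new CamusWrapper<String>(record, timestamp);
        }

        // Hypothetical helper: replace with real parsing of your message format.
        private long extractTimestampMillis(String record) {
            return System.currentTimeMillis();
        }
    }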

Writing messages to HDFS

Camus needs a set of RecordWriterProvider classes, extending com.linkedin.camus.etl.RecordWriterProvider, that tell Camus what payload should be written to HDFS. A set of predefined RecordWriterProviders is present in this directory, and you can write your own based on them:

camus-etl-kafka/src/main/java/com/linkedin/camus/etl/kafka/common
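To wire the two components together, the decoder and the writer provider are selected in the Camus job properties, and the job is then run as a regular Hadoop job. The snippet below is a minimal, hypothetical configuration: the paths, broker addresses, and topic name are placeholders, and the exact property key names should be checked against the camus.properties example shipped in the repository.

    # Which decoder and writer provider to use (placeholder values; key names
    # as found in the bundled camus.properties example -- verify for your version).
    camus.message.decoder.class=com.linkedin.camus.etl.kafka.coders.JsonStringMessageDecoder
    etl.record.writer.provider.class=com.linkedin.camus.etl.kafka.common.StringRecordWriterProvider

    # Placeholder HDFS locations for output data and execution bookkeeping.
    etl.destination.path=/user/camus/topics
    etl.execution.base.path=/user/camus/exec
    etl.execution.history.path=/user/camus/exec/history

    # Placeholder Kafka and topic settings.
    kafka.brokers=broker1:9092,broker2:9092
    kafka.whitelist.topics=my_topic

The job itself is then typically launched with something like hadoop jar camus-example-<version>-shaded.jar com.linkedin.camus.etl.kafka.CamusJob -P camus.properties (check the repository's README for the exact invocation).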


If you're looking for a more real-time approach, you should check out StreamSets Data Collector. It's also an Apache-licensed open source tool for ingest.

The HDFS destination is configurable to write to time-based directories based on the template you specify. And it already includes a way to specify a field in your incoming messages to use to determine the time a message should be written. The config is called "Time Basis" and you can specify something like ${record:value("/ts")}.

*Full disclosure: I'm an engineer working on this tool.


If you are using Apache Kafka 0.9 or above, you can use the Kafka Connect API.

Check out https://github.com/confluentinc/kafka-connect-hdfs

This is a Kafka connector for copying data between Kafka and HDFS.
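For date-partitioned files specifically, the connector can be configured with a time-based partitioner. The following sink configuration is a rough sketch: the topic, HDFS URL, and sizes are placeholders, and the property names should be verified against the connector's documentation.

    name=hdfs-sink
    connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
    tasks.max=1
    # Placeholder topic and HDFS location.
    topics=my_topic
    hdfs.url=hdfs://namenode:8020
    flush.size=1000

    # Write files into date-based directories such as year=2016/month=05/day=12/.
    partitioner.class=io.confluent.connect.hdfs.partitioner.TimeBasedPartitioner
    partition.duration.ms=3600000
    path.format='year'=YYYY/'month'=MM/'day'=dd/
    locale=en
    timezone=UTC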