Streaming data store in hive using spark

I would give it a try!

BUT Kafka -> Spark -> Hive is not the optimal pipeline for your use case.

  1. Hive is normally backed by HDFS, which is not designed for frequent small inserts/updates/selects. So your plan can end up with the following problems:
    • many small files, which leads to bad performance
    • your window becomes too small because the processing takes too long

Suggestion:

Option 1: use Kafka just as a buffer queue and design your pipeline like: Kafka -> HDFS (e.g. with Spark or Flume) -> batch Spark to Hive/Impala table
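
A minimal sketch of what the batch step could look like, assuming the raw events were already landed on HDFS as Parquet under /data/raw/events and should end up in a Hive table called events_archive (all paths, table and app names here are just placeholders):

    import org.apache.spark.sql.SparkSession

    object RawToHiveBatch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("raw-to-hive-batch")
          .enableHiveSupport()   // so saveAsTable goes to the Hive metastore
          .getOrCreate()

        // read the many small raw files that Spark/Flume dumped on HDFS
        val raw = spark.read.parquet("/data/raw/events")

        // compact into fewer, bigger files and append them to the Hive table
        raw.coalesce(4)
          .write
          .format("parquet")
          .mode("append")
          .saveAsTable("events_archive")

        spark.stop()
      }
    }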

Option 2:

  • Kafka -> Flume/Spark to HBase/Kudu -> batch Spark to Hive/Impala
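
For the streaming hop of option 2, a rough sketch with Spark Structured Streaming and the Kudu Spark connector could look like the following (the Kudu master address, topic, table and checkpoint names are assumptions, parsing the Kafka value into real columns is left out, and the batch schema must match the Kudu table):

    import org.apache.kudu.spark.kudu.KuduContext
    import org.apache.spark.sql.{DataFrame, SparkSession}

    object KafkaToKudu {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("kafka-to-kudu").getOrCreate()
        val kuduContext = new KuduContext("kudu-master:7051", spark.sparkContext)

        // read the raw events from Kafka
        val events = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "events")
          .load()
          .selectExpr("CAST(key AS STRING) AS id", "CAST(value AS STRING) AS payload")

        // write every micro batch into the Kudu table
        val query = events.writeStream
          .option("checkpointLocation", "/tmp/kafka-to-kudu-chk")
          .foreachBatch { (batch: DataFrame, _: Long) =>
            kuduContext.insertRows(batch, "impala::default.events_recent")
          }
          .start()

        query.awaitTermination()
      }
    }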

Option 1 has no "real-time" analysis option; freshness depends on how often you run the batch Spark job.

Option 2 is a good choice that I would recommend: store roughly the last 30 days in HBase and all older data in Hive/Impala. With a view you will be able to join new and old data for real-time analysis. Kudu makes the architecture even easier.
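
Given a SparkSession called spark (the same statement could also be run directly in Impala), such a view could be a simple UNION ALL over the two tables; the table names are assumptions and both tables need the same schema:

    // one logical table over the "hot" HBase/Kudu data and the Hive archive
    spark.sql("""
      CREATE VIEW IF NOT EXISTS events_all AS
      SELECT * FROM events_recent    -- last ~30 days, HBase/Kudu backed
      UNION ALL
      SELECT * FROM events_archive   -- older data in Hive/Parquet
    """)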

Saving data into Hive tables can be tricky if you want to partition it and query it via HiveQL.

But basically it would work like the following:

    xml.write.format("parquet").mode("append").saveAsTable("test_ereignis_archiv")
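
If the table should also be partitioned, for example by an event date column (the column name event_date is just an assumption and must exist in the DataFrame), the same write could look like this:

    // partitioned write, so HiveQL queries can prune by event_date
    xml.write
      .format("parquet")
      .partitionBy("event_date")
      .mode("append")
      .saveAsTable("test_ereignis_archiv")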

BR