
Spark Streaming with a dynamic lookup table


You have two options here.

The first is to use the foreachRDD output operation on your DStream. The body of foreachRDD executes on the driver side, which means you can create new RDDs there. You can keep a timestamp of the last load and re-read the file from HDFS every 10-15 minutes.
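A minimal sketch of this approach, assuming a CSV lookup file at a hypothetical HDFS path (`hdfs:///data/lookup.csv`) and a DStream of string keys; the refresh check runs on the driver once per batch, and executors only see the broadcast value:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.streaming.dstream.DStream

object TimedLookup {
  private val refreshIntervalMs = 10 * 60 * 1000L // re-read every 10 minutes
  @volatile private var lastLoad = 0L
  @volatile private var table: Broadcast[Map[String, String]] = _

  // Runs on the driver; re-reads the HDFS file once the interval has elapsed
  def current(sc: SparkContext): Broadcast[Map[String, String]] = {
    val now = System.currentTimeMillis()
    if (table == null || now - lastLoad > refreshIntervalMs) {
      if (table != null) table.unpersist()
      val m = sc.textFile("hdfs:///data/lookup.csv") // hypothetical path
        .map(_.split(','))
        .map(a => (a(0), a(1)))
        .collectAsMap().toMap
      table = sc.broadcast(m)
      lastLoad = now
    }
    table
  }

  def enrich(stream: DStream[String]): Unit =
    stream.foreachRDD { rdd =>
      // foreachRDD body executes on the driver, so the refresh check is cheap
      val lookup = current(rdd.sparkContext)
      rdd.map(k => (k, lookup.value.getOrElse(k, "unknown")))
        // hypothetical output; timestamped so each batch writes a fresh dir
        .saveAsTextFile(s"hdfs:///out/enriched-${System.currentTimeMillis()}")
    }
}
```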

The second is to read the file inside a transform operation over the DStream and keep its contents in memory. With this approach the whole lookup table ends up being re-read and shipped to the executors on every batch, which is not efficient.
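For contrast, a sketch of this second approach, assuming a keyed DStream[(String, String)] and the same hypothetical lookup path; the join forces the full lookup file to be re-read and shuffled to the executors on each batch:

```scala
import org.apache.spark.streaming.dstream.DStream

// Transform-based variant: the lookup file is loaded again for every batch
def enrichPerBatch(
    stream: DStream[(String, String)]): DStream[(String, (String, String))] =
  stream.transform { rdd =>
    val lookup = rdd.sparkContext
      .textFile("hdfs:///data/lookup.csv") // hypothetical path
      .map(_.split(','))
      .map(a => (a(0), a(1)))
    rdd.join(lookup) // full shuffle of events plus lookup rows, every batch
  }
```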

I'd recommend the first approach. To be more precise, you can store a flag somewhere recording when the data was last updated (for instance, in HBase or ZooKeeper), and keep a copy of it in your Spark application. On each batch you read the flag and compare it to the locally stored value: if they differ, you re-read the lookup table; if not, you keep working with the old one.
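A sketch of that version-check variant; readRemoteVersion() is a hypothetical stub you would implement against the HBase cell or ZooKeeper znode your loader updates, and the rest is plain Spark:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

object VersionedLookup {
  @volatile private var localVersion = -1L
  @volatile private var table: Broadcast[Map[String, String]] = _

  // Hypothetical: fetch the "last updated" flag that the loader writes to
  // HBase or ZooKeeper whenever the lookup table is refreshed
  private def readRemoteVersion(): Long = ???

  // Compare the remote flag to the local copy; re-read only on a change
  def current(sc: SparkContext): Broadcast[Map[String, String]] = {
    val remote = readRemoteVersion()
    if (table == null || remote != localVersion) {
      if (table != null) table.unpersist()
      val m = sc.textFile("hdfs:///data/lookup.csv") // hypothetical path
        .map(_.split(','))
        .map(a => (a(0), a(1)))
        .collectAsMap().toMap
      table = sc.broadcast(m)
      localVersion = remote
    }
    table
  }
}
```

Compared with the pure timer in the first sketch, this re-broadcasts only when the table has actually changed, so a stable lookup table costs you nothing beyond one flag read per batch.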