
Spark Streaming with a dynamic lookup table


You have two options here.

The first is to use the foreachRDD output operation on your DStream. The body of foreachRDD executes on the driver side, which means you can create new RDDs there. You can keep a timestamp of the last load and re-read the file from HDFS every 10-15 minutes.
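A minimal sketch of this approach, assuming a CSV lookup file at a hypothetical HDFS path (`hdfs:///data/lookup.csv`) and a DStream of string keys; the refresh check runs on the driver once per batch, and executors only see the broadcast value:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.streaming.dstream.DStream

object TimedLookup {
  private val refreshIntervalMs = 10 * 60 * 1000L // re-read every 10 minutes
  @volatile private var lastLoad = 0L
  @volatile private var table: Broadcast[Map[String, String]] = _

  // Runs on the driver; re-reads the HDFS file once the interval has elapsed
  def current(sc: SparkContext): Broadcast[Map[String, String]] = {
    val now = System.currentTimeMillis()
    if (table == null || now - lastLoad > refreshIntervalMs) {
      if (table != null) table.unpersist()
      val m = sc.textFile("hdfs:///data/lookup.csv") // hypothetical path
        .map(_.split(','))
        .map(a => (a(0), a(1)))
        .collectAsMap().toMap
      table = sc.broadcast(m)
      lastLoad = now
    }
    table
  }

  def enrich(stream: DStream[String]): Unit =
    stream.foreachRDD { rdd =>
      // foreachRDD body executes on the driver, so the refresh check is cheap
      val lookup = current(rdd.sparkContext)
      rdd.map(k => (k, lookup.value.getOrElse(k, "unknown")))
        // hypothetical output; timestamped so each batch writes a fresh dir
        .saveAsTextFile(s"hdfs:///out/enriched-${System.currentTimeMillis()}")
    }
}
```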

The second is to read the file inside a transform operation over the DStream and keep its contents in memory. With this approach the whole lookup table ends up being re-read and shipped to the executors on every batch, which is not efficient.
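For contrast, a sketch of this second approach, assuming a keyed DStream[(String, String)] and the same hypothetical lookup path; the join forces the full lookup file to be re-read and shuffled to the executors on each batch:

```scala
import org.apache.spark.streaming.dstream.DStream

// Transform-based variant: the lookup file is loaded again for every batch
def enrichPerBatch(
    stream: DStream[(String, String)]): DStream[(String, (String, String))] =
  stream.transform { rdd =>
    val lookup = rdd.sparkContext
      .textFile("hdfs:///data/lookup.csv") // hypothetical path
      .map(_.split(','))
      .map(a => (a(0), a(1)))
    rdd.join(lookup) // full shuffle of events plus lookup rows, every batch
  }
```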

I'd recommend the first approach. To be more precise, you can store a flag somewhere recording when the data was last updated (for instance, in HBase or ZooKeeper), and keep a copy of it in your Spark application. On each batch you read the flag and compare it to the locally stored value: if they differ, you re-read the lookup table; if not, you keep working with the old one.
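A sketch of that version-check variant; readRemoteVersion() is a hypothetical stub you would implement against the HBase cell or ZooKeeper znode your loader updates, and the rest is plain Spark:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

object VersionedLookup {
  @volatile private var localVersion = -1L
  @volatile private var table: Broadcast[Map[String, String]] = _

  // Hypothetical: fetch the "last updated" flag that the loader writes to
  // HBase or ZooKeeper whenever the lookup table is refreshed
  private def readRemoteVersion(): Long = ???

  // Compare the remote flag to the local copy; re-read only on a change
  def current(sc: SparkContext): Broadcast[Map[String, String]] = {
    val remote = readRemoteVersion()
    if (table == null || remote != localVersion) {
      if (table != null) table.unpersist()
      val m = sc.textFile("hdfs:///data/lookup.csv") // hypothetical path
        .map(_.split(','))
        .map(a => (a(0), a(1)))
        .collectAsMap().toMap
      table = sc.broadcast(m)
      localVersion = remote
    }
    table
  }
}
```

Compared with the pure timer in the first sketch, this re-broadcasts only when the table has actually changed, so a stable lookup table costs you nothing beyond one flag read per batch.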