Spark Streaming: Issues when processing time > batch time


I think what may have confused you is the relationship between how long a job takes (its processing time) and how often a job is started (the batch interval).

From what you describe, with the resources available each job ends up taking about 5 minutes to complete, while your batch interval is 1 minute.

So every minute you kick off a batch that takes 5 minutes to complete.

As a result, you should expect HDFS to receive nothing for the first few minutes, and after that to receive output every minute, with each piece arriving about 5 minutes after its data went in. (This assumes the overlapping batches can actually run side by side; if they have to wait for one another, the delay keeps growing instead of staying at 5 minutes.)
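To make the timing concrete, here is a rough sketch in plain Python (not Spark code) that works out when each batch's output would show up. The function `completion_times` and its parameters are made up for illustration. The default case models the situation described above, where a new batch starts every minute regardless of the ones still running. The `overlapping=False` case is an extra assumption for comparison: it models batches that must wait for the previous one to finish.

```python
def completion_times(num_batches, batch_interval, processing_time, overlapping=True):
    """Return a list of (ready, finish) times in minutes, one pair per batch.

    Batch i is assumed to be ready for processing at (i + 1) * batch_interval.
    If overlapping is True, every batch starts as soon as it is ready.
    If overlapping is False, a batch also waits for the previous batch to finish.
    """
    timeline = []
    previous_finish = 0
    for i in range(num_batches):
        ready = (i + 1) * batch_interval
        if overlapping:
            start = ready
        else:
            start = max(ready, previous_finish)
        finish = start + processing_time
        timeline.append((ready, finish))
        previous_finish = finish
    return timeline


if __name__ == "__main__":
    # 1-minute batch interval, 5-minute processing time, as in the question.
    for label, overlapping in (("overlapping", True), ("one at a time", False)):
        print(label)
        for ready, finish in completion_times(4, 1, 5, overlapping=overlapping):
            print(f"  ready at minute {ready}, output at minute {finish}, "
                  f"delay {finish - ready} min")
```

In the overlapping case the delay stays at 5 minutes for every batch, which matches the "nothing at first, then something every minute" pattern. In the one-at-a-time case the delay grows by 4 minutes per batch (5, 9, 13, 17, ...), so the job never catches up unless processing time drops below the batch interval.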