How to ensure ordered processing of events using spark streaming?


Amazon Kinesis uses shards as the data containers of a stream, and within a single shard, records are guaranteed to be processed in the order they were put.

You can exploit this guarantee for your use case: choose your "Partition Key" values deliberately when putting records into the stream.

For example, if you are dealing with per-user events, you can use the user's ID as the partition key on the producer side, so that all events for one user land in the same shard:

  • User #1: first makes a purchase, then updates their score, then browses to page X, etc.
  • User #2: first does X, then does Y, then event Z occurs, etc.

That way, you can be sure that the events of a single user are processed in order, while still getting parallelism across different users' events (i.e., across Kinesis records in different shards).
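A minimal producer-side sketch with boto3 is below; the stream name, region, and event fields are illustrative assumptions, not anything from the question:

```python
import json

import boto3

# Stream name and region are assumptions for illustration.
STREAM_NAME = "user-events"

kinesis = boto3.client("kinesis", region_name="us-east-1")

def put_user_event(user_id, event):
    """Put one event, keyed by user ID, so that all of a user's
    events hash to the same shard and keep their relative order."""
    kinesis.put_record(
        StreamName=STREAM_NAME,
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(user_id),  # same key -> same shard -> ordered
    )

# User #1's events reach their shard in exactly this order:
put_user_event(1, {"type": "purchase", "item": "book"})
put_user_event(1, {"type": "score_update", "score": 42})
put_user_event(1, {"type": "page_view", "page": "X"})
```

Within a shard, records are delivered in sequence-number order, so per-user ordering is preserved on the read side as well (though your Spark job must still avoid reordering them across batches).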


You can also have just one partition (a single shard), and by that give up parallelism in exchange for a total ordering of all events.
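For instance, with boto3 (the stream name is again a made-up placeholder):

```python
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# A single shard gives one global sequence of records: a total
# order over all events, at the cost of throughput and parallelism.
kinesis.create_stream(StreamName="user-events-ordered", ShardCount=1)
```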

Also, in my opinion, Apache Kafka is a better choice for a scenario like this, since it offers the same per-key ordering guarantee within a partition.
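The same keyed-producer pattern in Kafka would look roughly like this with the kafka-python client; the broker address and topic name are assumptions:

```python
import json

from kafka import KafkaProducer

# Broker address and topic are placeholders for illustration.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=str.encode,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Messages with the same key go to the same partition, and Kafka
# guarantees ordering within a partition.
producer.send("user-events", key="user-1", value={"type": "purchase"})
producer.send("user-events", key="user-1", value={"type": "score_update"})
producer.flush()
```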