Apache flink on Kubernetes - Resume job if jobmanager crashes

kubernetes apache-flink high-availability flink-streaming

Out of the box, Flink requires a ZooKeeper cluster to recover from JobManager crashes. However, I think you can have a lightweight implementation of the HighAvailabilityServices, CompletedCheckpointStore, CheckpointIDCounter and SubmittedJobGraphStore which can bring you quite far.

Given that you have only one JobManager running at all times (not entirely sure whether K8s can guarantee this) and that you have a persistent storage location, you could implement a CompletedCheckpointStore which retrieves the completed checkpoints from the persistent storage system (e.g. reading all stored checkpoint files). Additionally, you would have a file which contains the current checkpoint id counter for CheckpointIDCounter and all the submitted job graphs for the SubmittedJobGraphStore. So the basic idea is to store everything on a persistent volume which is accessible by the single JobManager.

kubernetes apache-flink high-availability flink-streaming

I implemented a light version of file-based HA, based on Till's answer and Xeli's partial implementation.
You can find the code in this github repo - runs well in production.

Also wrote a blog series explaining how to run a job cluster on k8s in general and about this file-based HA implementation specifically.

kubernetes apache-flink high-availability flink-streaming

For everyone interested in this, I currently evaluate and implement a similar solution using Kubernetes ConfigMaps and a blob store (e.g. S3) to persist job metadata overlasting JobManager restarts. No need to use local storage as the solution relies on state persisted to blob store.

Github thmshmm/flink-k8s-ha

Still some work to do (persist Checkpoint state) but the basic implementation works quite nice.

If someone likes to use multiple JobManagers, Kubernetes provides an interface to do leader elections which could be leveraged for this.

CodeHunter

Apache flink on Kubernetes - Resume job if jobmanager crashes

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last