Apache Flink deployment in Kubernetes - availability & scalability
Though k8s can restart the failed Flink pod, how can Flink restore its state to recover?
From the Flink documentation:

> Checkpoints allow Flink to recover state and positions in the streams to give the application the same semantics as a failure-free execution.
This means you need checkpoint storage mounted in (or reachable from) your pods so Flink can recover its state after a restart.
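As a minimal sketch, checkpoint storage is configured in `flink-conf.yaml`. The directory `/flink-checkpoints` and the 10-second interval below are assumptions for illustration; point the directory at durable storage such as a PersistentVolume mount, S3, or HDFS:

```yaml
# flink-conf.yaml (sketch) -- persist checkpoints outside the pod's ephemeral filesystem
state.backend: filesystem
# Assumed path: mount a durable volume here, or use an s3:// or hdfs:// URI instead
state.checkpoints.dir: file:///flink-checkpoints
# Illustrative interval: take a checkpoint every 10 seconds
execution.checkpointing.interval: 10s
```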
In Kubernetes you could use Persistent Volumes to share the data across your pods.
There are actually many supported volume plugins; see the Kubernetes Persistent Volumes documentation.
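For example, here is a sketch of a PersistentVolumeClaim that Flink pods could mount at the checkpoint directory; the claim name, access mode, and size are illustrative assumptions:

```yaml
# Sketch: durable storage for Flink checkpoints (names and size are assumed)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: flink-checkpoints
spec:
  accessModes:
    - ReadWriteMany     # JobManager and TaskManagers all need access
  resources:
    requests:
      storage: 10Gi     # illustrative size
```

The pod spec would then mount this claim at the path used in `state.checkpoints.dir`.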
You can run multiple replicas of the TaskManager for scalability, but in Kubernetes you don't need to take care of HA for the JobManager yourself, since you can rely on a Kubernetes self-healing Deployment. To use one, just create a Deployment and set `replicas` to 1, like this:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - name: http
          containerPort: 80
        imagePullPolicy: IfNotPresent
```
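The same pattern applies to Flink itself. Here is a rough sketch of a JobManager Deployment based on the official `flink` image; the ports and args follow the standard session-cluster examples in the Flink docs, but verify the details against your Flink version:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flink-jobmanager
spec:
  replicas: 1                   # Kubernetes recreates this single pod on failure
  selector:
    matchLabels:
      app: flink
      component: jobmanager
  template:
    metadata:
      labels:
        app: flink
        component: jobmanager
    spec:
      containers:
      - name: jobmanager
        image: flink:latest     # pin an exact version in practice
        args: ["jobmanager"]
        ports:
        - containerPort: 6123   # RPC
        - containerPort: 8081   # web UI
```

A TaskManager Deployment looks the same with `args: ["taskmanager"]` and a higher `replicas` value for scaling.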
Finally, you can check these links to help you set up Flink in Kubernetes:
running-apache-flink-on-kubernetes