
Upgrade and Failure Domains in Kubernetes


A "domain" is a grouping of host nodes.

It is not that simple; it would be more accurate to say "A 'domain' is a logical grouping of resources".

To understand it correctly, you first have to understand most of the components in isolation. I recommend these readings first:

From those, we can take a few key points:

  • Nodes are not virtual machines; nodes run on top of Azure Virtual Machines.

    They often have a 1:1 mapping, but in some cases you can have a 5:1 node-to-VM mapping, for example when you install a local development cluster.

  • Azure Virtual Machines have Update Domains and Fault Domains; Service Fabric nodes have Upgrade Domains and Fault Domains.

    As similar as they look, they have their differences:

    Fault Domains:

    • VM Fault Domains are isolated slots of physical infrastructure: power supply, network, disks, and so on. They are scoped to a region.
    • SF Fault Domains are logical node slots for application deployment: when SF deploys an application to nodes, it distributes the instances across different fault domains to make them reliable. Most of the time an SF FD is mapped to a VM Fault Domain, but in complex scenarios you can map it to anything; for example, you could map an entire region to a single SF FD.


    Update\Upgrade Domains:

    • VM Update Domains are about OS and hardware patches and updates. Separate update domains are handled in isolation and not updated at the same time, so when an OS update requires bringing down your VMs, Azure updates them domain by domain. A lower number of update domains means more machines are brought down at once during updates.
    • SF Upgrade Domains use a similar approach to VM Update Domains, but focused on the services and the cluster upgrade itself: they bring down one UD at a time, moving to the next one only when the previous UD succeeds.
    • In both cases, you adjust the Update\Upgrade Domains to control how many instances (as a percentage) of your VMs\nodes\services can be down during upgrades. For example, if your service has 100 instances on a cluster with 5 UDs, SF updates 20 instances at a time; if you increase the number of UDs to 10, only 10 instances are down at a time, but the time to deploy your application increases in the same proportion.

Based on that, you can see FD & UD as a matrix of reliable deployment slots: the more you have, the higher the reliability (with trade-offs, like the update time required). The example below is taken from the SF docs:

[Image from the SF docs: cluster nodes laid out as a Fault Domain × Upgrade Domain matrix]

Out of the box, Service Fabric tries to place your service instances on different FD\UD on a best-effort basis: if possible they land on different FD\UD combinations; otherwise it picks the one with the fewest instances of the service being deployed.

And about Kubernetes:

In Kubernetes, these features do not come out of the box. K8s has the concept of zones, but according to the docs, zones are scoped to a region and cannot span regions.

Kubernetes will automatically spread the pods in a replication controller or service across nodes in a single-zone cluster (to reduce the impact of failures). With multiple-zone clusters, this spreading behaviour is extended across zones (to reduce the impact of zone failures). This is achieved via SelectorSpreadPriority.

This is a best-effort placement, and so if the zones in your cluster are heterogeneous (e.g. different numbers of nodes, different types of nodes, or different pod resource requirements), this might prevent equal spreading of your pods across zones. If desired, you can use homogenous zones (same number and types of nodes) to reduce the probability of unequal spreading.

It is not the same as an FD, but it is a very similar concept.
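
For reference, the zone\region information shows up as well-known labels on each Node, applied by the cloud provider integration. A minimal sketch of what that looks like; the node name and Azure region\zone values are hypothetical:

```yaml
# Sketch of the zone/region labels a cloud provider typically attaches to a Node.
# The exact label keys depend on the Kubernetes version; these are the
# well-known "failure-domain" beta labels the scheduler's spreading logic uses.
apiVersion: v1
kind: Node
metadata:
  name: node01                                            # hypothetical node name
  labels:
    failure-domain.beta.kubernetes.io/region: westeurope  # illustrative values
    failure-domain.beta.kubernetes.io/zone: westeurope-1
```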

To achieve a result similar to SF, you need to deploy your cluster across zones or map the nodes to VM FD\UD, so that they behave like nodes in SF, and add labels to the nodes to identify these domains. You would also need NodeType labels on the nodes across the different FDs, so that you can use them to deploy your pods onto a delimited set of nodes (a labeling sketch follows the example list below).

For example:

  • Node01: FD01 : NodeType=FrontEnd
  • Node02: FD02 : NodeType=FrontEnd
  • Node03: FD03 : NodeType=FrontEnd
  • Node04: FD01 : NodeType=BackEnd
  • Node05: FD02 : NodeType=BackEnd
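
A minimal sketch of what the labels on one of those nodes could look like; the `FD` and `NodeType` label keys and the node name are hypothetical names taken from the list above:

```yaml
# Hypothetical labels for Node01 from the list above; the key names are
# illustrative only. In practice you would normally apply them with:
#   kubectl label nodes node01 FD=FD01 NodeType=FrontEnd
apiVersion: v1
kind: Node
metadata:
  name: node01
  labels:
    FD: FD01
    NodeType: FrontEnd
```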

When you deploy your application, you should make use of the affinity feature to assign Pods to nodes, and in this case your service would have (see the manifest sketch after the list):

  • Required node affinity to NodeType=FrontEnd
  • Preferred anti-affinity to ContainerName=[itself]
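
A sketch of what those two rules might look like on a Deployment, assuming the hypothetical `NodeType=FrontEnd` node label from above and an `app=frontend` pod label standing in for "ContainerName=[itself]":

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
spec:
  replicas: 3
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:   # hard requirement: FrontEnd nodes only
            nodeSelectorTerms:
            - matchExpressions:
              - key: NodeType
                operator: In
                values: ["FrontEnd"]
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:  # soft rule: best-effort spreading
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: frontend                             # "itself": pods of this same Deployment
              topologyKey: kubernetes.io/hostname           # spread per node; use the FD label to spread per fault domain
      containers:
      - name: frontend
        image: nginx                                        # placeholder image
```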

With these settings, using affinity and anti-affinity, k8s will try to place replicas\instances of your container on separate nodes, and the nodes will already be separated by FD\zone and delimited by the NodeType labels; k8s will then handle the rolling updates as SF does.

Because the anti-affinity rules are preferred (soft), k8s will try to balance across these nodes on a best-effort basis; if no valid nodes are available, it will start adding more instances to nodes that already contain instances of the same container.

Conclusion

It is a bit of extra work, but not much different from what is currently done in other solutions. The major concern here is configuring the nodes into FD\zones; after you place your nodes in the right FDs, the rest works smoothly.

On SF you don't have to worry about this when you deploy a cluster on Azure, but if you do it from scratch, it is a lot of work, even more than on k8s.

NOTE: If you use AKS, it will distribute the nodes across availability sets (a set specifies VM fault domains and update domains). Currently, according to this post, AKS does not provide zone distribution for you, so you would have to do it from scratch if you need that level of distribution.


From the docs 1, 2, 3:

  • Process Health Checking: The simplest form of health-checking is just process level health checking. The Kubelet constantly asks the Docker daemon if the container process is still running, and if not, the container process is restarted. In all of the Kubernetes examples you have run so far, this health checking was actually already enabled. It’s on for every single container that runs in Kubernetes.

  • Kubernetes supports user implemented application health-checks. These checks are performed by the Kubelet to ensure that your application is operating correctly for a definition of “correctly” that you provide.

Currently, there are three types of application health checks that you can choose from:

  1. HTTP Health Checks - The Kubelet will call a web hook. If it returns between 200 and 399, it is considered success, failure otherwise. See health check examples here.
  2. Container Exec - The Kubelet will execute a command inside your container. If it exits with status 0 it will be considered a success. See health check examples here.
  3. TCP Socket - The Kubelet will attempt to open a socket to your container. If it can establish a connection, the container is considered healthy; if it can’t, it is considered a failure.

In all cases, if the Kubelet discovers a failure, the container is restarted.
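
A minimal Pod sketch showing the three probe types as liveness probes; the container names, images, paths and ports are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: health-check-demo
spec:
  containers:
  - name: web
    image: nginx                       # placeholder image
    livenessProbe:                     # 1. HTTP health check
      httpGet:
        path: /healthz
        port: 80
  - name: worker
    image: busybox                     # placeholder image
    command: ["sh", "-c", "touch /tmp/healthy && sleep 3600"]
    livenessProbe:                     # 2. Container exec check
      exec:
        command: ["cat", "/tmp/healthy"]
  - name: cache
    image: redis                       # placeholder image
    livenessProbe:                     # 3. TCP socket check
      tcpSocket:
        port: 6379
```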

    • Node health checks

If the Status of the Ready condition [of a node] is “Unknown” or “False” for longer than the pod-eviction-timeout, an argument is passed to the kube-controller-manager and all of the Pods on the node are scheduled for deletion by the Node Controller. The default eviction timeout duration is five minutes. In some cases when the node is unreachable, the apiserver is unable to communicate with the kubelet on it. The decision to delete the pods cannot be communicated to the kubelet until it re-establishes communication with the apiserver. In the meantime, the pods which are scheduled for deletion may continue to run on the partitioned node
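
If you run your own control plane, the timeout the quote refers to is the `--pod-eviction-timeout` argument of the kube-controller-manager. A hedged sketch of where it might appear in a self-managed static Pod manifest; the image tag and overall layout are placeholders:

```yaml
# Illustrative fragment of a kube-controller-manager static Pod manifest;
# --pod-eviction-timeout is the argument referenced in the quote above.
apiVersion: v1
kind: Pod
metadata:
  name: kube-controller-manager
  namespace: kube-system
spec:
  containers:
  - name: kube-controller-manager
    image: k8s.gcr.io/kube-controller-manager:v1.13.0   # placeholder version
    command:
    - kube-controller-manager
    - --pod-eviction-timeout=5m0s    # default shown; raise or lower to taste
```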

  • Application upgrades

Users expect applications to be available all the time and developers are expected to deploy new versions of them several times a day. In Kubernetes this is done with rolling updates. Rolling updates allow Deployments' update to take place with zero downtime by incrementally updating Pods instances with new ones. The new Pods will be scheduled on Nodes with available resources.
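
The Kubernetes counterpart to "how many instances can be down at a time" from the UD discussion above is the Deployment rolling update strategy. A minimal sketch, with illustrative names and values:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
spec:
  replicas: 10
  selector:
    matchLabels:
      app: frontend
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 20%   # at most 20% of replicas down at once (compare: 5 UDs)
      maxSurge: 1           # at most one extra Pod above the desired count during the rollout
  template:
    metadata:
      labels:
        app: frontend
    spec:
      containers:
      - name: frontend
        image: nginx:1.15   # changing this tag triggers a rolling update
```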


Those abstractions don't currently exist in kubernetes, though the desired behaviors can often be achieved in an automated fashion.

The meta-model for kubernetes involves agents (called Controllers and Operators) continuously watching events and configuration on the cluster and gradually reconciling cluster state with the Controller's declarative configuration. The sudden loss of a Node hosting Pods results in the IPs of the lost Pods being removed from Services, and in the ReplicationControllers for those Pods spinning up new replicas on other Nodes, while still respecting any co- and anti-scheduling constraints.

Similarly, application upgrades usually occur through changes to a Deployment, which results in new Pods being scheduled and old Pods being unscheduled in an automated, gradual manner.

Custom declarative configurations are now possible with CustomResourceDefinitions, so this model is extensible. The underlying primitives and machinery are there for someone to introduce top level declarative abstractions like FailureDomains and UpgradeDomains, managed by custom Operators.
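
Purely as an illustration of that point, a hypothetical `FailureDomain` abstraction could be declared like this; nothing of the sort ships with Kubernetes, and the group name is made up:

```yaml
# Hypothetical: a "FailureDomain" custom resource type that an Operator could
# watch and reconcile. No such resource exists in Kubernetes today.
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: failuredomains.example.com
spec:
  group: example.com          # made-up group for this sketch
  version: v1alpha1
  scope: Cluster
  names:
    kind: FailureDomain
    plural: failuredomains
    singular: failuredomain
```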

The kube ecosystem is so enormous and moving so quickly that something like this will likely emerge, and will also likely be met by competitor concepts.

Bottom line for a plant owner considering adoption is that Kubernetes is really still a toolsmith's world. There are an enormous number of tools, and a similarly enormous amount of unfinished product.