Kubernetes High Availability for Stateful Workloads

LINBIT is a company with deep roots in Linux High Availability (HA). Because of this, LINBIT has some opinions on what HA is, and how it can be achieved. Kubernetes’ approach to HA generally involves sprawling many replicas of an application across many cluster nodes, therefore making it less impactful when a single node or application instance fails. This approach is great for stateless applications, or applications that can tolerate the performance of shared storage. In contrast, IO-demanding stateful applications often do not “sprawl” well, or sometimes at all. As a result, these applications are “on their own” in terms of achieving high availability. LINSTOR’s High Availability Controller aims to provide high availability to pods in Kubernetes that cannot do so on their own.

StatefulSets, Deployments, and ReplicaSets in Kubernetes will eventually reschedule pods from failed nodes, respecting your defined replica counts. The time and user intervention it takes to do that, however, isn’t what LINBIT typically considers highly available. Pod eviction behavior differs between StatefulSets and Deployments, and between versions of Kubernetes, and honestly it’s sometimes buggy. As of this writing, Kubernetes v1.20.2 is the latest release, and applying taint tolerations to pods is the recommended way to control pod eviction. However, there are open issues on Kubernetes’ GitHub (since v1.18) reporting that the NoExecute taint is not always applied to dead nodes. That bug leaves pods stranded on dead nodes indefinitely. Prior to Kubernetes v1.18, I would set --pod-eviction-timeout on the kube-controller-manager for more aggressive pod eviction, but that flag is no longer supported. My point is, Kubernetes’ approach to HA for singleton workloads isn’t exactly straightforward.
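To make the taint-toleration approach concrete, here is a minimal sketch, not taken from the video, of a pod spec that shortens the default five-minute NoExecute tolerations so Kubernetes evicts the pod from a NotReady or unreachable node more aggressively (the pod and container names are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: example-singleton   # hypothetical name, for illustration only
spec:
  containers:
  - name: app
    image: httpd:latest
  tolerations:
  # Kubernetes normally adds these tolerations with tolerationSeconds: 300;
  # lowering them speeds up eviction once the node is actually tainted.
  - key: "node.kubernetes.io/not-ready"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 30
  - key: "node.kubernetes.io/unreachable"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 30

Even with tolerations tuned like this, eviction still depends on the node controller tainting the dead node in the first place, which is exactly the behavior those open issues call into question.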

Demonstration of LINSTOR’s HA Controller for Stateful Workloads

LINSTOR’s HA Controller aims to improve pod eviction behavior for workloads backed by LINSTOR volumes. It does this by inspecting the quorum status of the DRBD devices that LINSTOR provisions. If the replication network breaks, the active replica of the volume loses quorum, and LINSTOR’s HA Controller will move the pod to another worker that can access a replica of the volume. Here is a short video (~5min) that shows the LINBIT HA Controller in action:

As I mention in the video, the requirements for using LINSTOR’s High Availability Controller for Kubernetes are that your volumes have two or more DRBD replicas, your Kubernetes cluster has three or more workers, and that you’ve labeled your pods with linstor.csi.linbit.com/on-storage-lost: remove. If you meet those requirements, you can confidently move stateful workloads much sooner.
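Before labeling anything, it is worth sanity checking those prerequisites, and you can also inspect the DRBD quorum state that the HA Controller reacts to directly on a worker node. The commands below are only a rough sketch; exact output fields vary with your kubectl, LINSTOR, and DRBD 9 versions:

# Confirm the cluster has at least three workers
kubectl get nodes

# Confirm the volume's resource has at least two DRBD replicas
linstor resource list

# On a worker node: show the state of the DRBD resources LINSTOR provisioned;
# when quorum is enabled, the events output includes a quorum:yes/no field
drbdadm status
drbdsetup events2 --now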

In the video, I show a 5-node Kubernetes cluster (1 master, 4 workers), with the following LINSTOR StorageClasses defined:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: "linstor-csi-lvm-thin-r1"
provisioner: linstor.csi.linbit.com
parameters:
  autoPlace: "1"
  storagePool: "lvm-thin"
reclaimPolicy: Delete
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: "linstor-csi-lvm-thin-r2"
provisioner: linstor.csi.linbit.com
parameters:
  autoPlace: "2"
  storagePool: "lvm-thin"
reclaimPolicy: Delete
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: "linstor-csi-lvm-thin-r3"
provisioner: linstor.csi.linbit.com
parameters:
  autoPlace: "3"
  storagePool: "lvm-thin"
reclaimPolicy: Delete
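If you are following along, applying and checking the StorageClasses is standard kubectl; the file name below is just an example:

# Create the three StorageClasses and confirm they registered
kubectl apply -f linstor-storageclasses.yaml
kubectl get storageclass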

Then, I use the following StatefulSet definition to create a workload backed by the linstor-csi-lvm-thin-r2 StorageClass, with pods labeled for LINSTOR’s HA Controller:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: webapp
spec:
  selector:
    matchLabels:
      app: web
  serviceName: web-svc
  replicas: 1
  template:
    metadata:
      labels:
        app: web
        linstor.csi.linbit.com/on-storage-lost: remove
    spec:
      containers:
      - name: web
        image: httpd:latest
        ports:
        - containerPort: 80
          hostPort: 2080
          name: http
        volumeMounts:
        - name: www
          mountPath: /usr/local/apache2/htdocs
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: linstor-csi-lvm-thin-r2
      resources:
        requests:
          storage: 1Gi
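Creating the workload and confirming where it landed is ordinary kubectl as well (again, the file name is an example):

# Create the StatefulSet and its PVC
kubectl apply -f webapp-statefulset.yaml

# Note which worker the pod is scheduled on
kubectl get pods -o wide

# Confirm the PVC was bound by the LINSTOR CSI provisioner
kubectl get pvc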

I then “failed” the worker node, halting it with the kernel’s sysrq-trigger, and the StatefulSet-managed pod was safely evicted and rescheduled long before Kubernetes’ own pod eviction would have kicked in.
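For reference, the failure I trigger looks roughly like the following sketch. Only do this on a test node, because it stops the machine immediately, without a clean shutdown:

# On the worker node currently hosting the pod: enable sysrq, then power off
echo 1 > /proc/sys/kernel/sysrq
echo o > /proc/sysrq-trigger

# Back on a machine with kubectl: watch the pod get removed from the dead
# node and rescheduled onto a worker holding another replica
kubectl get pods -o wide --watch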

So, please check out the video above, try the controller out in your clusters, and drop any comments or questions you have on social media or in our Slack community.

Disclaimer: The software described here (LINSTOR, K8S-Operator, HA-Controller) is part of LINBIT SDS for Kubernetes. LINBIT SDS for Kubernetes is a bundle of access to pre-built container images and 24×7 enterprise-class support from the software’s creators. Alternatively, the components are also available from their upstream sources in the Piraeus DataStore project.
