With version 9, we introduced to DRBD® the capability to build clusters larger than two nodes. That gave us the means to implement the concept of quorum and eliminate the old plague of many clusters: split-brain. The concept is intriguingly simple: a partition of the cluster may modify the data only when it contains the majority of nodes.
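The basic rule fits in a few lines of Python. This is a simplified illustration of the majority idea, not DRBD's actual implementation:

```python
def has_quorum(partition_size: int, cluster_size: int) -> bool:
    """A partition may modify the data only while it contains a
    strict majority of the cluster's nodes (simplified model)."""
    return 2 * partition_size > cluster_size

# In a three-node cluster, a two-node partition has quorum, while a
# single isolated node does not -- so split-brain cannot occur.
print(has_quorum(2, 3))  # True
print(has_quorum(1, 3))  # False
```

Note that exactly half of the nodes is not a majority, which is why clusters with an even node count need extra care, as discussed below.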
The “In Real Life” Details
But then real life gets you, and you start to add things like:
- Last man standing. When a secondary node gracefully disconnects from a cluster that has a primary somewhere, it marks itself as outdated while disconnecting. A group of outdated nodes can never become quorate, no matter how many of them there are. So when the remaining cluster knows that the offline node marked itself as outdated, it does not need to consider that node when calculating the quorum status. More…
- Diskless tiebreaker nodes. In a cluster with an even number of nodes with backing storage (think: two), a partition that has precisely half of the nodes may keep quorum when it also has contact with the majority of the diskless nodes (think: one). An important detail is that a partition containing half of the nodes with storage can never gain quorum by establishing connections to diskless nodes. More…
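The diskless-tiebreaker rule can be modeled with a small helper. This is a hedged sketch, not DRBD code; the function name and signature are illustrative:

```python
def may_keep_quorum(diskful_seen: int, diskful_total: int,
                    diskless_seen: int, diskless_total: int) -> bool:
    """Sketch of the diskless-tiebreaker rule (illustrative model):
    a partition with a majority of the storage nodes has quorum
    outright; a partition with exactly half of them may only *keep*
    quorum when it also sees a majority of the diskless nodes.
    A partition that already lost quorum cannot regain it this way,
    which is why this answers "may keep", not "gains"."""
    if 2 * diskful_seen > diskful_total:
        return True
    if diskful_seen and 2 * diskful_seen == diskful_total:
        return 2 * diskless_seen > diskless_total
    return False

# Two storage nodes plus one diskless tiebreaker: the partition
# holding one storage node and the tiebreaker may keep quorum ...
print(may_keep_quorum(1, 2, 1, 1))  # True
# ... while the storage node that lost the tiebreaker may not.
print(may_keep_quorum(1, 2, 0, 1))  # False
```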
Soon, DRBD coauthor Lars pointed out that we were missing an opportunity to keep quorum. In a three-node cluster, where two nodes have backing storage and one node is diskless, the diskless node can help maintain quorum when one of the diskful nodes goes down. Quorum can even be maintained if, some time later, the diskless node goes down and only one diskful node remains. However, if one of the diskful nodes loses both connections concurrently, it must lose quorum.
DRBD discovers broken network connections one by one. There is no concept of losing peers concurrently. Adding such a concept would have been wrong because there is no answer to the question: what is the right time frame within which connection losses count as concurrent?
After some time, I realized that the correct algorithm is to look at the cluster’s membership. When a network connection breaks, the node needs to determine which nodes are the current cluster members and remember that. When one partition has quorum, it knows that the non-members cannot have quorum. Consequently, it can keep quorum even if it loses the tiebreaker node at a later point in time.
Consider a cluster where A and C have backing storage and B is the diskless tiebreaker:
- Connection B – C goes down, all nodes keep quorum.
- Connection A – C goes down, A and B keep quorum due to tiebreaker logic. A and B notice that C cannot have quorum and remember that.
- Connection A – B goes down. A knows that C is without quorum, therefore A keeps quorum.
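Assuming A and C hold the backing storage and B is the diskless tiebreaker, the bookkeeping in these three steps can be traced in a toy Python model. The node names and the helper function are illustrative; this is not DRBD's implementation:

```python
# Toy walk-through of the membership idea from node A's perspective.
# Assumed topology: A and C have backing storage, B is the diskless
# tiebreaker.

diskful = {"A", "C"}
diskless = {"B"}

def partition_has_quorum(partition, known_quorumless=frozenset()):
    """A diskful majority keeps quorum outright; exactly half of the
    diskful nodes keeps quorum together with a majority of the
    diskless nodes. Nodes already known to be without quorum are
    ignored entirely when counting."""
    relevant_diskful = diskful - known_quorumless
    relevant_diskless = diskless - known_quorumless
    df = len(partition & relevant_diskful)
    if 2 * df > len(relevant_diskful):
        return True
    if df and 2 * df == len(relevant_diskful):
        return 2 * len(partition & relevant_diskless) > len(relevant_diskless)
    return False

known_quorumless = set()

# Step 1: connection B - C breaks; A still reaches both peers, so
# A's partition is {A, B, C} and everyone keeps quorum.
assert partition_has_quorum({"A", "B", "C"})

# Step 2: connection A - C breaks; A's partition shrinks to {A, B}.
# It keeps quorum via the tiebreaker and records that C cannot.
assert partition_has_quorum({"A", "B"})
known_quorumless.add("C")

# Step 3: connection A - B breaks; A is alone, but because C is known
# to be without quorum, A is the majority of what remains.
assert partition_has_quorum({"A"}, frozenset(known_quorumless))

# Without the remembered membership, a lone A would have lost quorum:
assert not partition_has_quorum({"A"})
```

The last two lines show the difference the improvement makes: the same one-node partition keeps quorum only because it remembered that C had already lost it.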
Is This Special Case for Keeping Quorum Relevant?
Consider a LINSTOR® installation combined with a workload manager, such as Kubernetes, OpenNebula, CloudStack, XCP-ng, OpenStack, or Proxmox, with more than three nodes on the LINSTOR level. One of the many DRBD clusters that LINSTOR creates from a given storage resource definition might consist of two nodes with storage and a diskless node. The latter gets automatically added by LINSTOR as a tiebreaker node. When the node running a virtual machine (VM) or container fails, the workload manager might want to restart the VM or container on some node that is not yet part of that tiny DRBD cluster. From DRBD’s point of view:
- At first, A, the primary node, crashes.
- The workload orchestrator wants to access the data on a node that is not part of the DRBD cluster. So LINSTOR adds this node as a new diskless node D. LINSTOR realizes that D now fulfills the tiebreaker function for the DRBD cluster. The tiebreaker node B, which LINSTOR added earlier on its own, thereby becomes redundant, and LINSTOR removes it at this point. DRBD keeps quorum. With an older DRBD release, the partition would have lost quorum in this step.
- The workload orchestrator promotes the recently added diskless node D by starting the VM or container there.
From this release on, you can inspect DRBD’s view of the membership in debugfs: `cat /sys/kernel/debug/drbd/resources/<resource>/members`. The value is a node mask: each set bit means that the node with that node ID is a member of this partition.
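A small helper can decode such a mask, assuming its value has already been read from the file and parsed into an integer (the helper name is illustrative):

```python
def nodes_in_mask(mask: int) -> list[int]:
    """Return the node IDs whose bits are set in a DRBD members
    mask: bit N set means node ID N is a member of this partition."""
    return [node_id for node_id in range(mask.bit_length())
            if mask & (1 << node_id)]

# A mask of 0b101 means that the nodes with IDs 0 and 2
# are members of this partition.
print(nodes_in_mask(0b101))  # [0, 2]
```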
Availability of the Quorum Improvement
DRBD 9.1.13 and DRBD 9.2.2 ship with this improvement to the quorum implementation.