With many years of experience in high availability (HA), we have noticed that the typical Pacemaker stack used in HA scenarios introduces considerable complexity. We deal with misconfigured Pacemaker setups in many of our customer tickets. This is neither our customers' fault nor Pacemaker's. HA is a complex topic, and Pacemaker is a very flexible tool. With this flexibility, and with the whole Pacemaker stack being split into a seemingly endless number of components (corosync/knet, lrmd, pengine, stonithd, crmd, cib, crm shell, pcs…), the overall complexity can be overwhelming.
The typical HA stack is also hard to build, and it takes days or even weeks until the complete stack builds and is tested on all the distributions and architectures we support.
So our goals are as follows:
– reduce complexity, keep things simple.
– shift responsibility to well-known and tested components, don’t reinvent the wheel.
– keep configuration simple.
– keep the number of components and, therefore, the interaction between them low.
– finally, make it easy to build.
I want to stress that Pacemaker has been an excellent component of the Linux HA stack for years and is more flexible than DRBD Reactor and its promoter plugin. So I don’t see it as a rival at all. We just think there are scenarios (i.e., the 99%) where the benefits of a simple DRBD Reactor setup predominate.
DRBD Reactor is completely tied to HA storage that uses DRBD®. So it is not as generic as Pacemaker in that sense. However, the advantage is that DRBD Reactor can use DRBD states (quorum and “may promote”) as its source for decisions. And actually, these two pieces of information are the only sources needed by DRBD Reactor:
– if the DRBD resource can be promoted (i.e., not active anywhere else and has quorum), then promote the DRBD resource and start the user-defined services.
– if quorum is lost, stop the services, demote the DRBD resource, and give another node the chance to promote the resource and start services. If configured and demotion fails, execute some action (poweroff node, reboot node…).
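Both of these inputs are visible in the DRBD event stream itself. As a rough illustration (assuming a DRBD 9 resource named `test` already exists on the node; the exact output fields depend on your DRBD version), you can inspect the same state DRBD Reactor reacts to:

```
# Illustrative only: dump the current DRBD state in the same form the
# event stream reports it, and pick out the two fields that matter here.
# "may_promote:yes" means the resource is not Primary anywhere else,
# "quorum:yes" means this node is allowed to keep the data writable.
drbdsetup events2 --now test | grep -E 'may_promote|quorum'
```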
The first consequence of this design is that DRBD Reactor does not need any further cluster communication besides listening to the aforementioned DRBD events. There is no shared ‘cib’ cluster configuration that needs synchronization as we know it from the Pacemaker stack. The disadvantage is that configuration files need to be copied to all nodes, but overall, the benefits of keeping the daemon simple outweigh this. Usually, higher-level tools such as linstor-gateway are used to distribute the configuration in the cluster.
If one or multiple nodes detect that they can promote the DRBD resource, they all race to promote the resource to Primary. Only one will be able to do that; the other nodes back off. As an implementation detail: Before a node actually tries to become Primary it might sleep a bit depending on its local resource state. With that, we allow a node that has a good state to win the race for DRBD promotion.
Using DRBD quorum helps us keep the promoter plugin simple by reusing a component that already implements quorum (i.e., DRBD). So we know when we should promote a DRBD resource and start services that depend on it, but how do we start services? Again, we want to keep things as simple as possible. There is already a standard system component that allows us to start and stop services and even group services into “targets”: `systemd`. I would even go as far as to think of the promoter plugin as a very elaborate `systemd` template generator. We will not go into all the details of `systemd` template generation. The technically interested user is referred to this in-depth documentation, but a high-level overview will help understand how the promoter plugin works. Let’s assume we want to start the services `a.service`, `b.service`, and `c.service`.
– the plugin generates an override for the `drbd-promote@.service` that is shipped by `drbd-utils`. This makes sure that the resource gets promoted to `Primary` and acts as a dependency for all the other services.
– for every service in the list the plugin creates an override (e.g., `/var/run/systemd/system/a.service.d/reactor.conf`) that specifies a dependency on the `drbd-promote@.service` as well as a dependency on the service preceding it. So `b.service` will depend on `a.service`, and `c.service` will depend on `b.service`.
– all the services are grouped into a `drbd-resources@.target` that acts as a handle for the promoter. After all these service overrides are generated, the plugin starts (or stops) the `drbd-resources@.target` unit, and `systemd` does what it is good at: starting services.
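To make the steps above concrete, the generated drop-ins might look roughly like the following sketch for a DRBD resource named `test`. This is a simplified illustration, not the plugin's exact output; the real generator emits more directives:

```ini
# /var/run/systemd/system/a.service.d/reactor.conf (sketch)
[Unit]
# stop with the target, and only run while the DRBD resource is Primary
PartOf=drbd-resources@test.target
Requires=drbd-promote@test.service
After=drbd-promote@test.service

# /var/run/systemd/system/b.service.d/reactor.conf (sketch)
[Unit]
PartOf=drbd-resources@test.target
Requires=drbd-promote@test.service
# chain the services: b.service starts only after a.service
After=drbd-promote@test.service a.service
```

With overrides like these in place, a single `systemctl start drbd-resources@test.target` cascades through the whole chain, and `systemctl stop` tears it down in reverse order.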
This once more makes use of an existing, widely used, and well-tested component for service management, namely `systemd`. And by using it, we also get all of the power `systemd` provides for free, like reliable `OnFailure` actions.
The last widely used component for clustering I want to mention is OCF resource agents. The promoter plugin can use them via a little shim that we ship in `drbd-utils`, namely `ocf.ra@.service`. You can find a more detailed overview here.
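For illustration, an OCF resource agent can be listed directly in a promoter start list, and drbd-reactor runs it through that shim. The resource name (`web`), agent, instance name, and parameters below are made-up examples; check the documentation for the exact entry syntax supported by your drbd-reactor version:

```toml
# Sketch: a virtual IP via the heartbeat IPaddr2 OCF agent, started
# before an ordinary systemd service, for a DRBD resource named "web".
[[promoter]]
[promoter.resources.web]
start = [
  "ocf:heartbeat:IPaddr2 vip ip=192.168.0.100 cidr_netmask=24",
  "a.service",
]
```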
There is much more to know about the promoter plugin, like which failure scenarios are actually covered, but what we covered here is enough for a motivational blog post. The interested reader is referred to the documentation.
After all the theory, let’s look at a straightforward example: A highly-available file system mount. Of course, we use LINSTOR to create the DRBD resources to keep things simple, but that is not a strict requirement.
The first step is to create a DRBD resource, which, in this case, is three times redundant.
$ linstor resource-group create --place-count 3 promoter
$ linstor resource-group drbd-options promoter --auto-promote no
$ linstor resource-group drbd-options promoter --quorum majority
$ linstor resource-group drbd-options promoter --on-no-quorum io-error
$ linstor volume-group create promoter
$ linstor resource-group spawn promoter test 20M
Then we want to create a file system we can mount:
$ drbdadm primary test
$ mkfs.ext4 /dev/drbd1000
$ drbdadm secondary test
On all nodes that should be able to mount the file system, we create a mount unit:
$ cat << EOF > /etc/systemd/system/mnt-test.mount
[Unit]
Description=Mount /dev/drbd1000 to /mnt/test
[Mount]
What=/dev/drbd1000
Where=/mnt/test
Type=ext4
EOF
Then on all nodes, we also need to create a configuration for the promoter plugin:
$ cat << EOF > /etc/drbd-reactor.d/mnt-test.toml
[[promoter]]
id = "mnt-test"
[promoter.resources.test]
start = ["mnt-test.mount"]
on-drbd-demote-failure = "reboot"
EOF
Let’s do a quick recap of what we get out of that configuration snippet: the promoter plugin watches the DRBD resource `test`; as soon as the resource may be promoted, the node promotes it and starts `mnt-test.mount`; and if stopping the services and demoting the resource ever fails, the node reboots so that another node can take over.
Last but not least, we need to start (or restart/reload) DRBD Reactor on all nodes via `systemctl start drbd-reactor.service`.
Then we can check which node is Primary and has the device mounted:
$ drbd-reactorctl status mnt-test
On the node that is Primary, we can do a switch-over, just for testing. A later version of `drbd-reactorctl` might have a dedicated command for that:
$ drbd-reactorctl disable --now mnt-test
$ # another node should be primary now and have the FS mounted
$ drbd-reactorctl enable mnt-test # to re-enable the config again
You can also test a failure scenario by keeping a file open on the mounted device if you want to. Connect to the node that is Primary and execute the following commands. This action should trigger a reboot, and another node should take over the mount.
$ touch /mnt/test/lock
$ sleep 3600 < /mnt/test/lock &
$ # ^^ this creates an opener and the mount unit will be unable to stop
$ # and the DRBD device will be unable to demote
$ systemctl restart drbd-resources@test.target # trigger a stop/restart of the target
Let’s look back at the goals stated at the beginning of this blog post. Compared to a typical Pacemaker stack, we certainly reduced complexity (at the price of flexibility). Besides a small amount of code, we delegate functionality to well-tested and widely used software components (e.g., DRBD for quorum, `systemd` for service management). DRBD Reactor is a simple, single binary daemon, so we also keep the number of components low, and being implemented in Rust, it is trivial to build for multiple architectures.
We use DRBD Reactor and the promoter plugin for our in-house infrastructure (LINSTOR + OpenNebula) to provide a highly-available LINSTOR controller, which is the configuration we also suggest to our customers. Further, it is used in linstor-gateway and LINBIT VSAN SDS.