This blog post will show how DRBD’s recently added support for
secondary --force can be of great use when combined with DRBD Reactor. In short, secondary --force reconfigures the DRBD device so that suspended I/O requests and newly submitted I/O requests terminate with I/O errors.
Imagine a three-node setup where your current active node loses the connection to the other nodes. This is detected by DRBD’s quorum feature, and
drbd-reactor::promoter can act, but so far its choices have been limited, and the results could be undesirable. Let’s assume the service was a simple HA file system mount, and that the mount was actively in use when quorum was lost. The promoter plugin for that resource would detect that quorum was lost, try to stop the services, and then demote the DRBD device. The service would have been a
systemd mount unit in our simple case. As there are still users of the file system, the
umount command would have failed, and later in the chain, demoting the DRBD device would have failed as well. So how do you recover a cluster from such a situation? Usually via some failure action: for every resource managed by the promoter plugin, one can set a systemd failure action such as “reboot” via the
on-drbd-demote-failure configuration option. Rebooting the node can be a good recovery strategy, but what if you have hundreds of DRBD resources active on that node? Is killing
n-1 important resources worth it for the one unimportant resource that was blocked?
The implementation is relatively simple: wherever we previously called
drbdsetup secondary, we now first call
drbdsetup secondary and then
drbdsetup secondary --force. Keeping the first call has the advantage that the logs are easier to read: as an admin, one sees that the usual
drbdsetup secondary failed before
secondary --force was called. Using
secondary --force is the new default; one must set
secondary-force = false in the promoter plugin’s configuration to disable it.
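Conceptually, the demote sequence described above can be sketched as a small shell function. This is a simplified illustration of the behavior, not drbd-reactor’s actual code, and the resource name is a placeholder:

```shell
# Sketch of the promoter's demote step; an illustration only,
# not drbd-reactor's actual implementation.
demote() {
    res="$1"
    # Try the regular demote first. Keeping this call makes the logs
    # easier to read: one sees that the plain demote failed before
    # the forced one was attempted.
    if drbdsetup secondary "$res"; then
        return 0
    fi
    # The regular demote failed (e.g., the device is still held open).
    # Force it: suspended and newly submitted I/O requests will now
    # terminate with I/O errors.
    drbdsetup secondary --force "$res"
}
```

With the default of secondary-force = true, this is effectively what the promoter does; setting secondary-force = false corresponds to skipping the second, forced call.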
We start with the usual example of a systemd mount unit as a service. As the rest of the example will show, using a mount unit in this scenario isn’t actually a good choice, but let’s build up our knowledge step by step. The configuration might look like this:
[[promoter]]
id = "mnt-test"

[promoter.resources.test]
start = ["mnt-test.mount"]
on-drbd-demote-failure = "reboot"
secondary-force = false
Note that we intentionally set
secondary-force = false to simulate the old behavior. If we then open the device, in our case with a read-only opener via
sleep 3600 </dev/drbd1000 on the active node and then isolate the node via
for i in INPUT OUTPUT; do iptables -A $i -p tcp --dport 7000 -j REJECT; done, we will see that the node reboots. That is expected, but our imaginary hundred other resources are affected as well.
So, what if we now remove the
secondary-force setting, reboot, and try the same again? We see that
secondary --force was executed, that the device was reconfigured to return I/O errors, that it became Secondary, and that the node did not reboot. Great, we even see that another node, one of those still having quorum, started the service. Unfortunately, our original DRBD
Primary node will never reintegrate (well, maybe after the
3600 seconds have elapsed). Why is that? Usually, it is fine to use systemd mount units if there is some service on top that uses the mount unit and systemd is aware of the dependency. A typical case would be a highly available LINSTOR controller, which would have a start list of
start = ["var-lib-linstor.mount", "linstor-controller.service"]. Then, on quorum loss, systemd would have made sure that the controller service got stopped (or killed), and the device would have been unmounted because all of its users had been stopped.
But what can we do if the mount point is the final service? Then we can have all kinds of users that systemd does not know about and can’t do anything about. That is our read-only sleeper: it idles around and blocks the device from being unmounted. Note that if the process did any I/O (or had any I/O pending), it would receive I/O errors and hopefully terminate. So, for our edge case where the device’s opener just idles, the answer is to not use a systemd mount unit as the top-level service, but an OCF file system resource agent instead. This agent has all kinds of tricks built in to ensure that all users of a file system mount get found and terminated. In our scenario, we would use a start list like this:
start = ["ocf:heartbeat:Filesystem fs_test device=/dev/drbd1000 directory=/mnt/test fstype=ext4 run_fsck=no"]
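Putting it together, the full promoter configuration for the OCF-based variant might then look like this (a sketch reusing the resource and IDs from the earlier example; secondary-force is left at its default of true, and the reboot failure action is kept only as a last resort):

```toml
[[promoter]]
id = "mnt-test"

[promoter.resources.test]
start = ["ocf:heartbeat:Filesystem fs_test device=/dev/drbd1000 directory=/mnt/test fstype=ext4 run_fsck=no"]
on-drbd-demote-failure = "reboot"
```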
Using this, and again blocking all traffic via
iptables, we see that the node that lost quorum cleanly demotes, does not reboot, and that another node takes over the service. After an
iptables -F the node is ready to be a target for the next fail-over or switch-over.
In this blog post, we saw two things. First, DRBD’s secondary-force feature is a significant improvement that handles isolated DRBD Primaries more gracefully. It can make the difference between rebooting a node with hundreds of essential resources and just demoting a single DRBD device. Second, there can still be edge cases where secondary-force by itself does not solve all problems. It configures the device to return I/O errors in the hope that users of the device terminate when they receive them, but that does not help if, as in our example, a read-only opener idles around forever. We combined such an idle opener with a file system mount and saw that we need a component that ensures these openers vanish. In our example, we replaced the systemd mount unit with an OCF file system resource agent.