Besides keeping your applications and services highly available, a lesser known benefit of using DRBD® is that it can help minimize downtime when you are doing system maintenance. Another blog post gives some background and an example workflow for this use case. For simplicity, that blog post only provides an overview of the technical instructions. This blog post examines the use case in more detail and provides specific cluster resource manager (CRM) instructions: first for Pacemaker, then for DRBD Reactor.
In the example use case from the other blog post, you are a system administrator who needs to upgrade RAM and update a system board’s firmware on two nodes. An overview of the workflow is as follows:
- First, put your CRM into maintenance mode. This is similar to a pause button. You are telling the CRM not to do anything that it might ordinarily do when systems go down.
- Next, power off node2 (the secondary server).
- You can now install any hardware that you need to, for example, new RAM.
- Power the system up and enter the BIOS or UEFI and verify that the system recognizes the new RAM.
- You can now update the system board firmware, which might require a reboot. Boot node2 into its operating system. At this point you can install any operating system or software updates and reboot, if you need to.
- Once node2 is back up and you have verified that everything is in good condition, including that your DRBD resource or resources are in an up-to-date state, bring the CRM out of maintenance mode.
- Next, migrate services to node2. Depending on how you have configured things, this might cause a few seconds of service unavailability.
- Verify that the services now running on node2 are working as expected.
- Repeat steps 1-4 for node1.
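One of the steps above is verifying that your DRBD resources are in an up-to-date state before leaving maintenance mode. If you script your maintenance windows, you could automate that check by parsing drbdadm status output. A minimal sketch, assuming the standard drbdadm status output format (the helper function name is made up for illustration):

```shell
# Succeed only if every disk state line in the `drbdadm status` text
# read from standard input reports UpToDate. The pattern matches both
# local "disk:" and remote "peer-disk:" lines.
drbd_up_to_date() {
    ! grep -E 'disk:' | grep -qv 'UpToDate'
}

# Usage on a cluster node (assumes DRBD is installed):
#   drbdadm status r0 | drbd_up_to_date && echo "safe to leave maintenance mode"
```

Note that this sketch treats "no disk lines at all" as success, so pair it with a check that the resource actually exists before relying on it.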
Proceeding With Caution
This article takes a cautious approach towards doing system maintenance and describes putting your CRM into a “maintenance mode” first. When the CRM is in maintenance mode it will not start, stop, or monitor services. However, services do continue to run on the current primary node while a CRM is in maintenance mode. Maintenance tasks such as updating software packages on a running system might have an effect that could trigger a CRM monitoring event. If the CRM were not in maintenance mode, such an event could cause the CRM to react, which in turn could disrupt your cluster resources, applications, or services, and affect users.
A tradeoff when taking this approach to system maintenance is that should a network or other event happen to your cluster that would ordinarily trigger a failover, that automatic failover will not happen while the CRM is in maintenance mode.
It is certainly possible to do system maintenance tasks in a production cluster without putting the CRM into maintenance mode. To take this approach, you would typically want to do maintenance on your secondary node or nodes first, then migrate services from the primary node to a secondary node, and then do maintenance on the node that was formerly primary. An advantage to doing things this way, for example, in a three-node Pacemaker or DRBD Reactor cluster, is that if there is a problem that causes a failover event, failover will happen automatically. The risk is that a maintenance task that might trigger a failover event will disrupt user experience during the failover time, that is, the time that it takes the CRM to stop services on one node and start them on another. Another risk is that if a failover event happens and the CRM tries to fail services over to a secondary node that you are doing maintenance tasks on, the consequences could be unpredictable. To prevent this, if you do system maintenance without first putting the CRM into maintenance mode, you should at least temporarily disable (when using DRBD Reactor) or put into standby (when using Pacemaker) a secondary node, before doing maintenance tasks on that node. By doing this, you would prevent the CRM from failing services over to that node.
Instructions for Using Pacemaker When Doing System Maintenance
Here are the pcs command line shell instructions for when you are using Pacemaker as your CRM.
Putting a Pacemaker Cluster Into Maintenance Mode
To put your Pacemaker-managed cluster into maintenance mode, enter the following command:
# pcs property set maintenance-mode=true
This command tells Pacemaker to stop managing resources and services, that is, to stop starting, stopping, or monitoring them. However, while in maintenance mode, resources and services will continue to be active on the current node.
You can verify that your cluster is in maintenance mode by entering a pcs status command. Output of the command should be similar to this:
Cluster name: linbit-cluster
*** Resource management is DISABLED ***
The cluster will not attempt to start, stop or recover services
* Online: [ node1 node2 ]
Full List of Resources:
* Resource Group: g_nfs (unmanaged):
* p_fs_drbd (ocf:heartbeat:Filesystem): Started node2 (unmanaged)
* p_nfsserver (ocf:heartbeat:nfsserver): Started node2 (unmanaged)
* p_exportfs_root (ocf:heartbeat:exportfs): Started node2 (unmanaged)
* p_vip_ip (ocf:heartbeat:IPaddr2): Started node2 (unmanaged)
* Clone Set: ms_drbd_r0 [p_drbd_r0] (promotable, unmanaged):
* p_drbd_r0 (ocf:linbit:drbd): Promoted node2 (unmanaged)
* p_drbd_r0 (ocf:linbit:drbd): Unpromoted node1 (unmanaged)
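If you script your maintenance windows, you might want to check for maintenance mode programmatically rather than reading the status output by eye. One lightweight way, assuming the English-language pcs status output shown above, is to grep the status text (this helper function is an illustration, not part of pcs):

```shell
# Succeed when the supplied `pcs status` output indicates that
# resource management is disabled, that is, maintenance mode is on.
in_maintenance_mode() {
    grep -q 'Resource management is DISABLED'
}

# Usage on a cluster node:
#   pcs status | in_maintenance_mode && echo "cluster is in maintenance mode"
```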
Bringing a Pacemaker Cluster Out of Maintenance Mode
After doing your system maintenance on the secondary node, node2, and verifying that things are good on that node, you can bring the cluster out of maintenance mode. To do that, enter the following command:
# pcs property set maintenance-mode=false
Bringing the CRM out of maintenance mode will again allow Pacemaker to manage services and resources. You can enter another pcs status command to verify that Pacemaker is no longer in maintenance mode.
Migrating Services in a Pacemaker Cluster
Before you can do system maintenance tasks on the primary node, node1, which is currently hosting services and resources, you need to migrate these to node2. To do this using pcs, enter the following command:
# pcs node standby node1
When you put node1, the cluster node that currently hosts resources, into standby mode, Pacemaker will automatically migrate all resources to another node. In the case of the two-node cluster in this example, Pacemaker will migrate the resources to node2. You can verify this by entering a pcs status command. Output should show something similar to the following:
* Node node1: standby
* Online: [ node2 ]
Full List of Resources:
* Resource Group: g_nfs:
* p_fs_drbd (ocf:heartbeat:Filesystem): Started node2
* p_nfsserver (ocf:heartbeat:nfsserver): Started node2
* p_exportfs_root (ocf:heartbeat:exportfs): Started node2
* p_vip_ip (ocf:heartbeat:IPaddr2): Started node2
* Clone Set: ms_drbd_r0 [p_drbd_r0] (promotable):
* Promoted: [ node2 ]
* Stopped: [ node1 ]
At this point, you should further verify that your applications and services are running and reachable on the expected node, node2 in this example.
Doing System Maintenance On the Next Node When Using Pacemaker
After verifying that your applications and services are running on node2, you can put the cluster into maintenance mode again and repeat the previous instructions to do system maintenance tasks on node1. After doing system maintenance tasks on node1, you can take the cluster out of maintenance mode by entering the pcs command shown above. Finally, you can take node1 out of standby mode by entering the following command:
# pcs node unstandby node1
If you enter another pcs status command, you can verify that the output no longer shows node1 in standby mode.
Instructions for Using DRBD Reactor When Doing System Maintenance
DRBD Reactor is a daemon that monitors DRBD events and reacts to them. It has plugins for various use cases, such as monitoring DRBD resources, making metrics available, and creating failover clusters to provide highly available services. If you want more background, you can read about the high availability (HA) use case in this blog post.
For an example scenario, assume a three-node DRBD Reactor cluster where all nodes have direct attached storage and are all capable of running managed services. Two of the cluster nodes, reactornfs-0 and reactornfs-1, appear by hostname in the examples that follow.
In this setup, DRBD Reactor and its promoter plugin manage two file system mounts, an NFS server instance with one NFS export, a virtual IP address (192.168.222.199) service, and a port block service. You can find the full details of this HA NFS setup in a how-to technical guide available on the LINBIT® website.
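For context, a promoter plugin configuration for a setup such as this one lives in a TOML file under /etc/drbd-reactor.d/. The sketch below shows only the general shape of such a file; the start list reuses the service unit names that appear in the status output later in this post, but it is an illustrative placeholder, not the exact configuration from the technical guide:

```toml
# /etc/drbd-reactor.d/nfs.toml (illustrative sketch)
[[promoter]]
[promoter.resources.nfs]
# systemd units started, in order, on the node that DRBD promotes:
start = [
    "ocf.rs@portblock_nfs.service",
    "ocf.rs@fs_cluster_private_nfs.service",
    "ocf.rs@fs_1_nfs.service",
    "ocf.rs@nfsserver_nfs.service",
    "ocf.rs@export_1_0_nfs.service",
    "ocf.rs@service_ip_nfs.service",
    "ocf.rs@portunblock_nfs.service",
]
```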
📝 NOTE: Because DRBD Reactor relies on the DRBD quorum feature, you need at least a three-node cluster to use it. However, a third node does not need to have additional direct attached storage. The third node can run disklessly and does not need to be able to run applications or services. It only needs to act as a tie-breaker node for DRBD quorum purposes. In that sense, for cost-saving reasons, you can use a relatively underpowered host such as a Raspberry Pi, a cloud instance, or a VM with minimal resources as your third node.
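The quorum behavior that DRBD Reactor relies on is configured in the DRBD resource file. A hedged sketch of the relevant options section follows; the option names are real DRBD 9 settings, while the resource name is illustrative and the surrounding definition is abbreviated:

```
resource "nfs" {
    options {
        # Require a majority of nodes to be connected before the
        # resource may be promoted or continue I/O:
        quorum majority;
        # Error out I/O when quorum is lost, so that a partitioned
        # Primary cannot diverge from the rest of the cluster:
        on-no-quorum io-error;
    }
    # ... volume, node, and connection sections omitted ...
}
```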
Similar to the previous scenario that used Pacemaker, assume that you also need to upgrade memory and the system board’s firmware in each of your DRBD Reactor cluster nodes.
Preparing a DRBD Reactor Cluster for System Maintenance
The DRBD Reactor promoter plugin is the plugin that allows DRBD Reactor to start and stop services on a node. When used in a cluster of nodes, the promoter plugin lets you use DRBD Reactor as a simplified CRM. Working as a simplified CRM, DRBD Reactor does not have the equivalent of the Pacemaker maintenance mode.
Before doing system maintenance, you need to take note of which node currently hosts services. Then you will need to disable the promoter plugin configuration that is managing your HA NFS services.
Verifying Which Node Is Currently Hosting Services
Before disabling the HA NFS promoter plugin across your cluster, enter a drbd-reactorctl status command on any of your cluster nodes to determine which node is actively running the HA NFS services.
For example, when entered on the reactornfs-1 node, output from a status command might show that the services are running on the reactornfs-0 node:
Promoter: Currently active on node 'reactornfs-0'
○ [email protected]
× ├─ [email protected]
○ ├─ ocf.rs@portblock_nfs.service
○ ├─ ocf.rs@fs_cluster_private_nfs.service
○ ├─ ocf.rs@fs_1_nfs.service
○ ├─ ocf.rs@nfsserver_nfs.service
○ ├─ ocf.rs@export_1_0_nfs.service
○ ├─ ocf.rs@service_ip_nfs.service
○ └─ ocf.rs@portunblock_nfs.service
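When scripting around DRBD Reactor, you can extract the active node's hostname from this status output rather than reading it by eye. A small sketch, assuming the output format shown above (the helper function name is made up):

```shell
# Print the hostname named in a "Currently active on node '...'" line
# of `drbd-reactorctl status` output read from standard input.
active_node() {
    sed -n "s/.*Currently active on node '\([^']*\)'.*/\1/p"
}

# Usage on a cluster node:
#   drbd-reactorctl status nfs | active_node
```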
Disabling the Promoter Plugin in Your Cluster
Before doing system maintenance, you will need to disable the promoter plugin that is managing your HA NFS services in your cluster. To do this, assuming your promoter plugin configuration file exists at /etc/drbd-reactor.d/nfs.toml, enter the following command on each of your cluster nodes:
# drbd-reactorctl disable nfs
After making this change, DRBD Reactor will no longer be managing the HA NFS services. However, the services will still be running on the active node. This gives you your maintenance window.
You can verify that your virtual IP address, 192.168.222.199 in this example, is still serving the NFS mount by entering a showmount -e 192.168.222.199 command on any host on the 192.168.222.0/24 network. Output should be similar to this:
Export list for 192.168.222.199:
Doing System Maintenance Tasks on Your Cluster Nodes
You can now follow steps 2 and 3 from the list of maintenance steps in the example overview at the beginning of this blog post, for the nodes that are not currently hosting HA NFS services. In this example, those are the nodes other than reactornfs-0.
Enabling the Promoter Plugin in Your Cluster
After doing your system maintenance tasks on the two inactive nodes, and verifying that things are good, you can enable the promoter plugin in your cluster. Enter the following command on each cluster node to do this:
# drbd-reactorctl enable nfs
You can verify that DRBD Reactor and the promoter plugin can again stop and start resources in your cluster by entering the following command on each of your cluster nodes:
# drbd-reactorctl status nfs
Output should show something similar to the following on your active node, reactornfs-0 in this example:
Promoter: Currently active on this node
● [email protected]
● ├─ [email protected]
● ├─ ocf.rs@portblock_nfs.service
● ├─ ocf.rs@fs_cluster_private_nfs.service
● ├─ ocf.rs@fs_1_nfs.service
● ├─ ocf.rs@nfsserver_nfs.service
● ├─ ocf.rs@export_1_0_nfs.service
● ├─ ocf.rs@service_ip_nfs.service
● └─ ocf.rs@portunblock_nfs.service
On the two inactive nodes, the command output should be similar to the output shown for the reactornfs-1 node in the previous “Verifying Which Node Is Currently Hosting Services” section.
Migrating Resources From the Active Node to a Secondary Node
Next, to do system maintenance tasks on reactornfs-0, the node that currently hosts the HA NFS services, you need to migrate the services to another node on which you have already done system maintenance tasks. To do this, enter the following command on reactornfs-0:
# drbd-reactorctl evict nfs
Output from the command should show that the nfs promoter plugin managed services have migrated to another node:
Created symlink /run/systemd/system/[email protected] → /dev/null.
Node 'reactornfs-1' took over
Removed /run/systemd/system/[email protected].
NOTE: Depending on the type of services that you are managing with DRBD Reactor, there might be some downtime when DRBD Reactor stops services on reactornfs-0 and starts them on reactornfs-1. This is known as failover time. You should experiment with failovers in a test cluster before trying these commands in a production cluster.
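One way to put a number on failover time in a test cluster is to time how long a client-side check keeps failing during the transition. A generic sketch follows; the function is hypothetical, and you would substitute any check command that exercises your service, such as showmount against the virtual IP from this example:

```shell
# Print the number of whole seconds that elapse until the given
# command first succeeds, polling roughly once per second.
seconds_until_up() {
    local start
    start=$(date +%s)
    until "$@" >/dev/null 2>&1; do
        sleep 1
    done
    echo $(( $(date +%s) - start ))
}

# Usage, run from a client immediately after triggering the eviction:
#   seconds_until_up showmount -e 192.168.222.199
```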
You can further verify that resources have failed over to another node by entering the following command on reactornfs-0:
# drbd-reactorctl status nfs
Output should show that another node in the cluster now hosts the HA NFS services:
Promoter: Currently active on node 'reactornfs-1'
After verifying that services are running on another node in your cluster,
reactornfs-1 in this example, you can proceed to disable the promoter plugin within your DRBD Reactor cluster, as described earlier in the “Disabling the Promoter Plugin in Your Cluster” section. You can then do system maintenance on the
reactornfs-0 node, verify that things are good, and then repeat the instructions in the “Enabling the Promoter Plugin in Your Cluster” section.
As discussed in the “Proceeding With Caution” section, this article takes a cautious approach to doing system maintenance tasks within an HA cluster. If you have questions about the approach that might be right for your use case, you can contact the experts at LINBIT.
Hopefully this article inspires you to take on system maintenance tasks within your HA clusters with confidence. Whether you use Pacemaker or DRBD Reactor to achieve high availability, by having your storage resources backed by DRBD replication technology, you can minimize system downtime while keeping your systems patched and secure, which in turn will keep your users and customers safe and happy.