This blog post describes how you can configure the fence_vbox fence agent in high-availability (HA) Pacemaker development clusters running on VirtualBox for Linux. Fencing is an important concept in HA clustering, so using fencing in development the same way you would in production helps align your development experience with that of a user running in production. When I was looking online for resources about the fence_vbox fencing agent, I found there really wasn’t much out there, so it seemed a quick blog post might help someone out there, or at the very least, future me.
If you’re familiar with the general idea behind fencing and why it’s important, feel free to skip the next section and get right into configuring fence_vbox for your Pacemaker cluster.
What Fencing Is and Why It Is Important
LINBIT® recommends using node level fencing, also known as STONITH, in all production Pacemaker clusters to ensure that any and all types of system failures will result in a successful failover. Fencing, in the context of HA clustering, is taking a node in an unknown or unrecoverable state and placing it into a known state, thereby ensuring it is safe for a peer node to take over services. In most cases, the “known state” that a fencing device will put a node into is powered off. This guarantees that there are no longer any clients or services accessing the misbehaving node, and services that were running on that node can now be moved to a peer without introducing DRBD® split-brains, data divergence, or generally causing headaches for admins.
When LINBIT publishes blog posts or technical documents pertaining to HA Pacemaker clusters, we typically mention how important fencing or STONITH is, but then leave it disabled or as an exercise for the reader. The reason for these seemingly conflicting messages is that there are 70 different fence agents available for Pacemaker at the time of writing this blog post, and which one is correct for each individual reader depends on the environment they’re deploying into. In short, a given fence agent and its configuration will work for some users, but not all. For example, SuperMicro hardware will have generic IPMI interfaces for fencing, while HPE chassis will have iLO interfaces, and APC power hardware has yet another type of interface.
Configuring fencing can be tricky and certainly is not a “one size fits all” type of configuration. Continue on to learn how you can configure fencing in your Pacemaker development clusters running on VirtualBox for Linux.
Configure the fence_vbox Fencing Agent in Pacemaker
There are three basic parts to configuring the fence_vbox fence agent for Pacemaker. First, you’ll need to get the universally unique identifier (UUID) for each of the virtual machines (VMs) running on the hypervisor using the vboxmanage command line utility. Then, you’ll need to configure a user account on each cluster node that can access the hypervisor using SSH. Finally, you’ll configure and enable the fence agent in Pacemaker.
Retrieve VM UUIDs from the Hypervisor
On the hypervisor – likely your workstation – run the vboxmanage command below and record the output. You only need the UUIDs for the VMs participating in the Pacemaker cluster. In my example, I’m only interested in the nfs-$i nodes, as those are the nodes that comprise my HA cluster:
$ vboxmanage list runningvms
"nfs-0" {c4619b35-dfee-4aba-a7e6-6b6d7a16bd6f}
"nfs-1" {12a98853-ea6a-4b70-9da1-1b3e93960fd7}
"nfs-2" {7141d127-fd36-4e13-b8d0-72e0a295abd9}
"controller-0" {2ccae0d3-01b3-4031-af14-8ee57137e71b}
"satellite-0" {3e0e7254-97ff-4986-b0c6-32fc834cc949}
"satellite-1" {6fa06d1f-d798-4a1e-89f9-22068122fc34}
"satellite-2" {0878e4bb-a985-439b-9588-99c31313937e}
Verify that you associate each UUID with the correct VM name. If you make a mistake here, the wrong node will be fenced from the cluster when fence actions are called for another node, which is certainly suboptimal.
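If you want to double-check a mapping, you can ask VirtualBox directly. The example below uses the UUID of my nfs-0 VM; substitute one of your own UUIDs, and the name and UUID lines in the output should match what you recorded:
$ vboxmanage showvminfo c4619b35-dfee-4aba-a7e6-6b6d7a16bd6f --machinereadable | grep -E '^(name|UUID)='
name="nfs-0"
UUID="c4619b35-dfee-4aba-a7e6-6b6d7a16bd6f"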
Setting Up a User for the Fence Agent
Each node in the cluster will need to execute vboxmanage commands on the hypervisor for fence_vbox to operate correctly. You can use any user on the cluster nodes to do this, but I will create a new user named vbox_stonith to do so. The vbox_stonith user will need SSH access to the user account which created the VMs on the hypervisor machine. In my environment, the user who created the VMs is named matt, and the VMs can communicate with the hypervisor using the virtual bridge’s IP address, which is 192.168.122.1 on my system but could be different on yours.
Create the user on all VMs in your cluster using the following commands:
# mkdir /home/vbox_stonith
# useradd vbox_stonith -s /bin/bash -c "vbox STONITH account"
# chown -R vbox_stonith:vbox_stonith /home/vbox_stonith
# passwd vbox_stonith
With the user created, you can switch to the new user to create and copy SSH keys onto the hypervisor for “passwordless” SSH access:
# su vbox_stonith
$ ssh-keygen
$ ssh-copy-id [email protected]
NOTE: The commands above create entries in the hypervisor user’s authorized_keys file. Once you’re done with these VMs, you’ll want to remove those entries to keep things tidy on your hypervisor.
Test logging in from the vbox_stonith user account on each VM to the hypervisor user’s account, being sure to accept the hypervisor’s SSH fingerprint on each VM. You should not be prompted to accept the fingerprint for subsequent logins.
$ ssh [email protected]
If you’re not prompted for a password and successfully logged into the hypervisor, disconnect (exit), and you’re ready to continue.
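If you want a stricter check, you can tell SSH to fail rather than fall back to a password prompt, and confirm in the same step that the vbox_stonith user can run vboxmanage on the hypervisor. This is only a sketch that assumes the hypervisor IP address and username from my environment, and that vboxmanage is on the hypervisor user’s default PATH:
$ ssh -o BatchMode=yes [email protected] vboxmanage --version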
Configuring Fence Agents in the Pacemaker Cluster
Finally, you’re ready to configure fencing in your cluster. Depending on your cluster’s distribution and which packages are already installed, you may or may not already have the package providing the fence_vbox agent installed.
If you already have the /usr/sbin/fence_vbox fence agent on all cluster nodes, you can skip package installation and move straight to configuring the agents in your cluster. Otherwise, find and install the correct package on each cluster node using your distribution’s package manager.
If you’re on a DNF-based distribution you can install the package identified by the following command:
# dnf provides */fence_vbox
# dnf install <package-name>
If you’re on an APT-based distribution you can install the package identified by the following command:
# apt-file search fence_vbox
# apt install <package-name>
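Once the package is installed, you can list every parameter that the agent supports, which is handy if your environment needs options beyond the ones shown in this post. Depending on whether you use CRM shell or PCS, one of the following commands will print the agent’s metadata:
# crm ra info stonith:fence_vbox
# pcs stonith describe fence_vbox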
With fence_vbox present on all nodes in the cluster, you can add the agents to the Pacemaker configuration. The commands below only need to be run from a single cluster node, as Pacemaker and Corosync will distribute the configuration changes to all other cluster nodes for you.
In the configurations below, we’ll be setting parameters on each fence agent that will be unique to your environment. Each VM will have its own fence agent. My VM nodes are named nfs-0, nfs-1, and nfs-2, so I’ve named the corresponding fence agents st_nfs-0, st_nfs-1, and st_nfs-2, respectively. You’ll also see that we’re setting location constraints on the fence agents to ensure that the agent responsible for fencing a cluster node never runs on that node itself (nodes should never be trusted to fence themselves in HA clusters).
This list describes each parameter to be configured on each fence agent within the cluster (see the sketch after the list for a way to sanity-check these values):
ip: the IP address the cluster nodes use to communicate with the hypervisor
login: the username on the hypervisor that the fence agents SSH into
plug: the UUID of the VM that the fence agent is responsible for
pcmk_host_list: the VM’s hostname as used in the Pacemaker cluster (crm_node -n)
identity_file: the SSH key created for and used by the vbox_stonith user to log in to the hypervisor
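Before adding anything to Pacemaker, you can optionally sanity-check these values by calling the fence agent directly from one of the cluster nodes as the vbox_stonith user. The command below is only a sketch using the values from my environment and the common fence-agents option names; run fence_vbox -h to confirm the exact options your version supports. Note that pcmk_host_list is not part of this test, since it is an attribute Pacemaker uses to map node names to fence devices rather than an option the agent itself consumes.
$ fence_vbox --ip=192.168.122.1 --username=matt \
   --plug=c4619b35-dfee-4aba-a7e6-6b6d7a16bd6f \
   --identity-file=/home/vbox_stonith/.ssh/id_rsa --action=status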
If you’re using CRM shell to configure your Pacemaker cluster, enter the crm configure shell and make the configurations below.
# crm configure
crm(live/nfs-1)configure# primitive st_nfs-0 stonith:fence_vbox \
params ip=192.168.122.1 login=matt plug=c4619b35-dfee-4aba-a7e6-6b6d7a16bd6f \
pcmk_host_list=nfs-0 identity_file="/home/vbox_stonith/.ssh/id_rsa"
crm(live/nfs-1)configure# primitive st_nfs-1 stonith:fence_vbox \
params ip=192.168.122.1 login=matt plug=12a98853-ea6a-4b70-9da1-1b3e93960fd7 \
pcmk_host_list=nfs-1 identity_file="/home/vbox_stonith/.ssh/id_rsa"
crm(live/nfs-1)configure# primitive st_nfs-2 stonith:fence_vbox \
params ip=192.168.122.1 login=matt plug=7141d127-fd36-4e13-b8d0-72e0a295abd9 \
pcmk_host_list=nfs-2 identity_file="/home/vbox_stonith/.ssh/id_rsa"
crm(live/nfs-1)configure# location l_st-nfs-0_neveron_nfs-0 st_nfs-0 -INF: nfs-0
crm(live/nfs-1)configure# location l_st-nfs-1_neveron_nfs-1 st_nfs-1 -INF: nfs-1
crm(live/nfs-1)configure# location l_st-nfs-2_neveron_nfs-2 st_nfs-2 -INF: nfs-2
crm(live/nfs-1)configure# property stonith-enabled=true
crm(live/nfs-1)configure# commit
crm(live/nfs-1)configure# quit
If you’re using PCS to configure your Pacemaker cluster, use the pcs command line utility to make the configurations below:
# pcs cluster cib add_fencing
# pcs -f add_fencing stonith create st_nfs-0 fence_vbox \
ip=192.168.122.1 login=matt plug=c4619b35-dfee-4aba-a7e6-6b6d7a16bd6f \
pcmk_host_list=nfs-0 identity_file="/home/vbox_stonith/.ssh/id_rsa"
# pcs -f add_fencing stonith create st_nfs-1 fence_vbox \
ip=192.168.122.1 login=matt plug=12a98853-ea6a-4b70-9da1-1b3e93960fd7 \
pcmk_host_list=nfs-1 identity_file="/home/vbox_stonith/.ssh/id_rsa"
# pcs -f add_fencing stonith create st_nfs-2 fence_vbox \
ip=192.168.122.1 login=matt plug=7141d127-fd36-4e13-b8d0-72e0a295abd9 \
pcmk_host_list=nfs-2 identity_file="/home/vbox_stonith/.ssh/id_rsa"
# pcs -f add_fencing constraint location st_nfs-0 avoids nfs-0
# pcs -f add_fencing constraint location st_nfs-1 avoids nfs-1
# pcs -f add_fencing constraint location st_nfs-2 avoids nfs-2
# pcs -f add_fencing property set stonith-enabled=true
# pcs cluster cib-push add_fencing
Fencing Configuration Verification and Testing
You should now have fencing configured and enabled within your development cluster, congrats!
Your crm_mon output should now look something like this:
# crm_mon -1r
Cluster Summary:
* Stack: corosync
* Current DC: nfs-0 (version 2.0.5.linbit-1.0.el8-ba59be712) - partition with quorum
* Last updated: Tue Jan 10 00:22:49 2023
* Last change: Fri Jan 6 16:56:01 2023 by root via cibadmin on nfs-1
* 3 nodes configured
* 10 resource instances configured
Node List:
* Online: [ nfs-0 nfs-1 nfs-2 ]
Full List of Resources:
* Resource Group: g_nfs:
* p_fs_drbd (ocf::heartbeat:Filesystem): Started nfs-1
* p_nfsserver (ocf::heartbeat:nfsserver): Started nfs-1
* p_exportfs_root (ocf::heartbeat:exportfs): Started nfs-1
* p_vip_ip (ocf::heartbeat:IPaddr2): Started nfs-1
* Clone Set: ms_drbd_r0 [p_drbd_r0] (promotable):
* Masters: [ nfs-1 ]
* Slaves: [ nfs-0 nfs-2 ]
* st_nfs-0 (stonith:fence_vbox): Started nfs-2
* st_nfs-1 (stonith:fence_vbox): Started nfs-0
* st_nfs-2 (stonith:fence_vbox): Started nfs-0
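If you want to confirm that the fence devices are registered with Pacemaker’s fencing daemon, stonith_admin can list them from any cluster node. The second command asks which devices are able to fence a specific node, nfs-0 in this example:
# stonith_admin --list-registered
# stonith_admin --list nfs-0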
There are plenty of ways to test that fencing is in fact working, and it’s always good practice to test a few cases. I usually start with terminating the Corosync process (use pkill -9 corosync) on a cluster node, which from the perspective of the cluster looks like a node that just went missing without warning. This node should be fenced out of the cluster by one of its peers. If you’ve configured Pacemaker and Corosync to start at boot, the fenced node should reboot and rejoin the cluster.
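You can also trigger a fence action deliberately rather than simulating a failure. Run one of the following, depending on your tooling, from a node other than the one being fenced; expect the target node to power off or reboot, so only do this in a cluster you can afford to disrupt:
# crm node fence nfs-0
# pcs stonith fence nfs-0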
Some other methods of testing could be using iptables to block network traffic in and out of a node, simply unplugging the virtual network interface, or, if you’ve configured your Filesystem resource agents to monitor that I/O operations are succeeding, using fsfreeze to freeze the file system and simulate an issue with storage. Using fsfreeze is an interesting case, since Pacemaker will see the monitor operations failing and attempt to migrate services off of the frozen node, which should fail because Pacemaker cannot unmount a frozen file system. That’s a simulated stop operation failure, which is one of the situations only fencing can help a cluster recover from without human intervention.
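If you want to try the fsfreeze test, the sketch below assumes the Filesystem resource is mounted at /srv/nfs, which is a placeholder; substitute the mount point that your p_fs_drbd resource actually uses, and run the command on the node currently hosting the resource group:
# fsfreeze --freeze /srv/nfs
Watch crm_mon from another node; once the monitor and subsequent stop operations fail, the frozen node should be fenced. If you need to back out before fencing occurs, fsfreeze --unfreeze /srv/nfs will thaw the file system.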
Conclusion
Hopefully this blog post has shown you how you can use fencing in your development clusters, or perhaps it has given you some ideas on how to try breaking your development clusters to test your cluster’s fence agents. If anything in this blog post doesn’t work as written in your environment, or you need more information than what’s here, don’t hesitate to reach out! LINBIT is always interested in feedback from the community. If you’re interested in learning how you can use the DRBD quorum feature as an alternative fencing implementation in Pacemaker clusters for DRBD and DRBD-constrained resources, you can read this LINBIT blog article on the topic. In deployments where you can use it, the DRBD quorum feature can be easier to configure and understand than fencing in Pacemaker.