Using Fencing in Pacemaker Clusters on VirtualBox Hypervisors

This blog post describes how you can configure the fence_vbox fence agent in high-availability (HA) Pacemaker development clusters running on VirtualBox for Linux. Fencing is an important concept in HA clustering, and using it in development as you would in production helps align the development experience with that of a user running in production. When I looked online for resources about the fence_vbox fence agent, I found there really wasn’t much out there, so it seemed like a quick blog post might help someone, or at the very least, future me.

If you’re familiar with the general idea behind fencing and why it’s important, feel free to skip the next section and get right into configuring fence_vbox for your Pacemaker cluster.

What Fencing Is and Why It Is Important

LINBIT® recommends using node level fencing, also known as STONITH, in all production Pacemaker clusters to ensure that any and all types of system failures will result in a successful failover. Fencing, in the context of HA clustering, is taking a node in an unknown or unrecoverable state and placing it into a known state, thereby ensuring it is safe for a peer node to take over services. In most cases, the “known state” that a fencing device will put a node into is powered off. This guarantees that there are no longer any clients or services accessing the misbehaving node, and services that were running on that node can now be moved to a peer without introducing DRBD® split-brains, data divergence, or generally causing headaches for admins.

When LINBIT publishes blog posts or technical documents pertaining to HA Pacemaker clusters, we typically mention how important fencing or STONITH is, but then leave it disabled or as an exercise for the reader. The reason for these seemingly conflicting messages is that there are 70 different fence agents available for Pacemaker at the time of writing this blog post, and which one is correct for each individual reader depends on the environment they’re deploying into. In short, a given fence agent and configuration will work for some users, but not all. For example, Supermicro hardware will have generic IPMI interfaces for fencing, while HPE chassis will have iLO interfaces, and APC power hardware has yet another type of interface…

Configuring fencing can be tricky and certainly is not a “one size fits all” type of configuration. Continue on to learn how you can configure fencing in your Pacemaker development clusters running on VirtualBox for Linux.

Configure the fence_vbox Fencing Agent in Pacemaker

There are three basic parts to configuring the fence_vbox fence agent for Pacemaker. First, you’ll need to get the universally unique identifier (UUID) for each of the virtual machines (VMs) running on the hypervisor using the vboxmanage command line utility. Then, you’ll need to configure a user account on each cluster node that can access the hypervisor using SSH. Finally, you’ll configure and enable the fence agent in Pacemaker.

Retrieve VM UUIDs from the Hypervisor

On the hypervisor – likely your workstation – run the vboxmanage command below and record the outputs. You only need the UUIDs for the VMs participating in the Pacemaker cluster. In my example, I’m only interested in the nfs-$i nodes, as those are the nodes that comprise my HA cluster:

$ vboxmanage list runningvms
"nfs-0" {c4619b35-dfee-4aba-a7e6-6b6d7a16bd6f}
"nfs-1" {12a98853-ea6a-4b70-9da1-1b3e93960fd7}
"nfs-2" {7141d127-fd36-4e13-b8d0-72e0a295abd9}
"controller-0" {2ccae0d3-01b3-4031-af14-8ee57137e71b}
"satellite-0" {3e0e7254-97ff-4986-b0c6-32fc834cc949}
"satellite-1" {6fa06d1f-d798-4a1e-89f9-22068122fc34}
"satellite-2" {0878e4bb-a985-439b-9588-99c31313937e}

Verify that you associate each UUID with the correct VM name. If you make a mistake here, the wrong node will be fenced from the cluster when fence actions are called for another node, which is certainly suboptimal.
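
If you want to double-check a mapping, you can query a single VM by its UUID. For example, using the nfs-0 UUID from the output above, the Name field printed by the following command should be nfs-0:

$ vboxmanage showvminfo c4619b35-dfee-4aba-a7e6-6b6d7a16bd6f | grep "^Name:"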

Setting up a User for Fence Agent

Each node in the cluster will need to execute vboxmanage commands on the hypervisor for fence_vbox to operate correctly. You can use any user on the cluster nodes to do this, but I will create a new user named vbox_stonith for this purpose. The vbox_stonith user will need SSH access to the user account that created the VMs on the hypervisor machine. In my environment, the user who created the VMs is named matt, and the VMs can reach the hypervisor using the virtual bridge’s IP address, which is 192.168.122.1 on my system but could be different on yours.

Create the user on all VMs in your cluster using the following commands:

# mkdir /home/vbox_stonith
# useradd vbox_stonith -s /bin/bash -c "vbox STONITH account"
# chown -R vbox_stonith:vbox_stonith /home/vbox_stonith
# passwd vbox_stonith

With the user created, you can switch to the new user to create an SSH key pair and copy the public key onto the hypervisor for “passwordless” SSH access. The configurations later in this post reference /home/vbox_stonith/.ssh/id_rsa, so if your ssh-keygen generates a different key type by default (for example, an Ed25519 key), adjust the identity_file paths accordingly:

# su vbox_stonith
$ ssh-keygen
$ ssh-copy-id [email protected]

NOTE: The commands above create entries in the hypervisor user’s authorized_keys file. Once you’re done with these VMs, you’ll want to remove those entries to keep things tidy on your hypervisor.
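
A minimal way to do that cleanup later, assuming the keys kept their default vbox_stonith@<hostname> comments (for example, vbox_stonith@nfs-0), is to filter those lines out of the hypervisor user’s authorized_keys file:

$ sed -i '/vbox_stonith@nfs-/d' ~/.ssh/authorized_keys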

From the vbox_stonith account on each VM, test logging in to the hypervisor user’s account, accepting the hypervisor’s SSH host key fingerprint when prompted. You should not be prompted to accept the fingerprint again on subsequent logins.

$ ssh [email protected]

If you’re not prompted for a password and successfully logged into the hypervisor, disconnect (exit), and you’re ready to continue.
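
Because the fence agent will run vboxmanage over SSH non-interactively, it’s also worth confirming that a remote command works without any prompts at all. A quick check from each VM’s vbox_stonith account, using SSH’s BatchMode option so the command fails rather than asking for a password:

$ ssh -o BatchMode=yes [email protected] vboxmanage list runningvms

If this prints the same list of running VMs you saw earlier on the hypervisor, the SSH path the fence agent will use is ready.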

Configuring Fence Agents in the Pacemaker Cluster

Finally, you’re ready to configure fencing in your cluster. Depending on your distribution and which packages are already installed, the package that provides the fence_vbox agent may or may not already be present.

If you already have the /usr/sbin/fence_vbox fence agent on all cluster nodes, you can skip package installation and move straight to configuring the agents in your cluster. Otherwise, find and install the correct package on each cluster node using your distribution’s package manager.

If you’re on a DNF-based distribution you can install the package identified by the following command:

# dnf provides */fence_vbox
# dnf install <package-name>

If you’re on an APT-based distribution you can install the package identified by the following command:

# apt-file search fence_vbox
# apt install <package-name>
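
With the agent installed, you can optionally test it directly from a cluster node before involving Pacemaker. This is a sketch assuming the standard fence agent command line options, which correspond to the ip, login, plug, and identity_file parameters configured below; the UUID is nfs-0’s from earlier:

# fence_vbox --ip=192.168.122.1 --username=matt \
    --identity-file=/home/vbox_stonith/.ssh/id_rsa \
    --plug=c4619b35-dfee-4aba-a7e6-6b6d7a16bd6f --action=status

A status of ON tells you the agent can reach the hypervisor and find the VM. Note that this runs as root, just as Pacemaker will run the agent, so if SSH host key verification gets in the way, accept the hypervisor’s host key for the root user as well.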

With fence_vbox present on all nodes in the cluster, you can add the agents to the Pacemaker configuration. The commands below only need to be run from a single cluster node, as Pacemaker and Corosync will distribute the configuration changes to all other cluster nodes for you.

In the configurations below, we’ll set parameters on each fence agent that will be unique to your environment. Each VM will have its own fence agent. My VM nodes are named nfs-0, nfs-1, and nfs-2, so I’ve named the corresponding fence agents st_nfs-0, st_nfs-1, and st_nfs-2, respectively. You’ll also see that we’re setting location constraints on the fence agents to ensure that the agent responsible for fencing a cluster node never runs on that same node (nodes should never be trusted to fence themselves in HA clusters).

This list describes each parameter to be configured on each fence agent within the cluster:

  • ip: the IP address the cluster nodes use to communicate with the hypervisor
  • login: the username on the hypervisor that the fence agents SSH into
  • plug: the UUID for the VM we’re configuring the fence agent for
  • pcmk_host_list: the VM’s hostname as used in the Pacemaker cluster (crm_node -n)
  • identity_file: the SSH key created for and used by the vbox_stonith user to login to the hypervisor

If you’re using CRM shell to configure your Pacemaker cluster, enter the crm configure shell and make the configurations below.

# crm configure
crm(live/nfs-1)configure# primitive st_nfs-0 stonith:fence_vbox \
        params ip=192.168.122.1 login=matt plug=c4619b35-dfee-4aba-a7e6-6b6d7a16bd6f \
        pcmk_host_list=nfs-0 identity_file="/home/vbox_stonith/.ssh/id_rsa"
crm(live/nfs-1)configure# primitive st_nfs-1 stonith:fence_vbox \
        params ip=192.168.122.1 login=matt plug=12a98853-ea6a-4b70-9da1-1b3e93960fd7 \
        pcmk_host_list=nfs-1 identity_file="/home/vbox_stonith/.ssh/id_rsa"
crm(live/nfs-1)configure# primitive st_nfs-2 stonith:fence_vbox \
        params ip=192.168.122.1 login=matt plug=7141d127-fd36-4e13-b8d0-72e0a295abd9 \
        pcmk_host_list=nfs-2 identity_file="/home/vbox_stonith/.ssh/id_rsa"
crm(live/nfs-1)configure# location l_st-nfs-0_neveron_nfs-0 st_nfs-0 -INF: nfs-0
crm(live/nfs-1)configure# location l_st-nfs-1_neveron_nfs-1 st_nfs-1 -INF: nfs-1
crm(live/nfs-1)configure# location l_st-nfs-2_neveron_nfs-2 st_nfs-2 -INF: nfs-2
crm(live/nfs-1)configure# property stonith-enabled=true 
crm(live/nfs-1)configure# commit
crm(live/nfs-1)configure# quit

If you’re using PCS to configure your Pacemaker cluster, use the pcs command line utility to make the configurations below:

# pcs cluster cib add_fencing
# pcs -f add_fencing stonith create st_nfs-0 fence_vbox \
    ip=192.168.122.1 login=matt plug=c4619b35-dfee-4aba-a7e6-6b6d7a16bd6f \
    pcmk_host_list=nfs-0 identity_file="/home/vbox_stonith/.ssh/id_rsa"
# pcs -f add_fencing stonith create st_nfs-1 fence_vbox \
    ip=192.168.122.1 login=matt plug=12a98853-ea6a-4b70-9da1-1b3e93960fd7 \
    pcmk_host_list=nfs-1 identity_file="/home/vbox_stonith/.ssh/id_rsa"
# pcs -f add_fencing stonith create st_nfs-2 fence_vbox \
    ip=192.168.122.1 login=matt plug=7141d127-fd36-4e13-b8d0-72e0a295abd9 \
    pcmk_host_list=nfs-2 identity_file="/home/vbox_stonith/.ssh/id_rsa"
# pcs -f add_fencing constraint location st_nfs-0 avoids nfs-0
# pcs -f add_fencing constraint location st_nfs-1 avoids nfs-1
# pcs -f add_fencing constraint location st_nfs-2 avoids nfs-2
# pcs -f add_fencing property set stonith-enabled=true
# pcs cluster cib-push add_fencing

Fencing Configuration Verification and Testing

You should now have fencing configured and enabled within your development cluster, congrats!

Your crm_mon output should now look something like this:

# crm_mon -1r
Cluster Summary:
  * Stack: corosync
  * Current DC: nfs-0 (version 2.0.5.linbit-1.0.el8-ba59be712) - partition with quorum
  * Last updated: Tue Jan 10 00:22:49 2023
  * Last change:  Fri Jan  6 16:56:01 2023 by root via cibadmin on nfs-1
  * 3 nodes configured
  * 10 resource instances configured

Node List:
  * Online: [ nfs-0 nfs-1 nfs-2 ]

Full List of Resources:
  * Resource Group: g_nfs:
    * p_fs_drbd (ocf::heartbeat:Filesystem):     Started nfs-1
    * p_nfsserver       (ocf::heartbeat:nfsserver):      Started nfs-1
    * p_exportfs_root   (ocf::heartbeat:exportfs):       Started nfs-1
    * p_vip_ip  (ocf::heartbeat:IPaddr2):        Started nfs-1
  * Clone Set: ms_drbd_r0 [p_drbd_r0] (promotable):
    * Masters: [ nfs-1 ]
    * Slaves: [ nfs-0 nfs-2 ]
  * st_nfs-0    (stonith:fence_vbox):    Started nfs-2
  * st_nfs-1    (stonith:fence_vbox):    Started nfs-0
  * st_nfs-2    (stonith:fence_vbox):    Started nfs-0
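
Beyond crm_mon, you can ask the Pacemaker fencer directly which fence devices are registered, and which devices are able to fence a given node. A quick check from any cluster node (the second command assumes your Pacemaker version supports listing devices by target host):

# stonith_admin --list-registered
# stonith_admin --list nfs-0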

There are plenty of ways to test that fencing is in fact working, and it’s always a good practice to test a few cases. I usually start with terminating the Corosync process (use: pkill -9 corosync) on a cluster node, which from the perspective of the cluster looks like a node just went missing without warning. This node should be fenced out of the cluster by one of its peers. If you’ve configured Pacemaker and Corosync to start at boot, the fenced node should reboot and rejoin the cluster.
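
You can also trigger a fence action on demand rather than simulating a failure. For example, from a peer node, either of the following (depending on whether you use crmsh or pcs) should fence nfs-2 through its fence agent, which by default results in a reboot:

# crm node fence nfs-2
# pcs stonith fence nfs-2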

Some other methods of testing include using iptables to block network traffic in and out of a node, simply unplugging the virtual network interface, or, if you’ve configured your file system resource agents to monitor that I/O operations are succeeding, using fsfreeze to freeze the file system and simulate a storage issue. The fsfreeze case is interesting: Pacemaker will see the monitor operations fail and attempt to migrate services off of the frozen node, which will fail because Pacemaker cannot unmount a frozen file system. That is a simulated stop operation failure, which is one of the situations only fencing can recover a cluster from without human intervention.
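
As a concrete sketch of the fsfreeze test, assume the p_fs_drbd file system is mounted at a hypothetical /srv/nfs (yours will differ), and that the Filesystem resource’s monitor operation is configured with an OCF_CHECK_LEVEL deep enough to perform real I/O. On the node currently running the NFS services:

# fsfreeze --freeze /srv/nfs

From a peer node, watch crm_mon: once the monitor and subsequent stop operations fail, the frozen node should be fenced. If for some reason it is not fenced and rebooted, thaw the file system again with fsfreeze --unfreeze /srv/nfs.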

Conclusion

Hopefully this blog post has shown you how you can use fencing in your development clusters, or perhaps it has given you some ideas on how you can try breaking your development clusters to test your cluster’s fence agents. If anything in this blog doesn’t work as written in your environment, or you need more information than what’s here, don’t hesitate to reach out! LINBIT is always interested in feedback from the community. If you might be interested in learning how you can use the DRBD quorum feature as an alternative fencing implementation in Pacemaker clusters for DRBD and DRBD-constrained resources, you can read this LINBIT blog article on the topic. In deployments where it applies, the DRBD quorum feature can be easier to configure and understand than fencing in Pacemaker.

Matt Kereczman

Matt Kereczman is a Solutions Architect at LINBIT with a long history of Linux system administration and Linux systems engineering. Matt is a cornerstone of LINBIT's technical team and plays an important role in making LINBIT's and LINBIT's customers' solutions great. Matt was President of the GNU/Linux Club at Northampton Area Community College prior to graduating with honors from Pennsylvania College of Technology with a BS in Information Security. Open source software and hardware are at the core of most of Matt's hobbies.
