System maintenance, whether you plan for it or do it in response to failure, is a necessary part of managing infrastructure. Everyone hopes to only have to do planned-for system maintenance. The reality is that with so many pieces of a system puzzle, you also have to plan for failures. Regularly maintaining your systems, just like regularly maintaining a vehicle, is one way that you can minimize the likelihood of failures. We do our system maintenance quarterly here at LINBIT® in hopes that we avoid failures. These maintenance windows are where we install hardware, update software, test failovers, and give everything a health check to ensure that configurations still make sense.
One of the challenges of system maintenance is that it can mean system downtime, even when you plan for that maintenance. Your users are then left waiting for access to services while your IT department does whatever it needs to do. This leads to a bad user experience. In fact, that is precisely what led to this blog post. I was looking for a firmware update for a system board in the server room when this message from the manufacturer’s website greeted me:
I just had a bad user experience. To make matters worse, the webpage gave no indication of when it would be back up or available. I guess I’m supposed to keep checking back until I get what I was looking for, if I remember to.
Things Don’t Have to Be This Way
Here at LINBIT we use DRBD® for all of our systems. This ensures that they are always on and always available for the end users and our customers. If you landed on this site and are not familiar with DRBD, it is open source software that we at LINBIT have developed for more than 20 years. In its simplest form you can think of it as network RAID 1. However, rather than having independent disks as you would in RAID, you have two or more independent systems. You would now need to lose two or more entire systems before you experience downtime of services. Adding systems can further reduce your downtime risk, but for practical purposes, in most cases, three independent systems are all you need to build something solid.
Offering Uninterrupted Services During System Maintenance
One commonly overlooked benefit of using DRBD is that you can do system maintenance and upgrades with minimal to no interruption of services. The length of the interruption generally depends on the type of deployment. For example, if you’re using virtual machines (VMs), you can live migrate them between nodes by using DRBD. The result is no downtime: your VMs remain accessible throughout the system maintenance.
If you are running services on hardware and you need to stop and restart those services, your downtime will be limited to the failover time. Failover time is how long it takes those services to stop on one system and start on another. In many cases, this will just be a matter of seconds.
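To put a number on failover time for your own deployment, you can poll the service while you trigger a failover and record the longest stretch during which it did not answer. The following is a minimal Python sketch, not a LINBIT tool: `measure_outage` and `tcp_probe` are hypothetical helper names, and the address, port, and polling interval are assumptions that you would replace with your own values.

```python
import socket
import time
from typing import Callable, Optional


def tcp_probe(host: str, port: int, timeout: float = 1.0) -> Callable[[], bool]:
    """Return a probe that reports True when a TCP connection succeeds."""
    def probe() -> bool:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False
    return probe


def measure_outage(
    probe: Callable[[], bool],
    duration: float,
    interval: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll `probe` for `duration` seconds; return the longest outage in seconds.

    The result is only as precise as the polling interval.
    """
    start = clock()
    down_since: Optional[float] = None
    longest = 0.0
    while True:
        now = clock()
        if now - start >= duration:
            break
        if probe():
            if down_since is not None:
                # Service answered again: close the current outage window.
                longest = max(longest, now - down_since)
                down_since = None
        elif down_since is None:
            # First failed probe of a new outage window.
            down_since = now
        sleep(interval)
    if down_since is not None:
        # Still down when the observation period ended.
        longest = max(longest, clock() - down_since)
    return longest
```

For example, `measure_outage(tcp_probe("203.0.113.10", 443), duration=60)` would watch a placeholder service address for one minute while you migrate services from one node to the other.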
An Example Scenario
How do you do this? Let’s say that you have two server nodes: Frodo and Sam. Frodo is Primary (active and running services) and Sam is Secondary (passive, but with an up-to-date copy of Frodo’s data that is necessary to potentially run services). In this example let’s also say that you need to update a system board’s firmware and upgrade the RAM of each of your servers. Upgrading RAM would ordinarily require you to power off the system, and updating firmware might require that you reboot the system. During these times, services running on the system would be unavailable.
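Before touching either node, it helps to confirm that the cluster really is in this state, with every copy of the data up to date. The sketch below is one hedged way to check that from Python, assuming DRBD 9 style `drbdadm status` output in which disk states appear as `disk:` and `peer-disk:` fields. The function names are hypothetical, and `read_status` needs the DRBD utilities and sufficient privileges on the node where it runs.

```python
import subprocess
from typing import List, Optional


def disk_states(status_text: str) -> List[str]:
    """Collect every `disk:` and `peer-disk:` value from `drbdadm status` text."""
    states = []
    for token in status_text.split():
        key, _, value = token.partition(":")
        if key in ("disk", "peer-disk"):
            states.append(value)
    return states


def all_up_to_date(status_text: str) -> bool:
    """True only if at least one disk is reported and all are UpToDate."""
    states = disk_states(status_text)
    return bool(states) and all(state == "UpToDate" for state in states)


def read_status(resource: Optional[str] = None) -> str:
    """Run `drbdadm status`, optionally for a single resource, and return its output."""
    command = ["drbdadm", "status"]
    if resource is not None:
        command.append(resource)
    result = subprocess.run(command, check=True, capture_output=True, text=True)
    return result.stdout
```

On a node in the cluster, `all_up_to_date(read_status())` returning `True` would suggest that it is safe to continue. Treating empty output as "not up to date" is a deliberately cautious design choice in this sketch.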
To perform this maintenance without interrupting services, you would follow these steps:
- First, put your cluster resource manager (CRM), for example, Pacemaker, into maintenance mode. This is similar to a pause button. You are telling the CRM not to do anything that it might ordinarily do when systems go down.
- Next, power off Sam (the secondary server).
- You can now install any hardware that you need to, in this example, new RAM.
- Power the system up and enter the BIOS or UEFI and verify that the system recognizes the new RAM.
- You can now update the system board firmware, which might require a reboot.
- Boot Sam into its operating system.
- At this point you can install any operating system or software updates and reboot, if you need to.
- Once Sam is back up and you have verified that everything is in good condition, including that your DRBD resource or resources are in an up-to-date state, bring the CRM out of maintenance mode.
- Next, migrate services to Sam – again depending on how you have configured things, this might cause a few seconds of unavailability of services.
- Verify that services that are now running on Sam are available.
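- Repeat the maintenance steps for Frodo, which is now the secondary server: start by putting the CRM into maintenance mode and finish by bringing it back out once Frodo is verified.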
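The ordering matters more than any single command, so it can be useful to write the sequence down as a checklist that you can print or review before the maintenance window. The following Python sketch only generates that checklist and does not run anything against a cluster. The function names are hypothetical, and it assumes that the final "repeat" step means running the whole per-node sequence again for the former primary.

```python
from typing import List


def node_maintenance_steps(node: str) -> List[str]:
    """Steps carried out on one node while it is not running services."""
    return [
        # With Pacemaker and the pcs shell, this is commonly done with
        # `pcs property set maintenance-mode=true` (an example, not the only way).
        "put the CRM into maintenance mode",
        f"power off {node}",
        f"install new hardware in {node}",
        f"power on {node} and verify the hardware in BIOS or UEFI",
        f"update system board firmware on {node}",
        f"boot {node} into its operating system",
        f"install software updates on {node} and reboot if needed",
        f"verify DRBD is up to date on {node}, then leave maintenance mode",
    ]


def rolling_plan(primary: str, secondary: str) -> List[str]:
    """Maintain the secondary first, move services over, then maintain the old primary."""
    plan = node_maintenance_steps(secondary)
    plan.append(f"migrate services from {primary} to {secondary}")
    plan.append(f"verify services are available on {secondary}")
    plan += node_maintenance_steps(primary)
    return plan


def format_plan(plan: List[str]) -> str:
    """Number the steps so that they can be printed as a checklist."""
    return "\n".join(f"{number}. {step}" for number, step in enumerate(plan, start=1))
```

Calling `print(format_plan(rolling_plan("Frodo", "Sam")))` prints the numbered checklist for the example cluster. Services move only once, after Sam has been fully verified, which is what keeps the interruption down to a single failover.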
Proceed to Mount Doom. Throw ring into fire. Return to Shire.
There you have it, one of the better-kept secrets of DRBD: you can use it to minimize system maintenance downtime and prevent users and customers from having a bad experience.