LINBIT® and many other organizations that specialize in high-availability (HA) solutions use and support the cluster manager Pacemaker. When properly configured, Pacemaker does a great job of maintaining high availability for services, applications, and other resources within a single-site cluster on a LAN. However, as robust as Pacemaker is, it was never designed to operate across a WAN or any other high-latency network. That is where some additional tools, such as Booth, DRBD®, and DRBD Proxy, can complete an HA solution and offer true disaster recovery (DR) for your services and for the data on which they might rely.
The Need For a Multisite Disaster Recovery Solution
While single-site high availability might be enough in some cases, there might come a time when the potential loss of revenue or customer confidence in the services that you offer justifies the expense and effort of setting up a DR solution. With a properly set up multisite DR solution (sometimes called geo-clustering), even if a single site that hosts your services and data goes down, a second site is ready to take over the hosting role.
DRBD & DRBD Proxy
DRBD is high-performance data replication software that allows for HA data in a single-site Pacemaker cluster. However, like Pacemaker, it was never designed to operate across a WAN. That is where DRBD Proxy comes in. DRBD Proxy is LINBIT-developed software that provides the compression and cache operations that complement DRBD data replication and allow for a true DR solution across a WAN.
The Booth Ticket Manager
To overcome the issue of Pacemaker not being able to orchestrate failovers between data centers and across long distances, the booth add-on for Pacemaker was conceived back in late 2011. LINBIT has been involved in the development of booth since 2013, and has been offering it as a supported solution since 2015.
Have Your Ticket Ready
Booth addresses the shortcomings of Pacemaker by introducing the concept of “tickets”. Booth constrains Pacemaker’s ability to start particular resources by issuing or revoking tickets. Pacemaker can start ticket-constrained resources only on the site that holds a valid booth ticket. This can be thought of as being similar to the token ring networks of days past. If a site loses communication with the rest of the booth cluster, its ticket will not renew, and Pacemaker will stop the constrained resources within the expected time frame. So that booth can ensure that there is no cluster split, and that two sites never hold the ticket at the same time, you should configure an arbitrator node to achieve quorum, and set an expiration period on the tickets.
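To make the ticket idea more concrete, here is a minimal Python sketch of a lease-style ticket with an expiration period. It is a toy model only, not booth's actual implementation or API: the `Ticket` class, its method names, the site names, and the 60-second expiry are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Ticket:
    """Toy model of a lease-style ticket (hypothetical, not booth's real data structure)."""
    name: str
    expire: float                    # lease length in seconds (assumed value)
    holder: Optional[str] = None     # site that was last granted the ticket
    valid_until: float = 0.0         # time at which the current lease lapses

    def grant(self, site: str, now: float) -> bool:
        """Grant the ticket to a site unless another site still holds a valid lease."""
        if self.holder not in (None, site) and now < self.valid_until:
            return False
        self.holder = site
        self.valid_until = now + self.expire
        return True

    def renew(self, site: str, now: float) -> bool:
        """Extend the lease; only the current holder of an unexpired lease can renew."""
        if self.holder != site or now >= self.valid_until:
            return False
        self.valid_until = now + self.expire
        return True

    def may_run_resources(self, site: str, now: float) -> bool:
        """A site may run ticket-constrained resources only while its lease is valid."""
        return self.holder == site and now < self.valid_until


# Hypothetical timeline: site-a holds the ticket, then loses contact and stops renewing.
ticket = Ticket(name="ticket-a", expire=60.0)
ticket.grant("site-a", now=0.0)
ticket.renew("site-a", now=30.0)                # lease now runs until t=90
blocked = ticket.grant("site-b", now=50.0)      # refused: site-a's lease is still valid
takeover = ticket.grant("site-b", now=100.0)    # allowed: site-a's lease has expired
```

The point of the expiration period in this sketch is that a site which can no longer renew is guaranteed to lose the right to run resources after a bounded time, so another site can safely take over without both holding the ticket at once.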
The Booth Arbitrator Node
The arbitrator node does not have to have the same specifications as the other nodes that host data or services in your clusters. Its only role is to help achieve quorum in your DR solution. It only needs to run the booth arbitrator software and have a WAN connection to your other two sites. You can either use a minimally set up physical machine if you have the luxury of a third site, or else a virtual machine instance in a public cloud if you do not.
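The reason a third member helps can be shown with simple majority arithmetic. The following Python sketch is an illustration of a strict-majority quorum rule, not booth's actual election logic. The member names and the `partition_has_quorum` function are hypothetical.

```python
from typing import AbstractSet

# Hypothetical booth cluster members: two sites plus one arbitrator.
MEMBERS = frozenset({"site-a", "site-b", "arbitrator"})


def partition_has_quorum(partition: AbstractSet[str],
                         members: AbstractSet[str] = MEMBERS) -> bool:
    """Return True if the members that can still reach each other form a strict majority."""
    reachable = len(set(partition) & set(members))
    return reachable > len(members) // 2


# The WAN link between the two sites fails, but site-a can still reach the arbitrator.
side_a = {"site-a", "arbitrator"}
side_b = {"site-b"}
a_wins = partition_has_quorum(side_a)    # 2 of 3 members: majority
b_wins = partition_has_quorum(side_b)    # 1 of 3 members: no majority

# Without an arbitrator, the same link failure leaves neither site with a majority.
two_sites = frozenset({"site-a", "site-b"})
a_alone = partition_has_quorum({"site-a"}, two_sites)    # 1 of 2 members: no majority
```

Under this rule, at most one side of a split can ever hold a majority, which is why the arbitrator only needs to be reachable and does not need the capacity to host data or services.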
Redirecting Service Traffic To a Failover Site
While running Pacemaker with booth addresses the issues of high availability across a WAN for disaster recovery, one issue which has always proven difficult is redirecting client traffic to the new site.
Past demonstrations of booth have simply used round-robin DNS (such as in my Booth Geo Cluster Demo). While round-robin DNS is simple to configure, it is inefficient: because only one site is active at a time, every other request is sent to the inactive site and discarded. Plenty of other specialty options exist, such as a software load balancer (for example, HAProxy), a hardware load balancing appliance, or a dynamic DNS update solution (for example, Route 53, DynDNS, and others).
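The inefficiency of round-robin DNS in a two-site setup can be illustrated with a small Python simulation. This is a simplified sketch that assumes a resolver which strictly alternates between two site addresses and that only one site is serving at a time. The function name and the addresses (taken from the reserved documentation ranges) are hypothetical.

```python
from itertools import cycle
from typing import Sequence


def round_robin_success_rate(site_ips: Sequence[str],
                             active_ip: str,
                             requests: int) -> float:
    """Fraction of requests that reach the active site when addresses are handed out in turn."""
    resolver = cycle(site_ips)    # simplified model of round-robin DNS answers
    served = sum(1 for _ in range(requests) if next(resolver) == active_ip)
    return served / requests


# Two sites, of which only the first currently holds the ticket and runs the service.
sites = ["203.0.113.10", "198.51.100.10"]
rate = round_robin_success_rate(sites, active_ip="203.0.113.10", requests=100)
```

In this model, half of the requests are directed to the site that is not running the service, which is the waste that a load balancer or a dynamic DNS update is meant to avoid.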
Fortunately, Pacemaker allows you to use a virtual IP address resource to offer a single IP address, through which your resources can be reached, regardless of which cluster node hosts the resources.
Setting Up a Multisite Disaster Recovery Solution
To guide you through setting up the solution in this article, you can download the Geo-Clustering with DRBD 9 and DRBD Proxy in RHEL 8 technical guide. This guide describes, step-by-step, how to configure a disaster recovery solution by using Red Hat Enterprise Linux (RHEL) 8, Pacemaker, Booth, and DRBD Proxy for data replication, to offer a highly available, multisite service. For an example use case, the guide uses a MariaDB service.
A Video Overview & DRBD Proxy Video Demonstration
For a brief but highly detailed overview of this multisite HA and DR solution, and an explanation of the components used, beyond what is covered in this article, check out the Geo-Clustering with Pacemaker & DRBD Proxy video on the LINBIT YouTube channel. For a demonstration of DRBD Proxy replicating data between two sites, check out the Linux Disaster Recovery Replication with DRBD Proxy demonstration video, also on the LINBIT YouTube channel.
📝 IMPORTANT: DRBD Proxy is one of the few parts of the LINBIT software family that is not published under an open source license. For a free evaluation license, if you are interested in this solution, contact LINBIT sales.