The Stretched Cluster Disaster Recovery Strategy

Posted on March 19, 2024

My colleague, Yusuf Yildiz, a solutions architect here at LINBIT®, strongly believes disaster recovery (DR) should be thought of as a plan, rather than a solution. There cannot be a one-size-fits-all DR solution that works for everyone. Each organization needing DR will place different values on different services and applications, and will have different resources available to implement DR. These resources might be hardware, personnel, number of sites, and so on. Of course, organizations will also have different budgets with which they can implement a DR strategy. In this sense, each organization needs to come up with a DR plan that is realistic given the organization’s individual constraints.

That said, while there is no single DR solution that works for everyone, there are common DR strategies that an organization can adapt to its particular DR needs and resources. One such strategy is the stretched cluster. A brief discussion of this DR strategy and an overview of the LINBIT approach to implementing it are the topics of this article.

An Overview of the Stretched Cluster Strategy

A stretched cluster is a DR strategy that, if you can implement it, can give you near 100% uptime, even when an entire site fails. For a stretched storage cluster, the LINBIT approach involves at minimum five nodes and three sites. Two of the sites each hold two copies of each resource that you want to keep up and running. The third site is an arbitrator or tiebreaker site.

Because replicating data between sites needs to happen in real time or as near to real time as possible, the stretched cluster strategy requires high-bandwidth, low-latency connections between sites, at or near LAN speed. Realistically, this network requirement limits the stretched cluster deployment to campus or metro area networks. If you need a DR plan that works across longer distances, you will need a different DR strategy and will likely need to accept failovers that are slower than real time.

For the reasons just mentioned, deploying a stretched cluster can come with a big cost. You will need to weigh the cost against the value that you place on your applications, services, and data. For example, you will need to calculate potential loss of revenue or loss of customers if your service offerings remain unavailable.

If any disruption is unacceptable to your operations, and you can afford the costs, a properly implemented stretched cluster is your ticket to near 100% uptime in the face of site disaster.

The LINBIT Approach to a Stretched Cluster

The LINBIT approach to the stretched cluster DR strategy focuses on the storage aspects of the strategy and involves a minimum of five nodes, spread across three sites, as shown in the diagram that accompanies this article.

For further resiliency, albeit with increased costs, you can have more nodes participate in the stretched cluster. Because of the additional costs of deploying more nodes, this might be a route you can take if you have already implemented more than five nodes across your sites, for example, for compute or other cluster design reasons. With some additional preparation and setup, you could add these additional nodes to your stretched storage cluster.

The Two Main Sites

In the LINBIT approach to a stretched storage cluster, there are two “main” sites and a third “tiebreaker” site. The two main sites each need to have two copies of each replicated block device. LINBIT SDS (LINSTOR® and DRBD®) provides the solution that replicates managed block devices across your storage cluster nodes and provides the basis for making them highly available by using a cluster resource manager (CRM).
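The layout described so far (at least five nodes, three sites, and two copies of each block device at each main site) can be written down as a small consistency check. The following is a minimal sketch in Python, not LINSTOR code: the `Node` class, the `check_layout` function, and all node and site names are hypothetical and exist only for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str            # hypothetical node name
    site: str            # hypothetical site label
    holds_replica: bool  # True if the node stores a copy of the block device


def check_layout(nodes, main_sites, tiebreaker_site):
    """Return a list of problems; an empty list means the layout fits the description."""
    problems = []
    sites = {n.site for n in nodes}
    if len(nodes) < 5:
        problems.append("fewer than five nodes")
    if len(sites) < 3:
        problems.append("fewer than three sites")
    # Count the nodes that hold a replica at each site.
    replicas = Counter(n.site for n in nodes if n.holds_replica)
    for site in main_sites:
        if replicas[site] < 2:
            problems.append(f"site {site} holds fewer than two replicas")
    if tiebreaker_site not in sites:
        problems.append("no node at the tiebreaker site")
    return problems


# Assumed example layout: two storage nodes per main site, one tiebreaker node.
layout = [
    Node("a1", "site-a", True),
    Node("a2", "site-a", True),
    Node("b1", "site-b", True),
    Node("b2", "site-b", True),
    Node("t1", "site-c", False),
]
print(check_layout(layout, ["site-a", "site-b"], "site-c"))  # prints []
```

Dropping the tiebreaker node from `layout` makes the check report the missing pieces, which is a quick way to see why five nodes and three sites is the minimum here.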

Open Source Software Components in a Stretched Cluster

DRBD is the open source software that replicates your block devices across your stretched cluster. LINSTOR is the open source software-defined storage (SDS) solution that makes creating and managing your block devices simpler, more flexible, and more scalable than, for example, editing and maintaining block device configurations manually. The CRM could be open source software such as Pacemaker or DRBD Reactor.

Balancing Block Devices Between Sites

LINBIT SDS typically uses an active/passive configuration for its storage replication. With this configuration, a replicated block device is only writable on one node at a time: the primary (active) node. DRBD then replicates write operations made on the primary node to a peer node or nodes in the cluster: the secondary (passive) node or nodes. A write to a replicated block device is not considered complete until it is replicated to all the secondary (passive) nodes in the cluster.
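To make the "not complete until replicated" rule concrete, here is a toy model in Python. It is only a sketch of the idea of synchronous replication, under the assumption that each replica is an in-memory dictionary. It is not how DRBD is implemented, and it does not model how DRBD treats peers that have disconnected. The `Replica` class and `replicated_write` function are hypothetical names.

```python
class Replica:
    """Stands in for one node's copy of a block device (block number -> data)."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}
        self.online = True

    def apply(self, block, data):
        """Store the data and acknowledge it, unless this replica is offline."""
        if not self.online:
            return False
        self.blocks[block] = data
        return True


def replicated_write(primary, secondaries, block, data):
    """Report the write as complete only if the primary and every secondary applied it."""
    if not primary.apply(block, data):
        return False
    # Send the write to every secondary, then check all acknowledgments.
    acks = [s.apply(block, data) for s in secondaries]
    return all(acks)


primary = Replica("a1")
secondaries = [Replica("a2"), Replica("b1"), Replica("b2")]

print(replicated_write(primary, secondaries, 0, b"hello"))  # prints True

secondaries[1].online = False  # simulate a secondary that cannot acknowledge
print(replicated_write(primary, secondaries, 1, b"world"))  # prints False
```

Because the caller waits for every acknowledgment, the round-trip time to the farthest replica is added to every write, which is why the links between sites need to be low latency.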

However, in a stretched cluster, where there are two main sites, you do not need to leave one site entirely passive, costing you money while in a standby state. You can “balance” your storage resources across your two sites.

For example, if you have two replicated block devices, for two different applications, one block device can be primary on one site, and the other block device can be primary on the other site. You can accomplish this by configuring your CRM, for example, so that certain resources prefer to run on nodes at a site that you specify. You can also set auxiliary properties in LINSTOR to create replica auto-placement constraints for your replicated block devices, causing replicas to be placed on nodes in certain sites.

With the deployment architecture shown in the diagram accompanying this article, when a failover event happens, there are replicas of your block devices across your two sites. This way, either site can take over services when the other site goes down.
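The balancing and failover behavior described above can be sketched as a simple preference rule: each block device has a preferred site, and it becomes primary elsewhere only when no healthy node remains at that site. The Python below is a minimal illustration under that assumption. It is not Pacemaker, DRBD Reactor, or LINSTOR configuration, and the function, resource, node, and site names are all hypothetical.

```python
def choose_primary(preferred_site, nodes_by_site, healthy):
    """Pick a node to be primary: try the preferred site first, then the other sites."""
    other_sites = sorted(s for s in nodes_by_site if s != preferred_site)
    for site in [preferred_site] + other_sites:
        for node in nodes_by_site[site]:
            if node in healthy:
                return node
    return None  # no healthy node holds a replica


# Assumed example: two main sites with two storage nodes each.
nodes_by_site = {"site-a": ["a1", "a2"], "site-b": ["b1", "b2"]}

# Each replicated block device prefers a different site, so neither site sits idle.
preferences = {"app1-volume": "site-a", "app2-volume": "site-b"}

all_healthy = {"a1", "a2", "b1", "b2"}
normal = {
    res: choose_primary(site, nodes_by_site, all_healthy)
    for res, site in preferences.items()
}
print(normal)  # prints {'app1-volume': 'a1', 'app2-volume': 'b1'}

# Site A is lost: both block devices are now primary on site B.
site_b_only = {"b1", "b2"}
after_site_a_loss = {
    res: choose_primary(site, nodes_by_site, site_b_only)
    for res, site in preferences.items()
}
print(after_site_a_loss)  # prints {'app1-volume': 'b1', 'app2-volume': 'b1'}
```

In a real deployment, the CRM expresses this preference and LINSTOR placement constraints decide where the replicas live. The sketch only shows the decision that those two pieces make together.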

The details of configuring this resource balancing are beyond the scope of this article. However, I’m making a note to myself that they could be the topic for a future blog post or how-to technical guide. If you are stuck on the details particular to your organization’s deployment needs, you can reach out to the LINBIT team for help in the meantime.

The Tiebreaker Site

To avoid a split-brain cluster, your stretched cluster needs to implement an arbitrator site that can avoid the situation where more than one site “thinks” that it should be the primary site for a given block device. This arbitrator, or tiebreaker, site is connected to both of your main sites and forms a quorum partition with whichever site is primary for a given resource. Only one main site in a quorum partition can be primary at any given time. The mechanism for determining quorum can either be DRBD internal quorum logic or else a separate mechanism. For example, the open source Booth software is often used as a quorum mechanism with clusters that use Pacemaker as a CRM.
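A short calculation shows why the tiebreaker matters. The Python sketch below assumes a plain majority rule, where a partition has quorum only if it contains more than half of all cluster nodes. It is a simplification for illustration and does not reproduce the full DRBD quorum logic or the Booth mechanism. The `has_quorum` function and the node names are hypothetical.

```python
def has_quorum(partition, all_nodes):
    """A partition has quorum when it contains a strict majority of all nodes."""
    return len(partition) > len(all_nodes) / 2


# Without a tiebreaker: four nodes, and the link between the sites fails.
four_nodes = {"a1", "a2", "b1", "b2"}
print(has_quorum({"a1", "a2"}, four_nodes))  # prints False
print(has_quorum({"b1", "b2"}, four_nodes))  # prints False

# With a tiebreaker node that can still reach site A only.
five_nodes = {"a1", "a2", "b1", "b2", "t1"}
print(has_quorum({"a1", "a2", "t1"}, five_nodes))  # prints True
print(has_quorum({"b1", "b2"}, five_nodes))        # prints False
```

With four nodes split evenly, neither side can safely continue. With the fifth node, exactly one side holds a majority, so at most one site can be primary for a given block device.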

The cluster node at the tiebreaker site does not need to be as costly a deployment as the storage nodes at your main sites. In the LINBIT stretched cluster approach, the tiebreaker node only needs to be able to run the LINSTOR controller service, for LINSTOR controller high availability, and the LINSTOR satellite service for DRBD quorum purposes. However, even though the tiebreaker node will be running the LINSTOR satellite service, you do not need to include additional physical storage, outside of what is needed for the node’s operating system. This is because you can take advantage of the LINBIT SDS concept of diskless attachment, also known as DRBD clients.

Conclusion

This is the LINBIT approach to a stretched cluster for creating a DR strategy for your mission-critical services. This article has only been an overview of the concepts and the strategy. To implement this strategy, you will have to work out the details that will be particular to your applications, services, and needs. You will also need to configure your LINBIT SDS cluster, CRM, and other components, such as networking equipment, accordingly.

This is not something you need to do alone, however. The LINBIT team of expert solutions architects and engineers can work with your team to deploy the DR strategy that will work best for your environment and needs. The LINBIT team works with organizations of varying sizes, across different industries, and with different resources and needs, worldwide. Reach out to us.

Michael Troutman

Michael Troutman has an extensive background working in systems administration, networking, and technical support, and has worked with Linux since the early 2000s. Michael's interest in writing goes back to a childhood filled with avid reading. Somewhere he still has the rejection letter from a publisher for a choose-your-own-adventure style novella, set in the world of a then-popular science fiction role-playing game, cowritten with his grandmother (a romance novelist and travel writer) at the tender age of 10. In the spirit of the open source community in which LINBIT thrives, Michael works as a Documentation Specialist to help LINBIT document its software and its many uses so that it may be widely understood and used.
