This article describes the reasoning behind and the benefits of a high-availability (HA) solution for providing services and applications. This article is intended as a general overview and introduction to HA, with a nontechnical reader in mind.
Background on High-Availability
According to techtarget.com:
High-availability is the ability of a system to operate continuously without failing for a designated period of time.
So, HA is a term used to describe a system that meets an agreed-upon operational performance level. The organization that provides the system determines what that performance level is. However, an HA system usually provides services or applications that users rely on. Factors such as the cost of downtime to the organization and the patience of users or customers during an outage will likely play an important role in determining the performance level.
Data shares, applications, and e-commerce websites are just some of the places where you will find HA solutions in use. HA used to be implemented exclusively by large corporations and organizations. However, because of the ubiquity of the dependency on digital solutions and transactions, now even small organizations rely on HA solutions.
Further Defining High-Availability
As the name suggests, high-availability (HA) is a term that describes the ability of a system’s components to continue functioning over a specified time frame. HA is measured relative to a fully (100%) operational system. A popular standard of availability for a product or service is “five 9s” (99.999%) availability. This translates to about 5.26 minutes of downtime per year.
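The arithmetic behind the “nines” is straightforward: subtract the availability percentage from 100% and apply the remainder to the minutes in a year. A small sketch, using a 365.25-day year:

```python
# Downtime per year implied by an availability percentage.
# Using a 365.25-day year, "five 9s" works out to roughly 5.26 minutes.

MINUTES_PER_YEAR = 365.25 * 24 * 60

def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the minutes of allowed downtime per year."""
    unavailability = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * unavailability

for nines in ("99.9", "99.99", "99.999"):
    minutes = downtime_minutes_per_year(float(nines))
    print(f"{nines}% availability -> {minutes:.2f} minutes of downtime/year")
```

Each additional 9 cuts the allowed downtime by a factor of ten, which is why every extra 9 is progressively harder (and costlier) to achieve.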
To achieve such minimal downtime, an HA system must be well designed and its components well tested before an organization deploys it into production.
Significant parts of an HA system include backup and failover processing, as well as data storage and access. All the individual components that work to provide these parts of the system must always be “available”.
The practical result of this is that an HA solution duplicates components of the system to avoid single points of failure. If one of the components fails, a failover process activates and transfers handling to the redundant component, returning the system to a “normal” functioning state, often within seconds. It is important to mention that “component” here can refer to something physical, such as a server or a network card within a server, or something nonphysical, such as user data or a software application.
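The failover idea can be sketched in a few lines. This is a deliberately minimal illustration, not a real cluster manager: the component names and the boolean health check are hypothetical stand-ins for whatever monitoring a production HA system would use.

```python
# Minimal sketch of failover: a monitor checks the active component and,
# when it is unhealthy, transfers handling to a redundant standby.
# Component names and the health flag are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Component:
    name: str
    healthy: bool = True

def failover(active: Component, standby: Component) -> Component:
    """Return the component that should currently handle requests."""
    if active.healthy:
        return active
    # The active component has failed: promote the redundant standby.
    print(f"failing over from {active.name} to {standby.name}")
    return standby

primary = Component("server-a", healthy=False)  # simulate a failure
secondary = Component("server-b")
current = failover(primary, secondary)
print(f"requests are now handled by {current.name}")
```

Real HA software must also handle harder cases this sketch ignores, such as deciding *whether* a component is truly down (rather than just slow) and avoiding split-brain situations where both components believe they are active.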
Creating a Highly Available System
Creating a highly available system means setting up a system that can remain operational over a long period of time. HA allows services with critical importance to be “always” online and undisturbed, regardless of location or external circumstances. Even if some of the system components fail, the system will continue to provide services and applications. A user of the applications or services should not notice that a failover has occurred. Setting up such a system involves ensuring redundancies and, when data is involved, replication.
However, simply adding components to a system does not automatically make it highly available. On the contrary, the greater the system’s complexity, the greater its risk of failure.
HA solutions create redundancy within a cluster of machines to eliminate any single point of failure. This includes multiplying network connections with redundant cables, network interface cards, switches, and switch ports. Redundancy is not limited to the network: all components of the system, including storage and compute, need to be redundant to achieve HA. The HA solution may even require geographically distributed clusters, to tolerate failures due to events such as natural disasters or power outages that may affect an entire region’s operations.
Modern system architectures also use load balancing devices or software within an HA system to distribute workloads (either compute, storage, or network) across multiple instances of whatever component handles the workload. This helps to optimize resource usage, maximize performance, minimize response times, and avoid overburdening any one component.
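One of the simplest load balancing strategies is round-robin: each incoming request goes to the next backend in rotation. The toy sketch below illustrates the idea; the backend names are placeholders, and real load balancers layer health checks and weighting on top of this.

```python
# Toy round-robin load balancer: distributes incoming requests across
# redundant backend instances so no single instance is overburdened.
# Backend names are illustrative placeholders.

from itertools import cycle

class RoundRobinBalancer:
    def __init__(self, backends):
        self._backends = cycle(backends)  # endless rotation over backends

    def next_backend(self) -> str:
        """Pick the next backend in rotation for an incoming request."""
        return next(self._backends)

lb = RoundRobinBalancer(["node-1", "node-2", "node-3"])
for request_id in range(6):
    print(f"request {request_id} -> {lb.next_backend()}")
```

In an HA context the load balancer itself must also be redundant; otherwise it becomes the new single point of failure.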
The Main Benefits of HA
Availability and the Five Nines Uptime
As mentioned earlier, an industry standard benchmark for measuring uptime is the five nines, that is, 99.999% uptime.
This metric can be applied to:
- The entire system
- The system components
- The software and processes running within the system
The more 9s an HA system has, the higher its uptime. The aim of an HA system is to provide as little downtime as possible, along with the framework to continue providing the desired services. However, time and cost considerations for implementing and maintaining an HA solution will put practical constraints on the uptime that an organization can offer its users. And of course, this uptime should be defined in a service level agreement (SLA) between the organization and the users.
At a practical level, users experience uptime as the reliability of an organization’s services or applications. This is one of the most significant properties of an HA system. Reliability may go beyond the convenience of an always available service to being essential, particularly when a system provides a critical service, such as air traffic control. In such scenarios, even a millisecond of delay could be the difference between life and death.
If the HA system encounters a load spike, for example, a network traffic increase or other increase in demand for resources, the system should be able to scale to meet those needs in the moment. By integrating scalability features, the system can adapt rapidly to any changes affecting how it processes requests for its services. The HA system should not have to rely on manual intervention, from a systems administrator, for example, to meet increased demand.
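A simple way to picture scaling without manual intervention is a rule that compares measured load against a target and grows the instance count when the target is exceeded. The threshold and counts below are illustrative assumptions, not values from any particular product:

```python
# Sketch of automatic scaling: when average load per instance exceeds a
# target, the system adds instances on its own, with no administrator
# involved. The 0.7 target and instance counts are illustrative.

import math

def desired_instances(current: int, load_per_instance: float,
                      target_load: float = 0.7) -> int:
    """Scale out when average load per instance exceeds the target."""
    if load_per_instance > target_load:
        # Grow proportionally to bring the load back under the target.
        return math.ceil(current * load_per_instance / target_load)
    return current

print(desired_instances(4, 0.9))  # a spike: more instances are needed
print(desired_instances(4, 0.5))  # load is fine: keep the current count
```

A production autoscaler would also smooth measurements over time and scale back down once the spike passes, to avoid paying for idle capacity.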
If an error appears, the system can adjust and compensate while staying up and operating. This form of structure requires forethought and contingency planning. One of the essential characteristics of a high-availability system is anticipating problems and preparing for them, rather than reacting to them after they happen. As a practical example, this could mean configuring a system to seamlessly revert newly deployed software to a previous version should there be an unrecoverable error during deployment.
Comparing High-Availability With Disaster Recovery
High-availability and disaster recovery are terms that are often used together to describe resilient systems, but they should not be confused with each other.
Disaster recovery (DR) does exactly what the name implies: It provides a detailed plan that helps a system recover quickly after it has experienced a failure. If you have an HA system, you might wonder why you would also need disaster recovery.
DR usually is focused on getting a system and its services back online after a severe failure, possibly due to outside uncontrollable circumstances, like a natural disaster. A DR plan could deal with the recovery of an entire region’s operations.
HA is focused, however, on failures that are more likely to happen in the normal day-to-day operations of a system offering services, for example, a failing server or hard disk drive.
Possible HA Implementations
One of the methods for achieving HA is by using multiple application servers (or nodes). If a single server experiences a sudden surge in traffic, it may go down, and requests to it can no longer be served, leading to more downtime. To avoid such scenarios, applications are deployed across several servers using redundant components. If one server fails, the rest can take on the extra load, allowing for high fault tolerance within the system.
Another method for achieving HA is by scaling databases, application stacks, and other services. Replicating data in this way is perhaps the most widely used method to save and protect the data of your users. As any organization knows, losing such vital information can be a costly experience. Physical equipment can likely be replaced. User data is irreplaceable without backups or replication set up in advance.
Finally, an organization can also achieve HA by spreading servers across many geographical locations. Political events, natural disasters, and failures of the electric grid can all lead to a shutdown of your servers, even many servers, when they are clustered in one geographic location. To ensure the safety of the data and complete protection, modern solutions deploy servers worldwide. This further increases their reliability and allows for flexible disaster recovery plans that can bring systems back up more quickly than waiting for local response and mitigation.
LINBIT’s High-Availability Solutions
LINBIT® – the creator of DRBD® block storage replication software – has over 20 years of experience in storage, high-availability systems, and disaster recovery. LINBIT is proud to provide HA services to other notable companies and organizations. Many of them choose LINBIT HA, the enterprise solution built on DRBD, because of the absence of vendor lock-in and the ability for an organization to use commercial off-the-shelf (COTS) system components. Clients pay only for what they use. The data that clients replicate with LINBIT HA solutions is accessible at any moment and can be switched to other platforms. LINBIT HA can handle databases, file servers, storage targets, and application stacks – operating in cloud environments, on premises, or in a hybrid setup – while maintaining a low total cost of ownership (TCO).
If you’re interested in increasing your uptime and providing reliable, scalable services to your customers and users, without sacrificing performance, contact us to learn more about LINBIT HA, or to request an evaluation or quote.