In the first part of this series, we talked about five-nines reliability, the gold standard of reliability for technology companies. Five-nines reliability (99.999%) is a measure of the uptime or availability of the applications, networks, hardware, solutions, and other business technologies on which enterprises depend. We also talked about why we wanted to achieve this coveted reliability standard for our ALLOY® platform and how we achieved it using Active/Active architecture.
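To make the 99.999% figure concrete, the downtime budget implied by an availability target can be worked out directly. The short sketch below is illustrative only: the function name `downtime_budget_minutes` and the 365-day year are assumptions made for the example, not something taken from ALLOY.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a non-leap, 365-day year


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the maximum downtime per year, in minutes, for a given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * unavailable_fraction


# Compare three common availability targets.
for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_budget_minutes(target):.2f} minutes per year")
```

Under these assumptions, five nines leaves roughly 5.26 minutes of downtime per year, compared with about 8.76 hours at three nines.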
Disaster Recovery: Why It is Not Enough for Today’s Businesses
One of the primary reasons we employ Active/Active architecture for ALLOY is that we understand our clients need more than just a Disaster Recovery (DR) plan for their critical business functions.
Enterprises know the importance of business reliability, especially in today’s always-on, always-connected digital landscape, which requires enterprises to be available 24/7. Even a short period of downtime can spell the difference between keeping and losing customers, which is why most enterprises have disaster recovery or contingency plans in case networks, applications, or other business technologies go down. DR allows enterprises to re-establish services, restore network connections, and recover lost data.
Disaster recovery, however, is limited in its ability to support business reliability. While DR can help businesses recover from unexpected downtime by restoring lost information, connections, and services, it cannot prevent the causes of downtime. Achieving five-nines reliability requires both preventing incidents and recovering from them, and disaster recovery only addresses the latter.
In addition, other organizational factors contribute to the insufficiency of disaster recovery for high availability:
- Some organizations misunderstand disaster recovery. Some organizations mistake disaster recovery for recovery from rare and catastrophic events such as fires, earthquakes, and other natural disasters. Thus, they exert greater effort to prepare for these unlikely events instead of preparing recovery and business continuity plans for less severe and more common causes of downtime, such as hardware and software failures, temporary power outages, and human error.
- Disaster recovery takes resources away from business technology resiliency. Some organizations not only fail to focus on the more likely causes of downtime, but also tend to focus more on recovering from downtime than on preventing it. As we discussed in the previous post, organizations must find the right balance between strengthening the resiliency of their business technologies and recovering from disruptions.
- Many disaster recovery plans are outdated. In 2017, a popular airline suffered massive system outages and lost over $100 million in revenues. The outages revealed that its IT infrastructure and disaster recovery plans were outdated and no longer effective. Many organizations are believed to have outdated IT infrastructure and disaster recovery plans. Some companies have failed to include virtualization or containerization technologies in their disaster recovery plans, resulting in much longer restoration cycles.
- Different departments have disparate disaster recovery plans. When disaster recovery became a buzzword in the business world, many different groups within the same organizations developed their own disaster recovery plans. With disaster recovery plans needing to be constantly reviewed, updated, and tested, they often result in chaotic and redundant processes which can do more harm than good. This puts IT departments, business leaders, and users at risk of being overwhelmed by the different risk assessment processes that each disaster recovery plan entails.
Beyond Disaster Recovery: Achieving High Availability through Resiliency
Disaster recovery remains a common approach toward achieving improved business reliability. But five-nines reliability requires enterprises to go beyond disaster recovery and into business technology resiliency. They must not only be able to quickly recover from systems outages and downtimes, but also withstand the impact of any incident that can disrupt business technologies and operations.
Enterprises must evolve from merely responding to downtimes and failures to actually preventing them.
To go beyond disaster recovery and achieve high availability, enterprises must:
- Understand how business technology ecosystems are evolving. It’s important to understand how the different technologies that comprise technology ecosystems are evolving. Today, the ability to choose software and enterprise applications is shifting away from IT departments to the different business units that use these applications. For example, sales and marketing departments have the ability to choose the cloud service that can provide the capabilities or functionalities they need without the intervention of IT departments. This also means the effort to recover lost information and connections lies largely with the service providers rather than the IT departments. To ensure continuity of the business, mission-critical third-party services must be evaluated for their availability and recovery capabilities.
- Value resiliency as a competitive advantage. A stumbling block for business technology resiliency is the lack of regular focus and attention given to it. Instead of working to make their business technologies more resilient to downtime and failures, enterprises rely only on disaster recovery plans, which are not part of their normal daily operations and are therefore more prone to delays and failure when enacted. In today’s digital economy where people are connected 24/7, even a short moment of downtime can lead to significant loss of revenue and customers. High availability is no longer a luxury but a necessity. Enterprises must understand the importance of resiliency and how it impacts their bottom line.
- Take a holistic approach to building resiliency. Enterprises must understand that many factors and activities contribute to building business technology resiliency. Many of these activities are already being done, such as setting up redundancies and having backups in case of failures and downtimes. However, these efforts are often undertaken on an ad hoc basis serving different needs and purposes, leading to implementation gaps, unenforced standards, and duplicated efforts. Taking a holistic approach unifies these activities under a singular goal of building business technology resiliency, making these activities more focused, effective, and efficient.
- Integrate resiliency into every part of the organization. From people, users, and roles to applications and IT architectures, enterprises must embed business technology resiliency in every part of the organization. IT departments and business leaders must empower staff to conduct mock exercises, respond to issues and problems, and proactively find ways to improve the availability of the technologies they rely on. Enterprises must also emphasize that everyone within the organization, from the leaders to the staff members, has a role to play in building business technology resiliency. For critical applications and IT architecture, enterprises must build in advanced fail-safe mechanisms such as Active/Active architecture from the very beginning, so that these systems can withstand the causes of downtime and failure.
How Active/Active Architecture Works
Active/Active architecture, also called dual active, is a high availability cluster configuration implemented to focus on preserving uptime in the event of an incident rather than recovery from an incident. These architectures are designed to allow continuous availability in the event of both minor and major incidents.
- Active/Active provides for fully redundant (and often synchronized) load-balanced systems acting as a single service during normal operations, with the capability for either of the systems to run independently in the event of an incident.
- Active/Active environments should ideally be provisioned in geographically separate locations to allow survival of the system in the case of natural or man-made disasters, with enough capacity available at each location to handle the full expected peak demand. In the case of Active/Active, ‘recovery’ from an incident matters because it returns the overall system to a fully resilient Active/Active state, not because it restores lost availability.
- Active/Active involves the continuous processing of live requests by each location, ensuring that the Active/Active topology is exercised and maintained as part of normal daily business operations and not just in the event of an incident.
- Additional benefits of Active/Active may include more cost-effective utilization of redundant hardware resources for additional processing capacity (particularly when elastic scaling is implemented) as well as reduced average network latency due to geography-based load balancing.
- Properly implemented, an Active/Active architecture can be the foundation enabling the gold standard five-nines (99.999%) availability to maintain business critical operations for the enterprise.
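The behavior described in the points above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not a description of how ALLOY is implemented: the `Site` and `ActiveActiveRouter` classes, the site names, and the capacity figures are all hypothetical, and simple round-robin stands in for a real load balancer.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import List


@dataclass
class Site:
    """One location in a hypothetical Active/Active pair."""
    name: str
    capacity_rps: int      # requests per second this site can serve on its own
    healthy: bool = True


class ActiveActiveRouter:
    """Toy router: every healthy site serves live traffic during normal operation."""

    def __init__(self, sites: List[Site], peak_demand_rps: int) -> None:
        self.sites = sites
        self.peak_demand_rps = peak_demand_rps
        self._rotation = cycle(sites)

    def survives_single_site_loss(self) -> bool:
        # Each site must be able to carry the full expected peak by itself.
        return all(s.capacity_rps >= self.peak_demand_rps for s in self.sites)

    def route(self) -> Site:
        # Round-robin across sites, skipping any that are marked unhealthy.
        for _ in range(len(self.sites)):
            site = next(self._rotation)
            if site.healthy:
                return site
        raise RuntimeError("no healthy site available")


# Hypothetical two-site deployment with made-up capacity numbers.
east = Site("east", capacity_rps=1000)
west = Site("west", capacity_rps=1000)
router = ActiveActiveRouter([east, west], peak_demand_rps=800)

print(router.survives_single_site_loss())        # capacity check for each site
print([router.route().name for _ in range(4)])   # both sites take live requests
west.healthy = False                             # simulate an incident at one site
print([router.route().name for _ in range(4)])   # remaining site takes all requests
```

The point of the sketch is that there is no separate failover path to invoke: the same routing logic that runs every day simply stops selecting the unhealthy site, and bringing that site back is what restores the fully resilient state.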
The Always On, Always Available Enterprise
Going beyond disaster recovery and moving towards building business technology resiliency is one of the keys to becoming an “always on, always available” enterprise. In the final part of this series, we will discuss what this means and how to make business technologies always available for both users and customers.