Five-nines, or 99.999% reliability, is the gold standard for software and hardware vendors, cloud service providers, data centers, and just about everyone else in the tech industry. It’s no surprise that we wanted to achieve this coveted standard for our ALLOY® platform, which solves business-critical data integration and data management problems for Fortune 500 companies.
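To make the target concrete, the arithmetic behind a "nines" figure can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and it assumes availability is measured over a 365-day year with no allowance for planned maintenance windows.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the maximum downtime per year, in minutes, for an availability target."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * MINUTES_PER_YEAR


# Compare three common availability targets.
for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_budget_minutes(target):.2f} minutes per year")
```

Under these assumptions, five-nines leaves a budget of roughly 5.26 minutes of downtime per year, compared with about 52.56 minutes at four-nines.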
Achieving five-nines reliability for ALLOY was a journey that started almost three years ago. During an internal developer summit in early 2015 at Liaison’s Seattle office, our CTO challenged the team to develop and deploy every microservice in ALLOY with an underlying commitment to five-nines reliability. These were the very early days of the ALLOY platform, with only a dozen or so microservices deployed in a single data center. What emerged was a strategy for ALLOY Active-Active and an execution plan that shaped most of our ALLOY operations roadmap for the years that followed. Today, Active-Active is a reality for ALLOY across multiple data centers in Europe, and later this year we will roll it out for our US operations.
Our path to Active-Active was littered with engineering and operational challenges, with assumptions that changed with every test, and with the need to balance this work against the demands of new customers that required launching new microservices. ALLOY has grown from a dozen microservices in 2015 to over 200 deployed today, and in retrospect our execution plan looks very different from what we initially conceptualized.
Why do it?
Our business case for developing this capability was as follows:
- Dramatically reduced customer impact during both major and minor incidents, because any impacted traffic is quickly re-routed to the alternate data center until the incident is resolved
- In-flight traffic impacted by a localized data center issue can be immediately recovered and reprocessed in the alternate data center
- Zero downtime for maintenance, achieved by rolling out changes one microservice and one data center at a time
- Geographic load balancing that reduces latency by automatically routing each workflow to the data center closest to its origin
Active/Active Mission Control Console
In a nutshell, Active-Active, also called dual active, describes a set of separate, independent nodes that run the same processes and services simultaneously. If one node fails, the others continue running those processes and services, which keeps applications and networks available. A node can be a network, an application that powers other networks and applications, or a data center or cloud provider that supports the services, processes, and applications of an enterprise. A second objective of the Active-Active architecture is load balancing, which keeps any single node from being overloaded with tasks and so reduces the chance of failure.
The United States implementation runs on two data centers located in Phoenix and Atlanta, while the European implementation runs on two data centers located in London and Helsinki. Traffic is switched at runtime, with error handling and failover coordinated and automated. An Active-Active mission control console gives a single view of traffic routes, node health, and global DNS, which provides insight into traffic switching and helps diagnose platform issues.
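The routing idea described above (send traffic to the closest data center, and fail over to the other one when it is unhealthy) can be sketched as follows. This is a simplified illustration, not ALLOY's implementation: the `DataCenter` class, the `choose_data_center` function, and the health and latency values are all hypothetical, and a real deployment would typically make this decision through global DNS and health checks rather than application code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DataCenter:
    name: str
    healthy: bool      # hypothetical result of a health check
    latency_ms: float  # hypothetical latency measured from the workflow origin


def choose_data_center(candidates: List[DataCenter]) -> Optional[DataCenter]:
    """Pick the healthy data center with the lowest latency, or None if all are down."""
    healthy = [dc for dc in candidates if dc.healthy]
    if not healthy:
        return None
    return min(healthy, key=lambda dc: dc.latency_ms)


# The European pair named in the text; health and latency values are made up.
europe = [
    DataCenter("London", healthy=True, latency_ms=12.0),
    DataCenter("Helsinki", healthy=True, latency_ms=35.0),
]
print(choose_data_center(europe).name)  # London (lowest latency while both are healthy)

europe[0].healthy = False  # simulate a localized incident in London
print(choose_data_center(europe).name)  # Helsinki (traffic fails over)
```

Because both sites are active at all times, failover in this sketch is just a change in the routing decision; no standby site has to be started first.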
In Part 2 of this series, we will take a deep dive into the execution plan and discuss some key technology choices and assumptions that were made by our development team.