Can we prevent network outages before they happen?
Major AWS Outage:
The recent AWS outage was so big and lasted so long that it affected thousands of Amazon customers that depend on AWS infrastructure. Some even tweeted that “half of the Internet was down due to the outage.”
Amazon deserves kudos for being transparent about this incident (and many prior incidents) by releasing a detailed analysis of what caused the outage. Amazon explained that the outage was caused by human error: an incorrect command was entered that accidentally took more servers offline than intended. That error snowballed into a larger outage when the index subsystem and the placement subsystem wouldn’t restart quickly.
As big as this outage was, it follows a very predictable pattern that other outages have followed in the past – a complex system that is difficult to maintain often leads to human error.
Cause of Outages:
Any enterprise is vulnerable to a crippling outage similar to that of AWS, primarily for two reasons:
- Increasing Complexity: Today’s IT infrastructure is increasingly complex, with layers of software applications, servers, networks, and storage. It’s not uncommon to see hundreds of different software and hardware components in enterprise data centers, stitched together by new and legacy architectures and managed by fewer and fewer people. Adoption of advancements in virtualization, containerization of apps, software-defined networks, and security layers complicates the IT infrastructure even more.
- Rate of Change: As businesses evolve at a rapid pace, so do the demands placed on IT infrastructure, which requires constant change. An IT department may often make hundreds or even thousands of changes per day to its software applications, servers, networks, and storage. The problem is compounded by imperfect change control documentation. A large number of changes are sometimes documented in a hodgepodge of ticketing systems and spreadsheets or, worse, not documented at all. How would an operator know what changes were made in the past and why? And what would happen if a new change is made or an old change is undone?
These factors put too much stress on human administrators, who have no way of ensuring that their everyday actions don’t cause unintended outages.
It is important for data center administrators to take advantage of new and advanced techniques whenever possible. Ideally, IT and data center administrators would like to verify all the complex layers as one single system (including applications, databases, servers, storage, network) at once against failures before making every change. This remains a holy grail of IT!
IT administrators draw on a wide spectrum of approaches to ensure system resilience, uptime, availability, and disaster recovery. Some of these approaches are:
- Testing Smaller Domains: While all the layers in IT systems can’t be verified at once, perhaps each layer can be verified separately. Each layer is treated as a separate fault domain that is constantly tested. This means testing software applications separately, testing databases separately, testing the network separately, and so on.
- Active Fault Injection: Many cloud providers deploy this technique. A team of people is chartered to actively inject faults into the system every single day and create negative scenarios. Some examples include forcing ungraceful system shutdowns, physically unplugging network connectivity, cutting power to zones in the data center, or even simulating application-level attacks. This approach forces the DevOps team to fine-tune their software and processes.
- In this recent outage, Amazon reported that its index subsystem and placement subsystem had not been restarted for a long time. This caused the faults to multiply. This implies Amazon may not have included this test case in its active fault injection. Many enterprises with similarly long-running systems are at risk.
- Emulation: This approach calls for making a replica of the infrastructure and emulating the change on the replica to understand the behavior. Once the changes are verified on the replica, the changes are made in production systems. Needless to say, this approach is quite expensive. It is also a challenge to keep the production environment and the replica in identical states at all times.
- Formal Verification: Formal Verification methods use mathematical proof to check the integrity, safety, and security of the end-to-end system. They have been used in aerospace, airline, and semiconductor systems for decades. With advancements in computing, it is now possible to bring Formal Verification methods to the networking layer of IT infrastructure to build a mathematical model of the entire network.
- Formal Verification methods can perform exhaustive mathematical analysis of the entire network’s state against a set of user intentions in real time, without emulation. They can allow evaluation of a broad range of factors. Mathematical modeling can also allow “what-if” scenario analysis of proposed changes, which, for example, could have prevented a 2011 Amazon outage that was due to a router configuration error.
- Vendor Specific Tools: Many vendors create tools to test and verify their products. For example, Microsoft’s Project Springfield is a tool for rooting out potential security vulnerabilities in software, including Windows, Office, and other Microsoft products. Springfield grew from research in formal methods – a broad area that also includes verification tools. While Springfield is targeted at Microsoft software applications, there are other similar tools for a diverse set of software applications from different vendors.
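To make the active fault injection approach from the list above more concrete, here is a minimal sketch in Python. It is a toy simulation, not any cloud provider's tooling: the `Cluster` class, the `inject_faults` function, and the "service is up while at least `min_healthy` replicas are healthy" rule are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical replicated service: 'up' while enough replicas are healthy."""
    replicas: dict = field(default_factory=dict)  # replica name -> healthy?
    min_healthy: int = 2

    def is_available(self) -> bool:
        return sum(self.replicas.values()) >= self.min_healthy

    def kill(self, name: str) -> None:
        self.replicas[name] = False

    def restart(self, name: str) -> None:
        self.replicas[name] = True


def inject_faults(cluster, faults_per_round, rounds, seed=0):
    """Kill random replicas each round; return the rounds where the service went down."""
    rng = random.Random(seed)  # seeded so a failing scenario can be replayed
    failed_rounds = []
    for round_no in range(rounds):
        victims = rng.sample(sorted(cluster.replicas), faults_per_round)
        for name in victims:
            cluster.kill(name)
        if not cluster.is_available():
            failed_rounds.append(round_no)
        for name in victims:  # recover before the next round
            cluster.restart(name)
    return failed_rounds


if __name__ == "__main__":
    cluster = Cluster({"a": True, "b": True, "c": True}, min_healthy=2)
    # Three replicas tolerate one failure at a time, so no round should fail.
    print(inject_faults(cluster, faults_per_round=1, rounds=5))
```

In a real environment the `kill` step would be an actual shutdown or cable pull, and the availability check would be a probe against the live service. The point of the sketch is the loop: inject, observe, recover, repeat.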
A combination of these approaches can help IT administrators and operations teams proactively avoid problems. By integrating continuous verification into the operations workflow, IT staff can be confident in their changes, without worrying about unintended outages and vulnerabilities.
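As a rough illustration of how a “what-if” check could sit in an operations workflow, the sketch below models a network as a set of directed links and checks a proposed change against reachability intents before anything is touched in production. This is a deliberately small toy, not Veriflow's method or a real formal verification engine; the link model, the intent format `(src, dst, must_reach)`, and the function names are all assumptions.

```python
from collections import deque


def reachable(links, src, dst):
    """Breadth-first search over directed (from, to) links."""
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for a, b in links:
            if a == node and b not in seen:
                seen.add(b)
                queue.append(b)
    return False


def violated_intents(links, intents):
    """Return every (src, dst, must_reach) intent that the topology breaks."""
    return [i for i in intents if reachable(links, i[0], i[1]) != i[2]]


def what_if(links, removed, added, intents):
    """Evaluate a proposed change on a copy of the model, never on the live network."""
    proposed = (set(links) - set(removed)) | set(added)
    return violated_intents(proposed, intents)


# Hypothetical topology: web servers reach the database through two core switches.
links = {
    ("web", "core1"), ("web", "core2"),
    ("core1", "db"), ("core2", "db"),
    ("guest", "internet"),
}
intents = [
    ("web", "db", True),     # web must always reach the database
    ("guest", "db", False),  # guest network must never reach the database
]

if __name__ == "__main__":
    # Removing one redundant uplink is safe; an empty list means no intent is violated.
    print(what_if(links, removed={("web", "core1")}, added=set(), intents=intents))
    # Accidentally bridging the guest network to a core switch is flagged.
    print(what_if(links, removed=set(), added={("guest", "core1")}, intents=intents))
```

A change would be applied only when `what_if` returns an empty list. Production tools model far more than link reachability (routing, ACLs, VLANs, and so on), but the workflow is the same: state the intent, model the change, check before deploying.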
Read our white paper to learn how Veriflow is leveraging Formal Methods to eliminate network outages and vulnerabilities.