Preface
Enterprise software systems have become more sophisticated, relying heavily on distributed components like cloud services and microservices. Chaos engineering plays a vital role today in creating resilient distributed systems.
This article walks you through the concept of chaos engineering, its importance, and its core principles. I'll also cover the tools widely used for chaos engineering today, along with the benefits and challenges of the practice.
What is Chaos Engineering?
Chaos engineering is the practice of intentionally injecting faults into a system to test its resilience. The goal is to identify potential failure points and correct them before they cause an actual outage or other disruption. By proactively testing how a system responds under stress, we can identify and fix failures before they end up in the news. Chaos engineering can significantly increase confidence in the resilience of distributed systems, particularly in unforeseen conditions. We literally "break things on purpose" to learn how to build more resilient systems.
Why would we break things on purpose?
Think of a vaccine, such as a flu shot or a COVID-19 vaccine, where we are injected with a small amount of a potentially harmful foreign body in order to build resistance and prevent illness. Chaos Engineering is a tool we use to build similar immunity in our technical systems by injecting harm, such as latency, CPU failure, or network black holes, in order to find and mitigate potential weaknesses. According to the chaos engineering report, the most common outcomes of Chaos Engineering are increased availability, lower mean time to resolution, lower mean time to detection, fewer bugs shipped to product, and fewer outages. Teams who frequently run Chaos Engineering experiments are more likely to have >99.9% availability.
What's the role of Chaos Engineering in distributed systems?
Distributed systems are inherently more complex than monolithic systems, so it's hard to predict all the ways they might fail. Programmers new to distributed applications invariably fall for the eight fallacies of distributed systems, a set of false assumptions listed below.
Fallacies of Distributed Systems:
- The network is reliable
- Latency is zero
- Bandwidth is infinite
- The network is secure
- Topology doesn't change
- There is one administrator
- Transport cost is zero
- The network is homogeneous
Many of these fallacies drive the design of Chaos Engineering experiments such as "packet-loss attacks" and "latency attacks". For example, network outages can cause a range of failures for applications that severely impact customers. Applications may stall while they wait endlessly for a packet. Applications may permanently consume memory or other Linux system resources. And even after a network outage has passed, applications may fail to retry stalled operations, or may retry too aggressively. Applications may even require a manual restart. Each of these scenarios needs to be tested and prepared for.
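To make the retry problem concrete, here is a minimal Python sketch of the behavior a latency or packet-loss experiment should confirm: a call that gives up after a bounded number of attempts and backs off between retries instead of hammering a recovering network. The function names, the default limits, and the `make_flaky` helper are my own assumptions for illustration, and the sketch assumes the wrapped operation already enforces its own timeout by raising `TimeoutError`.

```python
import random
import time


def call_with_retries(operation, max_attempts=4, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep, rng=random.random):
    """Call operation(), retrying network-style failures a bounded number of times.

    The wait before each retry is a random fraction of an exponentially growing
    cap, so many clients recovering at once do not retry in lockstep.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except (TimeoutError, ConnectionError):
            if attempt == max_attempts:
                raise  # give up instead of retrying forever
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(rng() * cap)


def make_flaky(failures):
    """Hypothetical dependency that times out `failures` times, then succeeds."""
    state = {"calls": 0}

    def operation():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise TimeoutError("simulated network outage")
        return "ok"

    return operation, state
```

Passing `sleep` and `rng` as parameters is a design choice that lets a test replace them with deterministic stand-ins, so the retry policy can be checked without real waiting.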
Benefits of Chaos Engineering
- Customer: The increased availability and durability of service means fewer outages disrupt their day-to-day lives.
- Business: Chaos Engineering can help prevent extremely large losses in revenue and maintenance costs, create happier and more engaged engineers, improve on-call training for engineering teams, and improve incident management for the entire company.
- Technical: The insights from chaos experiments can mean a reduction in incidents, a reduction in on-call burden, increased understanding of system failure modes, improved system design, and a faster mean time to detection for incidents.
Organizations that benefit from Chaos Engineering
Organizations such as Amazon, Netflix, Microsoft, and National Australia Bank use chaos engineering to better understand the behavior and flaws of their internal systems. Chaos engineering can also help organizations enhance the velocity of their continuous integration and continuous delivery (CI/CD) pipelines. Incorporating chaos engineering into CI/CD enables organizations to automate continuous experiments while controlling their potential impact.
Types of Chaos Engineering Experiments
DevOps teams have several options for running chaos engineering experiments to test various system processes.
- Latency injection: DevOps teams intentionally create scenarios that emulate a slow or failing network connection. This includes the introduction of network delays or slower response times.
- Fault injection: This involves purposefully introducing errors into the system to determine how it affects other dependent systems and whether it interrupts services. Examples of fault injections include inducing disk failures, terminating processes, shutting down a host or introducing power or temperature increases. Fault injections can help organizations identify any single points of failure, which can cause the entire system to fail if something happens to them.
- Load generation: This relates to intentionally stressing the system by sending significant traffic levels well beyond normal operations. This helps the site reliability engineers (SREs) to understand any bottlenecks in the system, which in turn allows them to build more scalable systems.
- Canary testing: This involves releasing a new product or feature to a small group of users. That way, any glitches or bugs will only affect a percentage of visitors, leaving the rest of the audience to access the existing website experience.
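The first two experiment types can be sketched at the application level with a small Python decorator that adds delay and raises errors around a call. This is only an illustration under my own assumptions: the `chaos` decorator, the `InjectedFault` exception, and the `fetch_profile` function are hypothetical names, and real tools usually inject faults at the network or host level rather than inside application code.

```python
import functools
import random
import time


class InjectedFault(RuntimeError):
    """Raised in place of a real dependency failure."""


def chaos(latency_s=0.0, failure_rate=0.0, rng=random.random, sleep=time.sleep):
    """Wrap a function so each call may be delayed and may fail on purpose."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                sleep(latency_s)  # latency injection: emulate a slow network
            if rng() < failure_rate:
                # fault injection: emulate a failing dependency
                raise InjectedFault(f"injected fault in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@chaos(latency_s=0.05, failure_rate=0.2)
def fetch_profile(user_id):
    # Hypothetical downstream call used only for illustration.
    return {"id": user_id, "name": "example"}
```

Callers of `fetch_profile` now see roughly one failure in five plus 50 ms of extra delay, which is enough to check whether timeouts, fallbacks, and alerts behave as expected.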
Principles of Chaos Engineering
Chaos engineering defines general principles to follow when designing and conducting experiments.
Define the system's steady state
How does the system behave when it is steady? These definitions set the baseline for the experiments. The definition of steady state includes measurable outcomes defined using key performance indicators (KPIs).
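As an illustration, a steady state can be written down as thresholds on a couple of KPIs and checked against measured request samples. The Python sketch below assumes two KPIs of my own choosing, error rate and 99th-percentile latency; the class and function names are hypothetical and the thresholds would come from your own baseline measurements.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyState:
    max_error_rate: float       # allowed fraction of failed requests
    max_p99_latency_ms: float   # allowed 99th-percentile latency


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples):
    """samples is a list of (latency_ms, succeeded) tuples."""
    if not samples:
        raise ValueError("need at least one sample")
    errors = sum(1 for _, succeeded in samples if not succeeded)
    return {
        "error_rate": errors / len(samples),
        "p99_latency_ms": percentile([latency for latency, _ in samples], 99),
    }


def is_steady(samples, baseline):
    """True when the measured KPIs stay within the baseline thresholds."""
    summary = summarize(samples)
    return (summary["error_rate"] <= baseline.max_error_rate
            and summary["p99_latency_ms"] <= baseline.max_p99_latency_ms)
```

The same `is_steady` check can be run before an experiment, to confirm the system is healthy, and again during it, to see whether the fault pushed the system out of its steady state.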
Create the hypothesis
A chaos experiment needs a hypothesis on how the system will behave if a chaotic situation arises in a production environment. It should be based on the established baselines and knowledge of the behavior and weaknesses of the system. When creating a hypothesis, ask "what if" questions or create statements on how the system should behave.
Experiment by changing real-world conditions
Consider real-world scenarios or events that can cause the system to deviate from its steady state.
Run automated experiments in production environments
Earlier environments such as development, staging, and pre-production do not fully simulate the actual production system. That's why chaos engineering experiments run in actual production systems under controlled conditions.
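The principles so far fit together as a small loop: confirm the steady state, inject a real-world condition, check the hypothesis, and always undo the fault. The Python sketch below shows that loop against a stand-in system; `run_experiment`, `FakeService`, and the "still available with one replica down" hypothesis are my own assumptions for illustration, not part of any specific tool.

```python
def run_experiment(measure, inject, rollback, hypothesis):
    """Run one chaos experiment and report whether the hypothesis held."""
    baseline = measure()
    if not hypothesis(baseline):
        # Never start an experiment on a system that is already unhealthy.
        return {"status": "skipped", "reason": "system not in steady state"}
    inject()
    try:
        observed = measure()
    finally:
        rollback()  # always remove the fault, even if measuring fails
    return {
        "status": "passed" if hypothesis(observed) else "failed",
        "baseline": baseline,
        "observed": observed,
    }


class FakeService:
    """Hypothetical stand-in for a real system with two replicas."""

    def __init__(self):
        self.healthy_replicas = 2

    def measure(self):
        return {"available": self.healthy_replicas > 0}

    def kill_one_replica(self):
        self.healthy_replicas -= 1

    def restore(self):
        self.healthy_replicas = 2


service = FakeService()
result = run_experiment(service.measure, service.kill_one_replica,
                        service.restore, hypothesis=lambda m: m["available"])
print(result["status"])  # passed
```

Because the loop is plain code, it can be scheduled or run from a pipeline, which is what makes the experiments automated rather than one-off drills.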
Minimize the blast radius
Since chaos engineering experiments are conducted within real production environments, it is crucial to minimize any potential performance degradation or disruptions that customers may experience during their execution. The blast radius should be determined using metrics.
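One common way to keep the blast radius small is to expose only a fixed percentage of targets to the fault and to stop as soon as a metric crosses a limit. The Python sketch below is a minimal illustration of that idea; the function names, the hashing scheme, and the abort threshold are assumptions of mine rather than the behavior of any particular tool.

```python
import hashlib


def in_blast_radius(target_id, experiment, percent):
    """Deterministically place about `percent`% of targets inside the experiment.

    Hashing the experiment name with the target id gives each target a stable
    bucket from 0 to 9999, so the same targets are affected on every check and
    raising `percent` only ever adds targets.
    """
    digest = hashlib.sha256(f"{experiment}:{target_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100


def should_abort(error_rate, abort_threshold):
    """Stop the experiment when the observed error rate exceeds the limit."""
    return error_rate > abort_threshold
```

A typical rollout would start at a very small percentage, watch the metrics, and widen the radius only while `should_abort` stays false.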
Chaos Engineering Tools
Chaos engineering tools complement traditional testing methods as a relatively new way to establish confidence in systems. Software platforms will inevitably fail, so it's critical to pinpoint weaknesses and fix them before they negatively impact business operations. By testing assumptions through chaos experiments, these tools can provide a roadmap for uncovering infrastructure failures or unresponsive systems.
Chaos Mesh
Chaos Mesh is an open-source, cloud-native tool. Using various fault simulations, Chaos Mesh helps organizations find system abnormalities that may occur during the development, testing, and production stages. It comes with a web user interface known as the Chaos Dashboard and can be added to DevOps workflows to spot potential areas of weakness and timeouts. To ensure resiliency, Chaos Mesh runs chaos experiments within Kubernetes environments and supports many fault-simulation scenarios for distributed systems. It can deploy attacks that test network latency, system time manipulation, resource utilization, and more. The Chaos Dashboard can be used to modify and manage experiments within set timeframes.
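Chaos Mesh experiments are declared as Kubernetes custom resources. As a rough sketch, the Python snippet below builds a network-latency experiment as a dictionary and prints it as JSON, which `kubectl apply -f` accepts in place of YAML. The helper function and the `checkout` and `shop` names are hypothetical, and the field names follow my understanding of the `NetworkChaos` resource, so please verify them against the documentation for your installed Chaos Mesh version.

```python
import json


def network_delay_manifest(name, namespace, app_label,
                           latency="200ms", duration="60s"):
    """Build a NetworkChaos manifest that delays traffic for one matching pod."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "NetworkChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "delay",
            "mode": "one",  # affect a single pod among those selected
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": {"app": app_label},
            },
            "delay": {"latency": latency},
            "duration": duration,  # the fault is removed after this period
        },
    }


# "checkout-delay", "shop", and "checkout" are hypothetical names.
manifest = network_delay_manifest("checkout-delay", "shop", "checkout")
print(json.dumps(manifest, indent=2))
```

Selecting a single pod with a short duration keeps the first run of such an experiment small and self-limiting.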
Cost
As an open-source chaos tool, Chaos Mesh is free to use without a commercial license.
Should I use Chaos Mesh?
Chaos Mesh offers an open-source technology that can be used in Kubernetes to design and manage automated experiments. However, be wary of certain limitations to the technology. Predicting failures can be a cumbersome task due to the complexities in cloud operations. Unreliable functions and outages can result in a downgraded reputation and a loss of consumer trust.
Chaos Monkey
Chaos Monkey is an open-source chaos engineering tool originally created by Netflix developers. It was developed to help test their system reliability and resiliency after moving to the AWS cloud. The software works by launching continuous, unpredictable attacks. Its basic approach is to terminate one or more virtual machine instances. Chaos Monkey's configurability allows for easy scheduling and close monitoring. The technology is easily replicable but can cause headaches if users are unprepared for the aftermath of attacks. Users can check for outages prior to deployment but must be able to write and edit custom Go code.
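The core idea of random instance termination can be illustrated in a few lines of Python. This is a simplified sketch of the selection step only, not Chaos Monkey's actual implementation: the `pick_victims` function, the group names, and the rule of sparing a group's last instance are my own assumptions.

```python
import random


def pick_victims(groups, rng, exempt=frozenset()):
    """Choose at most one instance to terminate from each eligible group.

    groups maps a group name to a list of instance ids. Exempt groups are
    skipped, and so are groups with fewer than two instances, so this sketch
    never removes a group's last instance.
    """
    victims = {}
    for group, instances in groups.items():
        if group in exempt or len(instances) < 2:
            continue
        victims[group] = rng.choice(sorted(instances))
    return victims


# Hypothetical inventory used only for illustration.
inventory = {
    "api": ["i-1", "i-2", "i-3"],
    "db": ["i-9"],
    "web": ["i-4", "i-5"],
}
print(pick_victims(inventory, random.Random(), exempt={"web"}))
```

Passing in the random number generator makes the choice reproducible in tests by seeding it, while production runs stay unpredictable.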
Cost
As open-source software, Chaos Monkey is free to use without a commercial license.
Should I use Chaos Monkey?
Chaos Monkey is a popular chaos engineering tool. While it may have revolutionized the open-source community, it is far less practical today. Chaos Monkey is useful to an extent, but users must take into account its limitations and arduous deployment process.
Gremlin
Gremlin is the first hosted chaos engineering platform designed to improve web-based reliability. Offered as software-as-a-service (SaaS), Gremlin is able to test system resiliency using multiple attack types. Users provide system inputs to determine which type of attack will provide the best results. Tests can be run in conjunction with one another to enable comprehensive infrastructure assessments.
Cost
Gremlin's pricing has changed over the years, ranging from per-agent pricing to attacks per target, to support the frequency of testing a team requires.
Should I use Gremlin?
As the world's first managed enterprise chaos engineering technology, Gremlin provides users with the ability to launch dozens of attack vectors, stop and roll back attacks, and improve system reliability. Designed with the mission of creating a sustainable and reliable internet, Gremlin pinpoints software weaknesses to minimize revenue loss and negative systematic impacts.
Conclusion
In conclusion, chaos engineering continues to play a pivotal role in ensuring the resilience and reliability of modern software systems, especially distributed systems, offering organizations valuable insights and tools to proactively address potential failures and disruptions.
NOTE: I'm constantly delighted to receive feedback. Whether you spot an error, have a suggestion for improvement, or just want to share your thoughts, please don't hesitate to comment/reach out. I truly value connecting with readers!