Understanding Chaos Engineering

Preface

Enterprise software systems have become more sophisticated, relying heavily on distributed components like cloud services and microservices. Chaos engineering plays a vital role today in creating resilient distributed systems.

This article walks you through the concept of chaos engineering, its importance, and its core principles. Additionally, I’ll explain the tools widely used today for chaos engineering and delve into the benefits and challenges associated with chaos engineering practices.

What is Chaos Engineering?

Chaos engineering is the practice of intentionally injecting faults into a system to test its resilience. The goal is to identify potential failure points and correct them before they cause an actual outage or other disruption. By proactively testing how a system responds under stress, we can identify and fix failures before they end up in the news. Chaos engineering can significantly increase confidence in the resilience of distributed systems, particularly in unforeseen conditions. We literally “break things on purpose” to learn how to build more resilient systems.

Why would we break things on purpose?

Think of a vaccine, such as a flu shot or a COVID-19 vaccine, where we inject ourselves with a small amount of a potentially harmful foreign body in order to build resistance and prevent illness. Chaos Engineering is a tool we use to build that kind of immunity in our technical systems: we inject harm such as latency, CPU failure, or network black holes in order to find and mitigate potential weaknesses. According to the chaos engineering report, the most common outcomes of Chaos Engineering are increased availability, lower mean time to resolution, lower mean time to detection, fewer bugs shipped to production, and fewer outages. Teams who frequently run Chaos Engineering experiments are more likely to have >99.9% availability.

What's the role of Chaos Engineering in distributed systems?

Distributed systems are inherently more complex than monolithic systems, so it’s hard to predict all the ways they might fail. The eight fallacies of distributed systems are false assumptions that programmers new to distributed applications invariably make.

Fallacies of Distributed Systems:

The network is reliable
Latency is zero
Bandwidth is infinite
The network is secure
Topology doesn't change
There is one administrator
Transport cost is zero
The network is homogeneous

Many of these fallacies drive the design of Chaos Engineering experiments such as “packet-loss attacks” and “latency attacks”. For example, network outages can cause a range of failures for applications that severely impact customers. Applications may stall while they wait endlessly for a packet. Applications may permanently consume memory or other Linux system resources. And even after a network outage has passed, applications may fail to retry stalled operations, or may retry too aggressively. Applications may even require a manual restart. Each of these failure modes needs to be tested and prepared for.
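To make two of these failure modes concrete (never retrying and retrying too aggressively), here is a minimal sketch of a bounded retry with exponential backoff and jitter. The names `TransientNetworkError`, `call_with_retries`, and `make_flaky_operation` are hypothetical and exist only for this illustration; a real service would wrap its own client calls and timeouts.

```python
import random
import time


class TransientNetworkError(Exception):
    """Stands in for a timeout or dropped connection (hypothetical)."""


def call_with_retries(operation, max_attempts=4, base_delay=0.1,
                      max_delay=2.0, sleep=time.sleep):
    """Run `operation`, retrying transient failures with capped backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientNetworkError:
            if attempt == max_attempts:
                raise  # give up instead of retrying forever
            # Exponential backoff with full jitter, capped at max_delay,
            # so that many clients do not retry in lockstep.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))


def make_flaky_operation(failures_before_success):
    """Build a fake dependency that fails a fixed number of times first."""
    state = {"calls": 0}

    def operation():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise TransientNetworkError("simulated packet loss")
        return "ok"

    return operation, state
```

A packet-loss experiment is a good way to check that the real equivalent of `call_with_retries` recovers once the network does, and that it stops after a bounded number of attempts.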

Benefits of Chaos Engineering

Customer: Increased availability and durability of the service mean fewer outages that disrupt their day-to-day lives.
Business: Chaos Engineering can help prevent extremely large losses in revenue and maintenance costs, create happier and more engaged engineers, improve on-call training for engineering teams, and improve incident management for the entire company.
Technical: The insights from chaos experiments can mean a reduction in incidents, a reduction in on-call burden, increased understanding of system failure modes, improved system design, and faster mean time to detection for incidents.

Organizations that benefit from Chaos Engineering

Organizations such as Amazon, Netflix, Microsoft, and National Australia Bank use chaos engineering to better understand the behavior and flaws of their internal systems. Chaos engineering can also help organizations to enhance the velocity of their continuous integration and continuous delivery (CI/CD) pipelines. Incorporating chaos engineering into CI/CD enables organizations to automate continuous experiments while controlling their potential impact.

Types of Chaos Engineering Experiments

DevOps teams have several options for running chaos engineering experiments to test various system processes.

Latency injection: DevOps teams intentionally create scenarios that emulate a slow or failing network connection. This includes the introduction of network delays or slower response times.
Fault injection: This involves purposefully introducing errors into the system to determine how it affects other dependent systems and whether it interrupts services. Examples of fault injections include inducing disk failures, terminating processes, shutting down a host or introducing power or temperature increases. Fault injections can help organizations identify any single points of failure, which can cause the entire system to fail if something happens to them.
Load generation: This relates to intentionally stressing the system by sending significant traffic levels well beyond normal operations. This helps the site reliability engineers (SREs) to understand any bottlenecks in the system, which in turn allows them to build more scalable systems.
Canary testing: This involves releasing a new product or feature to a small group of users. That way, any glitches or bugs will only affect a percentage of visitors, leaving the rest of the audience to access the existing website experience.
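As a rough illustration of latency injection and fault injection at the application level, the sketch below wraps a function so that calls are delayed and fail with a configurable probability. It is not how any particular tool works; `inject_chaos`, `InjectedFault`, and `get_price_with_fallback` are names I made up for this example.

```python
import functools
import random
import time


class InjectedFault(Exception):
    """Raised in place of a real dependency failure (hypothetical)."""


def inject_chaos(latency_seconds=0.0, failure_rate=0.0,
                 rng=random.random, sleep=time.sleep):
    """Wrap a function so calls are delayed and sometimes fail."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if latency_seconds > 0:
                sleep(latency_seconds)  # latency injection
            if rng() < failure_rate:
                # fault injection: fail before the real call happens
                raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


def get_price_with_fallback(fetch_price, default=0.0):
    """Caller under test: it should degrade gracefully, not crash."""
    try:
        return fetch_price()
    except InjectedFault:
        return default
```

Decorating a dependency call with `@inject_chaos(latency_seconds=0.2, failure_rate=0.1)` in a test environment shows quickly whether callers such as `get_price_with_fallback` have sensible timeouts and fallbacks.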

Principles of Chaos Engineering

Chaos engineering defines general principles to follow when designing and conducting experiments.

Define the system’s steady-state

How does the system behave when it is steady? These definitions set the baseline for the experiments. The definition of steady state includes measurable outcomes defined using key performance indicators. Some examples of such KPIs are:

“The system latency is below 300ms.”

“The error rate is below 3%.”
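The two example KPIs above can be turned into an automated check. This is a minimal sketch under a few assumptions of mine: latency is measured as the 95th percentile (the KPI itself does not say which statistic to use), requests are available as simple records, and the names `Request`, `p95`, and `is_steady_state` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    latency_ms: float
    ok: bool


def p95(values: List[float]) -> float:
    """95th percentile using the nearest-rank method."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[max(rank, 1) - 1]


def is_steady_state(requests: List[Request],
                    max_latency_ms: float = 300.0,
                    max_error_rate: float = 0.03) -> bool:
    """True when both KPIs hold for the observed window."""
    if not requests:
        return False  # no traffic means we cannot claim a steady state
    error_rate = sum(1 for r in requests if not r.ok) / len(requests)
    latency = p95([r.latency_ms for r in requests])
    return latency < max_latency_ms and error_rate < max_error_rate
```

In practice the request window would come from your monitoring system rather than an in-memory list.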

Create the hypothesis

A chaos experiment needs a hypothesis on how the system will behave if a chaotic situation arises in a production environment. It should be based on the established baselines and knowledge of the behavior and weaknesses of the system. When creating a hypothesis, ask ‘what if’ questions or create statements on how the system should behave.

Examples include:

“If we double the load, the system can handle it without issue.”

“An increase in request latency will not impact the user experience.”

“If the primary database is down, the system will automatically fail over to the secondary database with minimal downtime.”
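A hypothesis like the database one can be written as an executable check. The sketch below uses toy in-memory stand-ins (`Database`, `FailoverClient`, and `failover_hypothesis_holds` are all hypothetical names) just to show the shape: confirm the baseline, inject the failure, check the expected behavior, then restore.

```python
class Database:
    """Toy in-memory database that can be switched off."""

    def __init__(self, name):
        self.name = name
        self.available = True
        self.rows = {"user:1": "Ada"}

    def read(self, key):
        if not self.available:
            raise ConnectionError(f"{self.name} is down")
        return self.rows[key]


class FailoverClient:
    """Reads from the primary and falls back to the secondary."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary

    def read(self, key):
        try:
            return self.primary.read(key)
        except ConnectionError:
            return self.secondary.read(key)


def failover_hypothesis_holds():
    primary, secondary = Database("primary"), Database("secondary")
    client = FailoverClient(primary, secondary)
    assert client.read("user:1") == "Ada"  # baseline before the fault
    primary.available = False              # inject: primary goes down
    try:
        return client.read("user:1") == "Ada"
    finally:
        primary.available = True           # always restore the primary
```

If the function returns `False` or raises, the hypothesis is rejected and you have found something worth fixing.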

Experiment by changing real-world conditions

Consider real-world scenarios or events that can deviate from the steady state.

For example:

Events resulting in hardware and software failures.

High-network latency and error rates.

Network traffic spikes.

Varying these conditions helps identify vulnerabilities and ensures that the system can handle different scenarios.

Run automated experiments in production environments

Pre-production environments such as development and staging do not fully reproduce the behavior of actual production systems. That’s why chaos engineering experiments are run in actual production systems under controlled conditions.
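“Controlled conditions” usually means the experiment refuses to start from an unhealthy baseline, aborts as soon as the steady state is violated, and always rolls the fault back. Here is a minimal sketch of that control loop; `run_experiment` and its callback parameters are hypothetical, and a real runner would also wait between observations.

```python
def run_experiment(check_steady_state, inject, rollback, observations=5):
    """Run one chaos experiment with an abort condition and rollback.

    Returns "skipped", "hypothesis_held", or "hypothesis_rejected".
    """
    if not check_steady_state():
        return "skipped"  # never start from an unhealthy baseline
    inject()
    try:
        for _ in range(observations):
            if not check_steady_state():
                return "hypothesis_rejected"  # abort early
        return "hypothesis_held"
    finally:
        rollback()  # always undo the fault, even on abort or error
```

The `try`/`finally` is the important design choice here: whatever happens during observation, the injected fault is removed.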

Minimizing the blast radius

Since chaos engineering experiments are conducted within real production environments, it is crucial to minimize any potential performance degradation or disruptions that customers may experience during their execution. The blast radius should be defined and measured using metrics such as:

The number of affected users

Impacted locations

Workload quantities

Therefore, it is advisable to schedule these experiments during non-peak times and ensure the availability of backup systems for restorations.
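One common way to cap the number of affected users is to put only a small, stable percentage of them into the experiment. The sketch below is an illustration under my own assumptions rather than a feature of any specific tool: `in_blast_radius` is a hypothetical helper that hashes the user ID so the same user always gets the same answer for a given experiment.

```python
import hashlib


def in_blast_radius(user_id: str, experiment: str, percent: float) -> bool:
    """Deterministically select roughly `percent`% of users."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    # Each bucket is 0.01% of users, so percent=1.0 selects buckets 0..99.
    return bucket < percent * 100


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    affected = sum(in_blast_radius(u, "checkout-latency", 1.0) for u in users)
    print(f"{affected} of {len(users)} users are in the experiment")
```

Including the experiment name in the hash means different experiments hit different slices of users, so the same customers are not affected every time.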

Chaos Engineering Tools

Chaos engineering tools are a relatively new complement to traditional testing methods used to establish confidence in systems. Software platforms will inevitably fail, and therefore it's critical to pinpoint weaknesses and fix them before they negatively impact business operations. By testing hypotheses through chaos experiments, chaos engineering tools can provide a roadmap for uncovering infrastructural failures or unresponsive systems.

Chaos Mesh

Pros

Easy-to-use functionality and automation.

The user interface supports many different configurations.

Experiments can be paused and resumed

Cons

Experiments run indefinitely as there is no ability to schedule attacks

Node-level attacks cannot be run

Cannot control user access within the dashboard; as a result, there are increased security risks

Chaos Mesh is an open-source cloud-native tool. Using various fault simulations, Chaos Mesh helps organizations find system abnormalities that may occur during the development, testing, and production stages. It ships with a web user interface known as the Chaos Dashboard and can be added to DevOps workflows to spot potential areas of weakness and timeouts. To ensure resiliency, Chaos Mesh runs chaos experiments within Kubernetes environments and supports many fault-simulation scenarios for distributed systems. Chaos Mesh is able to deploy attacks that test network latency, system time manipulation, resource utilization, and more. The Chaos Dashboard can be used to modify and manage various forms of experiments within set timeframes.
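Chaos Mesh experiments are declared as Kubernetes custom resources. As a rough sketch, the Python below builds a network-latency experiment as a plain dictionary and prints it as JSON. The field names follow the `NetworkChaos` resource as I understand it from the Chaos Mesh documentation, so please verify them against the version you run. The experiment name, namespace, and `app` label are hypothetical.

```python
import json


def network_delay_manifest(name, namespace, app_label, latency="100ms"):
    """Build a NetworkChaos manifest that delays traffic for one pod."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "NetworkChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "delay",  # inject latency rather than loss
            "mode": "one",      # target a single matching pod
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": {"app": app_label},
            },
            "delay": {"latency": latency},
        },
    }


if __name__ == "__main__":
    manifest = network_delay_manifest("checkout-delay", "staging", "checkout")
    print(json.dumps(manifest, indent=2))
```

Since Kubernetes accepts JSON as well as YAML, the printed output can be saved to a file and applied like any other manifest.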

Key Features

Chaos Mesh uses a Kubernetes-based interface with full automation and graphical capabilities, and it has been used to test high-visibility distributed systems such as Apache APISIX and RabbitMQ

Chaos Mesh technology is able to test various scenarios using event-driven fault simulations

Chaos Mesh provides the ability to design experiments on the platform using different variables and status checks

Cost

As an open-source chaos tool, Chaos Mesh is free to use without a commercial license.

Should I use Chaos Mesh?

Chaos Mesh offers an open-source technology that can be used in Kubernetes to design and manage automated experiments. However, be wary of certain limitations to the technology. Predicting failures can be a cumbersome task due to the complexities in cloud operations. Unreliable functions and outages can result in a downgraded reputation and a loss of consumer trust.

Chaos Monkey

Pros

Configurable technology allows for easy monitoring and scheduling of attacks

Open-source software has no licensing costs

Extensive development history

Cons

Can only perform one type of experiment

Attacks are randomized and users have limited control of the blast radius

Requires writing custom code

Chaos Monkey is an open-source chaos engineering tool originally created by Netflix developers. It was developed to help test their system reliability and resiliency after moving to the AWS cloud. The software works by launching continuous, unpredictable attacks: its fundamental approach is to terminate one or more virtual machine instances. The configurability of Chaos Monkey allows for easy scheduling and close monitoring. The technology is easily replicable but can cause headaches if users are unprepared for the aftermath of attacks. Users can check for outages prior to deployment but must be able to write and edit custom Go code.
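The core idea of random instance termination is simple enough to simulate. The toy sketch below is not Chaos Monkey's code and does not call any cloud API. It only picks a random victim from an in-memory fleet. The function names are hypothetical, and skipping groups with a single instance is a safeguard I added for the example.

```python
import random


def pick_victim(instances_by_group, rng=random):
    """Choose one group at random, then one instance within it."""
    # Assumption for this sketch: never touch a group's last instance.
    eligible = {group: ids for group, ids in instances_by_group.items()
                if len(ids) > 1}
    if not eligible:
        return None
    group = rng.choice(sorted(eligible))
    return group, rng.choice(eligible[group])


def terminate(instances_by_group, group, instance_id):
    """Simulate termination by removing the instance from the fleet."""
    instances_by_group[group].remove(instance_id)


if __name__ == "__main__":
    fleet = {"api": ["i-1", "i-2", "i-3"], "worker": ["i-9"]}
    victim = pick_victim(fleet)
    if victim is not None:
        terminate(fleet, *victim)
    print(fleet)
```

The interesting part of such an experiment is what happens afterwards: whether the remaining instances absorb the load and whether a replacement comes up without manual intervention.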

Key Features

Detects systems bottlenecks to help limit disruption to production environments

The ability to test resiliency and availability of applications at an infra level

Tests can be scheduled during certain timeframes

Allows for easy monitoring

Cost

As open-source software, Chaos Monkey is free to use without a commercial license.

Should I use Chaos Monkey?

Chaos Monkey is a popular chaos engineering tool. While it may have revolutionized the open-source community, it is far less practical for contemporary use. Chaos Monkey is useful to an extent, but users must take into account its limitations and arduous deployment process.

Gremlin

Pros

Easy-to-use UI offers a variety of attacks and tests, which makes it easy to onboard teams

API support for building custom integrations

Evaluates reliability based on a variety of different factors

Cons

Software is not customizable

Challenging to integrate experiment JSON files into the software delivery pipeline

Minimal reporting capabilities

Gremlin is the first hosted chaos engineering platform designed to improve web-based reliability. Offered as software-as-a-service (SaaS), Gremlin is able to test system resiliency using multiple attack types. Users provide system inputs as a means of determining which type of attack will provide the most optimal results. Tests can be performed in conjunction with one another as a means of facilitating comprehensive infrastructural assessments.

Key Features

Injecting failures in a precise and controlled manner

Custom scenarios that combine attacks at multiple levels of the system

Testing process for memory leaks, latency injections, disk fill-ups, and more

GameDay feature

Reliability score based on predefined tests

Cost

Gremlin's pricing has changed over the years, ranging from per-agent pricing to attacks-per-target pricing, to match the frequency of testing a team requires.

Should I use Gremlin?

As the world's first managed enterprise chaos engineering technology, Gremlin provides users with the ability to launch dozens of attack vectors, stop and roll back attacks, and improve system reliability. Designed with the mission of creating a sustainable and reliable internet, Gremlin pinpoints software weaknesses to minimize revenue loss and negative systemic impacts.

Conclusion

In conclusion, chaos engineering continues to play a pivotal role in ensuring the resilience and reliability of modern software systems, especially distributed systems, offering organizations valuable insights and tools to proactively address potential failures and disruptions.

NOTE: I'm constantly delighted to receive feedback. Whether you spot an error, have a suggestion for improvement, or just want to share your thoughts, please don't hesitate to comment/reach out. I truly value connecting with readers!
