Responding to a “system down” emergency is an IT professional’s nightmare. When an application goes offline, stress, cost, and urgency all rise at once.
Planning can help your organization prevent such an emergency, but what is the right amount of planning versus the cost involved? In a complex landscape, there are considerations for cloud, hybrid cloud, and on-premise environments.
Cloud environments, in particular, have a cost associated with infrastructure. So what is the right amount of testing for a failure scenario?
Chaos engineering goes beyond traditional testing by intentionally trying to break your environment at the points where it is most vulnerable. By defining and controlling your experiments, you can apply chaos testing to reduce risk in your production environment and, ultimately, lower your cloud spending.
Let’s look at what is meant by chaos engineering and how it can be put into practice.
What Is Chaos Engineering?
First and foremost: chaos engineering is not a replacement for your QA testing. Instead, it is meant to supplement your testing by applying scenarios that cannot be simulated in unit or integration tests.
You are essentially “stressing” your environment to test its resiliency and increase your confidence in your application. It is meant to help developers and software engineers move past the common fallacies of distributed computing, such as assuming that the network is reliable and secure, that bandwidth is unlimited, and that latency is zero. You may think you’ve designed your system to withstand all scenarios, but without testing, how do you know if everything is working as intended?
Chaos engineering is about identifying and building controlled experiments within your environment and reviewing how your system responds. If problems are identified, you can correct them.
You can leverage your cloud environment for chaos engineering. Running your first tests directly in production poses a significant risk to your users, so it is better to start in a cloud sandbox environment where you can conduct your tests safely.
Principles of Chaos Engineering
Cloud computing has had a two-fold impact on the concept of chaos engineering. First, cloud environments have a level of built-in uncertainty that makes chaos engineering necessary. And at the same time, cloud environments make it easier to test your scenarios because of their scalability.
According to principlesofchaos.org, you should design your chaos engineering around the following principles.
Building a Hypothesis Around Steady-State Behavior
You need to know how your system functions under normal circumstances. This creates your baseline around which you will apply your chaos engineering. Unlike other testing, chaos engineering will verify that your system works rather than focusing on how it works.
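To make the idea of a baseline concrete, here is a minimal sketch of a steady-state check. It assumes you already collect a numeric metric such as requests per second; the function name `steady_state_holds`, the 10% tolerance, and the sample values are all hypothetical and chosen only for illustration.

```python
from statistics import mean


def steady_state_holds(baseline, observed, tolerance=0.10):
    """Return True if the observed mean stays within `tolerance`
    (expressed as a fraction) of the baseline mean."""
    base = mean(baseline)
    if base == 0:
        raise ValueError("baseline mean must be non-zero")
    deviation = abs(mean(observed) - base) / base
    return deviation <= tolerance


# Hypothetical requests-per-second samples, not real measurements.
baseline_rps = [100, 102, 98, 101, 99]
during_experiment_rps = [97, 95, 99, 96, 98]

print(steady_state_holds(baseline_rps, during_experiment_rps))
```

The hypothesis is stated as "the metric stays close to its baseline while the fault is active." Which metric and tolerance make sense depends entirely on your system.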
Real-World Event Variation
You need to define real-world scenarios that could happen in your environment. Think about both the impact and frequency. This will help you prioritize your chaos engineering.
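One simple way to turn impact and frequency into a priority order is to score each scenario and sort by the product. The sketch below assumes a 1-to-5 scale for both values; the `Scenario` class, the scale, and the example scenarios are illustrative assumptions rather than a standard method.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    name: str
    impact: int     # assumed scale: 1 (minor) to 5 (severe)
    frequency: int  # assumed scale: 1 (rare) to 5 (frequent)

    @property
    def priority(self) -> int:
        # Higher product means the scenario should be tested sooner.
        return self.impact * self.frequency


def prioritize(scenarios):
    """Return scenarios ordered from highest to lowest priority."""
    return sorted(scenarios, key=lambda s: s.priority, reverse=True)


# Hypothetical scenarios and scores for illustration only.
scenarios = [
    Scenario("single instance crash", impact=2, frequency=4),
    Scenario("database failover", impact=5, frequency=2),
    Scenario("regional outage", impact=5, frequency=1),
]

for s in prioritize(scenarios):
    print(s.name, s.priority)
```

A multiplicative score is only one option. Some teams may prefer to weight impact more heavily so that rare but severe events are not pushed to the bottom of the list.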
Running Experiments in Production
While you want to start by testing in a sandbox environment, eventually, you will need to introduce your chaos engineering into production. However, you’ll want to minimize the blast radius so you don’t cause unnecessary pain for your users.
You may find issues in your sandbox environment and fix them, but you still need assurance that the fix also holds in production.
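Limiting the blast radius often comes down to targeting only a small share of your resources. The following sketch picks a random subset of instances for an experiment. The function `pick_targets`, the 10% fraction, and the instance names are hypothetical; a real setup would pull the instance list from your own inventory or cloud provider.

```python
import math
import random


def pick_targets(instances, fraction=0.1, seed=None):
    """Choose a small random subset of instances for an experiment.

    Always returns at least one instance when the list is non-empty.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not instances:
        return []
    count = max(1, math.floor(len(instances) * fraction))
    rng = random.Random(seed)  # a seed makes the selection repeatable
    return rng.sample(list(instances), count)


# Hypothetical fleet of 20 instances.
fleet = [f"web-{i}" for i in range(20)]
targets = pick_targets(fleet, fraction=0.1, seed=1)
print(targets)
```

Starting with a small fraction and increasing it only after earlier runs pass keeps the number of affected users low while you build confidence.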
Automating Experiments to Run Continuously
Once you have defined and developed your test scenarios, automate them. Running manual experiments is both labor-intensive and unsustainable. Fortunately, cloud environments are well-suited for automation.
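An automated experiment usually follows the same loop: confirm the system is healthy, inject a fault, check the steady state, and roll back no matter what happens. Below is a minimal sketch of that loop. The `run_experiment` function and the simulated `state` dictionary are assumptions made for illustration; in practice the three callbacks would call your own tooling, and a scheduler or pipeline would trigger the run.

```python
def run_experiment(inject, rollback, steady_state_ok, checks=3):
    """Inject a fault, verify steady state, and always roll back."""
    if not steady_state_ok():
        return "skipped"  # system is already unhealthy; do not add chaos
    inject()
    try:
        for _ in range(checks):
            if not steady_state_ok():
                return "failed"  # hypothesis disproved: a weakness was found
        return "passed"
    finally:
        rollback()  # runs on every path once the fault has been injected


# Simulated system state; a real experiment would call your own tooling here.
state = {"fault_active": False, "healthy": True}


def inject():
    state["fault_active"] = True


def rollback():
    state["fault_active"] = False


def steady_state_ok():
    return state["healthy"]


result = run_experiment(inject, rollback, steady_state_ok)
print(result, state)
```

Placing the rollback in a `finally` clause means the fault is removed even if a check raises an error, which matters once nobody is watching the run by hand.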
Best Practices for Chaos Engineering
The name alone implies that chaos engineering is “uncontrolled.” In practice, the opposite is true: you are applying systematic experiments and analyzing the results.
If you follow best practices, you can maximize the benefits of your chaos engineering and prepare the alerts you will need in the event of a failure. Skipping them can lead to increased costs and unhelpful insights.
Containers are one way to keep experiments contained. A container is a small, isolated segment of your environment, so you can deploy experiments in isolation and avoid widespread disruption. You can attack a single container and create more containers as needed.
Manual vs. Automated Testing
You can perform a dry run of your tests manually in a simulated environment. This allows for more control and closer monitoring. Once you have completed the simulation, you can move to testing in a more representative environment.
Move to automated testing only after you have completed your manual testing.
Known vs. Unknown Testing
It is one thing to test for known scenarios. It is quite another to test for the unknown. Unknown scenarios may include elements that you are aware of but whose impact you don’t understand.
For example, you may know what impact a short amount of downtime would have. You may not know the effect of a total system shutdown or cyberattack. Your chaos engineering should test for both knowns and unknowns.
Have Your Backups Ready
Be prepared for the possibility that your controlled chaos will require you to recreate your environment. Keep backups ready so that you can restore quickly and continue with additional experiments.
Reducing Cloud Spending With Chaos Engineering
Overall, chaos engineering can reduce your cloud spending. As more businesses turn to cloud computing, there will be an increased focus on costs.
You will need to consider the associated costs of cloud resources for chaos engineering. However, compare this to the business costs associated with an outage or other issue.
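A rough back-of-the-envelope comparison can help frame that trade-off. The sketch below multiplies out a yearly testing cost and a yearly expected outage cost. Every figure is hypothetical and exists only to show the arithmetic; substitute your own estimates.

```python
def expected_outage_cost(outages_per_year, hours_per_outage, cost_per_hour):
    """Estimated yearly business cost of outages."""
    return outages_per_year * hours_per_outage * cost_per_hour


def experiment_cost(hours_per_run, runs_per_year, infra_cost_per_hour):
    """Estimated yearly infrastructure cost of chaos experiments."""
    return hours_per_run * runs_per_year * infra_cost_per_hour


# Hypothetical figures for illustration only.
outage = expected_outage_cost(outages_per_year=2, hours_per_outage=3,
                              cost_per_hour=10_000)
testing = experiment_cost(hours_per_run=4, runs_per_year=12,
                          infra_cost_per_hour=50)

print(f"Expected outage cost per year: {outage}")
print(f"Chaos testing cost per year:   {testing}")
```

The comparison is only as good as the estimates behind it, and it ignores engineering time, but it gives you a starting point for deciding how much testing is justified.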
While the necessary testing environment will incur additional costs, you can gain savings elsewhere. For example, with chaos engineering, you can:
- Determine the size of infrastructure needed and balance idle capacity against demand on resources.
- Determine the right redundancies needed, depending on the type of outage.
- Find unused or ineffective resources that could be increasing your overall costs.
Because of the inherent costs associated with cloud infrastructure, you need to be mindful during your testing. You need to determine the minimum amount of testing required to achieve the intended result. Proper planning will ensure you do not over-allocate cloud resources or incur unnecessary costs.
The Right Tools and Alerts for Monitoring Your Environment
Chaos engineering allows you to run the tests to find your system weaknesses. But how can you ensure that you have the right monitoring and alerts in place?
Your chaos engineering and live environments should have real-time monitoring, along with the appropriate alerts. If your team is not made immediately aware of the problem, you cannot respond.
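As an illustration of what an alert rule can look like, the sketch below raises an alert only when a metric exceeds a threshold for several samples in a row, which helps filter out one-off spikes. The function `should_alert`, the 500 ms threshold, and the latency samples are hypothetical; most monitoring tools offer a comparable rule out of the box.

```python
def should_alert(samples, threshold, consecutive=3):
    """Return True if `consecutive` samples in a row exceed `threshold`."""
    streak = 0
    for value in samples:
        # Extend the streak on a breach, reset it otherwise.
        streak = streak + 1 if value > threshold else 0
        if streak >= consecutive:
            return True
    return False


# Hypothetical latency samples in milliseconds.
latency_ms = [120, 130, 510, 530, 560, 140]
print(should_alert(latency_ms, threshold=500))
```

Running a chaos experiment is also a good moment to confirm that a rule like this actually fires and reaches the right people.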