Chaos Engineering: Breaking Things on Purpose
Netflix streams to 200+ million subscribers. If their service goes down, millions of people can’t watch their shows. So Netflix invented a practice: deliberately break things in production to find weaknesses before customers do.
This is chaos engineering.
The Principle
Traditional testing asks: “Does this work?” Chaos engineering asks: “What happens when this fails?”
You know things will fail:
- Servers crash
- Networks partition
- Dependencies time out
- Disks fill up
- Memory leaks
The question is: does your system handle it gracefully?
Chaos Monkey: The Original Tool
Netflix’s Chaos Monkey randomly terminates EC2 instances in production. Teams quickly learned to design for instance failure.
The mindset shift is powerful:
- Before: “I hope our instances don’t crash”
- After: “Our system works fine when instances crash”
The Principles of Chaos Engineering
From the Chaos Engineering manifesto:
1. Build a Hypothesis Around Steady State
Define what “normal” looks like:
- Request latency < 200ms
- Error rate < 0.1%
- Orders processing successfully
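A steady-state definition like the one above can be encoded directly as a check. A minimal sketch, with hypothetical metric names and the thresholds from the list:

```python
# Steady-state check: thresholds and metric names are illustrative.
def in_steady_state(metrics: dict) -> bool:
    """Return True if the system looks 'normal' by our definition."""
    return (
        metrics["p99_latency_ms"] < 200      # request latency < 200ms
        and metrics["error_rate"] < 0.001    # error rate < 0.1%
        and metrics["orders_per_min"] > 0    # orders still flowing
    )

print(in_steady_state(
    {"p99_latency_ms": 150, "error_rate": 0.0005, "orders_per_min": 42}
))  # True
```

Every chaos experiment then becomes a claim: this function still returns True while the fault is active.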
2. Vary Real-World Events
Introduce realistic failures:
- Kill instances
- Inject latency
- Drop network packets
- Fill disk space
- Return errors from dependencies
3. Run Experiments in Production
Staging isn’t production. Different traffic patterns, different scale, different failure modes.
Start small. Have kill switches. But ultimately, you need production experiments.
4. Automate Experiments
Chaos should be continuous, not a one-time effort. Automate the experiments and run them regularly.
5. Minimize Blast Radius
Start with small experiments. Gradually increase scope as confidence grows.
Practical Chaos Experiments
Instance Failure
Hypothesis: Service continues operating when one instance dies.
Experiment: Kill a random instance.
Observe: Does the load balancer route around it? Do requests succeed?
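A toy simulation of this experiment (all names hypothetical; a real run would terminate an actual instance through your cloud provider's API):

```python
import random

class Instance:
    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"

def route(pool, request):
    """Naive load balancer: try instances in random order, skip dead ones."""
    for instance in random.sample(pool, len(pool)):
        try:
            return instance.handle(request)
        except ConnectionError:
            continue
    raise RuntimeError("no healthy instances")

pool = [Instance(f"api-{i}") for i in range(3)]
random.choice(pool).alive = False     # the "chaos monkey" kills one instance
print(route(pool, "GET /health"))     # still served by a surviving instance
```

The hypothesis holds if `route` keeps succeeding with one instance dead; it fails loudly if your routing layer has no retry-on-failure behavior.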
Dependency Latency
Hypothesis: Service degrades gracefully when the database is slow.
Experiment: Inject 5-second latency into database calls.
Observe: Do timeouts trigger? Do circuit breakers open? Does the UI show appropriate messaging?
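This experiment can be sketched without touching a real database: the injected sleep stands in for the slow dependency, and the timeout-plus-fallback is the graceful degradation under test (all names and numbers here are illustrative):

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

INJECTED_LATENCY_S = 1.0   # pretend the database is slow

def slow_db_query():
    time.sleep(INJECTED_LATENCY_S)   # injected latency stands in for a slow DB
    return ["row1", "row2"]

def fetch_with_timeout(timeout_s=0.25):
    """Degrade gracefully: serve fallback data if the DB is too slow."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(slow_db_query)
        try:
            return future.result(timeout=timeout_s)
        except FutureTimeout:
            return ["cached-row"]    # fallback keeps the UI responsive

print(fetch_with_timeout())          # ['cached-row']
```

If there's no timeout in the call path, this experiment surfaces it immediately: every request hangs for the full injected latency.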
Network Partition
Hypothesis: System handles network splits between services.
Experiment: Block traffic between service A and service B.
Observe: Does each service continue independently? Are there cascading failures?
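A toy model of the partition (service names and payloads are made up): a flag stands in for the blocked network path, and the hypothesis is that service A still renders without B:

```python
class Partition:
    blocked = False    # flips on to simulate the network split

def call_service_b():
    if Partition.blocked:
        raise ConnectionError("network partition: B unreachable")
    return {"recommendations": ["a", "b"]}

def service_a_homepage():
    """Service A should render even when B is unreachable."""
    try:
        recs = call_service_b()["recommendations"]
    except ConnectionError:
        recs = []          # degrade: empty recommendations, not a 500
    return {"status": 200, "recommendations": recs}

Partition.blocked = True       # inject the partition
print(service_a_homepage())    # {'status': 200, 'recommendations': []}
```

A cascading failure looks like the opposite: the `ConnectionError` propagates and service A returns errors because an optional dependency vanished.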
Resource Exhaustion
Hypothesis: Alerting triggers before the disk fills completely.
Experiment: Gradually fill the disk to 90%.
Observe: Do alerts fire? Does the system handle a full disk gracefully?
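The alerting side of this experiment reduces to a threshold check; a sketch with illustrative 80%/90% thresholds:

```python
def check_disk_alert(used_pct, warn_at=80, page_at=90):
    """Return which alerts should have fired at this disk usage level."""
    alerts = []
    if used_pct >= warn_at:
        alerts.append("warn")
    if used_pct >= page_at:
        alerts.append("page")
    return alerts

# Gradually "fill" the disk and record when alerts fire.
for used in (50, 85, 92):
    print(used, check_disk_alert(used))
# 50 []
# 85 ['warn']
# 92 ['warn', 'page']
```

The experiment verifies the real pipeline matches this expectation: if the disk reaches 92% and nobody was paged, you've found the gap before a 3 a.m. outage does.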
Tools of the Trade
Chaos Monkey (Netflix OSS)
The original. Kills EC2 instances.
# Configure which apps/regions to target
# Set probability and schedule
# Chaos Monkey does the rest
Gremlin
Commercial platform with polished UI and many failure types:
- CPU/Memory/Disk attacks
- Network attacks (latency, packet loss, blackhole)
- State attacks (process kill, time travel)
Litmus (Kubernetes)
Cloud-native chaos for K8s:
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: app-chaos
spec:
  appinfo:
    appns: default
    applabel: 'app=nginx'
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
Toxiproxy (Shopify)
Simulates network conditions:
# Create a proxy for MySQL
Toxiproxy.create(name: 'mysql', listen: 'localhost:3306', upstream: 'mysql:3306')
# Add 1 second of latency
Toxiproxy[:mysql].downstream(:latency, latency: 1000).apply do
  # Your tests run here with 1 second of latency
end
tc and iptables
Low-level Linux tools for network chaos:
# Add 100ms latency to eth0
tc qdisc add dev eth0 root netem delay 100ms
# Drop 10% of packets ('tc qdisc change' if a netem qdisc is already attached)
tc qdisc add dev eth0 root netem loss 10%
# Block traffic to specific IP
iptables -A OUTPUT -d 10.0.0.5 -j DROP
Running Your First Experiment
Step 1: Pick a Target
Start with a non-critical service. Don’t chaos your payment system on day one.
Step 2: Define Steady State
What metrics define “working”?
- Error rate
- Latency p99
- Throughput
- Business metrics
Step 3: Form a Hypothesis
“When we kill one API instance, error rate stays below 1%.”
Step 4: Run the Experiment
- Have rollback ready
- Monitor in real-time
- Start during low-traffic periods
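The steps above can be sketched as a minimal experiment harness with the kill switch built in. Everything here is hypothetical: the fault-injection hooks and observed error rate would come from your real tooling and monitoring:

```python
def run_experiment(inject_fault, revert_fault, error_rate, abort_threshold=0.01):
    """Run a chaos experiment with a kill switch: abort and revert the fault
    if the observed error rate exceeds the threshold."""
    inject_fault()
    try:
        if error_rate() > abort_threshold:
            return "aborted"       # kill switch: hypothesis falsified
        return "passed"
    finally:
        revert_fault()             # always roll back, pass or fail

state = {"fault": False}
result = run_experiment(
    inject_fault=lambda: state.update(fault=True),
    revert_fault=lambda: state.update(fault=False),
    error_rate=lambda: 0.002,      # observed during the experiment
)
print(result, state)               # passed {'fault': False}
```

The `finally` block is the part that matters: the rollback runs no matter how the experiment ends, which is exactly the "have rollback ready" discipline above.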
Step 5: Analyze and Fix
Did it behave as expected? If not, you found a weakness. Fix it.
Step 6: Repeat
Keep running experiments. Add new failure scenarios. Increase blast radius over time.
Organizational Readiness
Chaos engineering requires:
Observability: You need to see what’s happening. Metrics, logs, traces.
Runbooks: When experiments reveal problems, can you respond?
Blameless Culture: Experiments will find bugs. Celebrate finding them before customers do.
Gradual Rollout: Start with gamedays, progress to automated experiments.
What Not to Do
- Don’t start in production without practice: Use staging first
- Don’t surprise your team: Everyone should know experiments are running
- Don’t skip the hypothesis: Random destruction isn’t engineering
- Don’t ignore the results: Fix what you find
Final Thoughts
Chaos engineering is counterintuitive. We’re taught to prevent failures, not cause them.
But controlled failures are gifts. They show you weaknesses on your terms, not during a midnight outage.
Start small. Kill one instance. See what happens. Your systems—and your confidence in them—will improve dramatically.
Break it on your terms, so it doesn't break on your customers'.