Chaos Engineering: Breaking Things on Purpose

devops reliability chaos-engineering

Netflix streams to 200+ million subscribers. If their service goes down, millions of people can’t watch their shows. So Netflix invented a practice: deliberately break things in production to find weaknesses before customers do.

This is chaos engineering.

The Principle

Traditional testing asks: “Does this work?” Chaos engineering asks: “What happens when this fails?”

You know things will fail: servers crash, networks partition, disks fill up, dependencies time out.

The question is: does your system handle it gracefully?

Chaos Monkey: The Original Tool

Netflix’s Chaos Monkey randomly terminates EC2 instances in production. Teams quickly learned to design for instance failure.

The mindset shift is powerful: instance failure stops being an emergency and becomes a routine, expected event that every service must tolerate by design.

The Principles of Chaos Engineering

From the Chaos Engineering manifesto:

1. Build a Hypothesis Around Steady State

Define what “normal” looks like in measurable terms: throughput, error rate, latency percentiles. Without a baseline, you can’t tell whether an experiment degraded anything.
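As a sketch of what that might look like in code (the metric names and thresholds here are illustrative, not from any particular tool), a steady-state check compares live metrics against an agreed baseline:

```python
import math
from dataclasses import dataclass

@dataclass
class SteadyState:
    """Baseline thresholds that define 'normal' for a service."""
    max_error_rate: float  # e.g. 0.01 -> at most 1% of requests may fail
    max_p99_ms: float      # ceiling for 99th-percentile latency

def p99(latencies_ms):
    """Nearest-rank 99th percentile of a list of latency samples."""
    ordered = sorted(latencies_ms)
    rank = max(0, math.ceil(0.99 * len(ordered)) - 1)
    return ordered[rank]

def within_steady_state(errors, total, latencies_ms, baseline):
    """True when the observed metrics stay inside the baseline envelope."""
    error_rate = errors / total if total else 0.0
    return (error_rate <= baseline.max_error_rate
            and p99(latencies_ms) <= baseline.max_p99_ms)
```

Every experiment then reduces to one question: does `within_steady_state` stay true while the failure is injected?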

2. Vary Real-World Events

Introduce failures that mirror real-world events: instances dying, network latency and packet loss, exhausted CPU or disk, a dependency going dark.

3. Run Experiments in Production

Staging isn’t production. Different traffic patterns, different scale, different failure modes.

Start small. Have kill switches. But ultimately, you need production experiments.

4. Automate Experiments

Chaos should be continuous, not a one-time effort. Automate the experiments and run them regularly.

5. Minimize Blast Radius

Start with small experiments. Gradually increase scope as confidence grows.
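One way to operationalize that ramp-up (a sketch; the schedule and 1%/25% bounds are assumptions, not a standard): target a small random fraction of the fleet and widen only after a clean run.

```python
import random

def pick_targets(hosts, fraction, seed=None):
    """Choose a random subset of hosts, capped at `fraction` of the fleet."""
    count = max(1, int(len(hosts) * fraction))
    return random.Random(seed).sample(hosts, count)

def next_fraction(current, clean_run, ceiling=0.25):
    """Double the blast radius after a clean run; shrink back to 1% otherwise."""
    return min(current * 2, ceiling) if clean_run else 0.01
```

Starting at 1% of hosts and doubling per clean run reaches the ceiling in a handful of iterations while keeping any single failed experiment small.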

Practical Chaos Experiments

Instance Failure

Hypothesis: Service continues operating when one instance dies.
Experiment: Kill a random instance.
Observe: Does the load balancer route around it? Do requests succeed?
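A minimal harness for this experiment might look like the following sketch; `kill_instance` and `check_health` are stand-ins for your cloud API and load-balancer health check, not real library calls:

```python
import random

def run_instance_kill_experiment(instances, kill_instance, check_health):
    """Kill one random instance, then verify the survivors still respond.

    kill_instance(instance_id) terminates an instance (e.g. via a cloud API).
    check_health(instance_id) returns True when the instance serves requests.
    Returns (victim, survivors_healthy).
    """
    victim = random.choice(instances)
    kill_instance(victim)
    survivors = [i for i in instances if i != victim]
    return victim, all(check_health(i) for i in survivors)
```

In a real run you would also re-check the steady-state metrics for a few minutes after the kill, since routing failures often show up as a latency spike rather than hard errors.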

Dependency Latency

Hypothesis: Service degrades gracefully when the database is slow.
Experiment: Inject 5-second latency into database calls.
Observe: Do timeouts trigger? Do circuit breakers open? Does the UI show appropriate messaging?
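To see why the timeout matters here, a sketch of a client-side guard (the function names are illustrative): the caller bounds its wait on the slow dependency and falls back instead of hanging.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

def call_with_timeout(fn, timeout_s, fallback):
    """Run fn(); if it takes longer than timeout_s, return fallback.

    Note: the slow call keeps running in its worker thread; this guard
    only bounds how long the *caller* waits.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(fn).result(timeout=timeout_s)
    except FutureTimeout:
        return fallback
    finally:
        pool.shutdown(wait=False)
```

If the latency experiment shows request threads piling up instead of returning the fallback, you have found exactly the weakness this experiment exists to expose.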

Network Partition

Hypothesis: System handles network splits between services.
Experiment: Block traffic between service A and service B.
Observe: Does each service continue independently? Are there cascading failures?

Resource Exhaustion

Hypothesis: Alerting triggers before the disk fills completely.
Experiment: Gradually fill the disk to 90%.
Observe: Do alerts fire? Does the system handle a full disk gracefully?
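The observation step can be automated with the standard library; a sketch, where the 90% threshold mirrors the experiment above:

```python
import shutil

def disk_usage_percent(path="/"):
    """Percent of the filesystem at `path` currently in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total

def should_alert(percent_used, threshold=90.0):
    """True once usage reaches the alerting threshold."""
    return percent_used >= threshold
```

Running this check alongside the fill lets you record exactly when the alert condition became true versus when your monitoring actually paged someone.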

Tools of the Trade

Chaos Monkey (Netflix OSS)

The original. Kills EC2 instances.

# Configure which apps/regions to target
# Set probability and schedule
# Chaos Monkey does the rest

Gremlin

Commercial platform with a polished UI and a broad catalog of failure types: resource attacks (CPU, memory, disk), network attacks (latency, packet loss, blackhole), and state attacks (process kill, host shutdown).

Litmus (Kubernetes)

Cloud-native chaos for K8s:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: app-chaos
spec:
  appinfo:
    appns: default
    applabel: 'app=nginx'
    appkind: 'deployment'
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete

Toxiproxy (Shopify)

Simulates network conditions:

# Create a proxy for MySQL
Toxiproxy.create(name: 'mysql', listen: 'localhost:3306', upstream: 'mysql:3306')

# Add 1 second of latency for the duration of the block
Toxiproxy[:mysql].downstream(:latency, latency: 1000).apply do
  # Your tests run here with 1 second of latency
end

tc and iptables

Low-level Linux tools for network chaos:

# Add 100ms latency to eth0
tc qdisc add dev eth0 root netem delay 100ms

# Drop 10% of packets (replaces the latency rule above;
# a second "add" on the root qdisc would fail)
tc qdisc change dev eth0 root netem loss 10%

# Block traffic to a specific IP
iptables -A OUTPUT -d 10.0.0.5 -j DROP

# Clean up when the experiment ends
tc qdisc del dev eth0 root netem
iptables -D OUTPUT -d 10.0.0.5 -j DROP

Running Your First Experiment

Step 1: Pick a Target

Start with a non-critical service. Don’t chaos your payment system on day one.

Step 2: Define Steady State

What metrics define “working”? Request success rate, latency percentiles, throughput. Capture a baseline before you inject anything.

Step 3: Form a Hypothesis

“When we kill one API instance, the error rate stays below 1%.”
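That hypothesis is directly checkable in code; a sketch (the function name and the outcome-list shape are illustrative):

```python
def hypothesis_holds(outcomes, max_error_rate=0.01):
    """Check 'error rate stays below 1%' over a list of request outcomes.

    `outcomes` holds one boolean per observed request (True = success).
    An empty list is treated as inconclusive, i.e. the hypothesis fails.
    """
    if not outcomes:
        return False
    return outcomes.count(False) / len(outcomes) < max_error_rate
```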

Step 4: Run the Experiment

Announce it to the team, watch your dashboards in real time, and keep a kill switch ready to abort if the impact exceeds the hypothesis.

Step 5: Analyze and Fix

Did it behave as expected? If not, you found a weakness. Fix it.

Step 6: Repeat

Keep running experiments. Add new failure scenarios. Increase blast radius over time.

Organizational Readiness

Chaos engineering requires:

Observability: You need to see what’s happening. Metrics, logs, traces.

Runbooks: When experiments reveal problems, can you respond?

Blameless Culture: Experiments will find bugs. Celebrate finding them before customers do.

Gradual Rollout: Start with gamedays, progress to automated experiments.

What Not to Do

Don’t run experiments you can’t observe or abort. Don’t start with your most critical service. Don’t surprise the on-call team; announce experiments in advance. And don’t skip the hypothesis: breaking things without a prediction is just breakage.

Final Thoughts

Chaos engineering is counterintuitive. We’re taught to prevent failures, not cause them.

But controlled failures are gifts. They show you weaknesses on your terms, not during a midnight outage.

Start small. Kill one instance. See what happens. Your systems—and your confidence in them—will improve dramatically.


Break it on purpose, so it doesn’t break by surprise.
