Chaos Engineering: Breaking Things on Purpose

July 28, 2018

infrastructure devops

Breaking things on purpose sounds counterintuitive, but controlled failure is how we build resilient systems. Chaos engineering surfaces the hidden assumptions in your architecture before they cause outages.

This post covers how to start with chaos engineering without breaking production.

The Historical Context

To understand where we are, we need to understand where we’ve been. The DevOps ecosystem has evolved significantly over the past decade, responding to changing requirements and lessons learned from production systems.

Chaos Engineering: Breaking Things on Purpose didn’t emerge in isolation. It’s the result of collective experience—countless hours of debugging, scaling, and refactoring. Every major advancement in our field builds on the frustrations and insights of practitioners who came before.

This progression reflects the maturation of our industry. We’re moving from ad-hoc solutions to principled approaches, from reactive firefighting to proactive architecture.

Strategic Implications

Netflix tooling principles entering wider enterprise. This is more than just a technical detail—it’s about operational efficiency and leverage. When evaluating new technology, I ask three questions:

Does it reduce cognitive load for the team?
Does it improve velocity in the long run?
Is the ecosystem stable enough to bet our business on?

Chaos Engineering: Breaking Things on Purpose deserves evaluation against these criteria. The answer isn’t always obvious, and it depends heavily on your specific context.

A Deep Dive into the Mechanics

Let’s get technical. What’s actually happening under the hood?

At its heart, this concept relies on a few fundamental principles of computer science that we often take for granted. Concepts like idempotency, immutability, and separation of concerns are front and center here.

When implemented correctly, it allows for a level of decoupling that we’ve struggled to achieve with previous generations of tooling. But beware: this power comes with complexity. If you’re not careful, you can easily over-engineer your solution, creating a Rube Goldberg machine that is impossible to debug.

Simplicity and Concurrency

Go’s approach to concurrency is a perfect example of primitive simplicity. It doesn’t rely on complex thread management or callbacks.

package main

import (
    "fmt"
    "time"
)

func worker(id int, jobs <-chan int, results chan<- int) {
    for j := range jobs {
        fmt.Println("worker", id, "started  job", j)
        time.Sleep(time.Second) // Simulate expensive task
        fmt.Println("worker", id, "finished job", j)
        results <- j * 2
    }
}

func main() {
    const numJobs = 5
    jobs := make(chan int, numJobs)
    results := make(chan int, numJobs)

    // Spin up 3 workers
    for w := 1; w <= 3; w++ {
        go worker(w, jobs, results)
    }

    for j := 1; j <= numJobs; j++ {
        jobs <- j
    }
    close(jobs)

    for a := 1; a <= numJobs; a++ {
        <-results
    }
}

This pattern scales. It’s understandable. It’s maintainable. In a DevOps context, this reliability is paramount.

Common Pitfalls

The biggest DevOps pitfall is tooling without culture. You can’t buy DevOps—you have to build it. Tools enable practices, but practices require human investment.

Another common mistake is over-automating before understanding the process. Automate what you already do well. Don’t automate chaos.

Start with the pain points, not the blog posts.

Final Thoughts

Breaking things on purpose sounds counterintuitive, but it’s how we build truly resilient systems. Chaos engineering reveals the assumptions hiding in your architecture. Start small—you don’t need a full GameDay to learn from controlled failure.

Keep building. Keep learning.