The CrowdStrike Outage: A Lesson in Deployment Rings
On July 19, 2024, a CrowdStrike Falcon sensor update caused Windows machines worldwide to blue-screen. Airlines grounded flights. Hospitals went manual. It is widely considered the largest IT outage in history. Here's what happened and what we can learn from it.
What Happened
Timeline
04:09 UTC: CrowdStrike releases channel file update
04:09-05:27 UTC: Systems receiving update begin crashing
05:27 UTC: CrowdStrike reverts the update
~8.5 million Windows devices affected
The Bug
A content update (Channel File 291) for Falcon's kernel-level sensor contained a logic error that triggered an out-of-bounds memory read. In kernel space, an unhandled memory fault means a Blue Screen of Death (BSOD).
Why Kernel Level
Falcon operates at kernel level to:
├── Detect rootkits
├── Monitor system calls
├── Prevent tampering
└── See everything
But kernel bugs = system crash
Why It Was So Bad
Global Simultaneous Deployment
Traditional rollout:
├── Canary (1%) ← Catch issues here
├── Early adopters (5%)
├── Production (25%)
└── Full (100%)
CrowdStrike rollout:
└── Everyone (100%) ← All at once
All machines got the update within ~78 minutes.
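To put that blast radius in numbers, a back-of-the-envelope comparison (illustrative Python; the 1% canary figure is an assumption taken from the staged-rollout pattern above):

```python
# Blast-radius comparison: all-at-once vs. a 1% canary ring
fleet = 8_500_000        # affected Windows devices (reported figure)
canary_fraction = 0.01   # a 1% canary, per the staged rollout pattern

all_at_once = fleet
canary_only = int(fleet * canary_fraction)

print(f"All-at-once exposure: {all_at_once:,} machines")
print(f"1% canary exposure:   {canary_only:,} machines")
```

A bug caught in the canary ring touches 85,000 machines instead of 8.5 million: two orders of magnitude less cleanup.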
Kernel-Mode Failure
User-mode crash: Application dies
Kernel-mode crash: System dies
Falcon runs in kernel mode.
No Automated Rollback
Normal software: Crash → Rollback → Works
Kernel crash: Crash → Can't boot → Can't rollback
Recovery required manual intervention on each machine.
The Recovery
To fix each machine:
1. Boot into Safe Mode
2. Navigate to C:\Windows\System32\drivers\CrowdStrike
3. Delete "C-00000291*.sys"
4. Reboot
5. Repeat 8.5 million times
With BitLocker encryption, you need the recovery key before you can even reach the file. Organizations without centralized key management had an even bigger problem.
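At single-machine scale, the deletion step is trivial to script. A minimal sketch (the helper name is hypothetical, not CrowdStrike tooling; in reality this had to run from Safe Mode or the recovery environment, which is exactly why it didn't scale):

```python
# Sketch: remove faulty channel files from a driver directory
# (hypothetical helper, not CrowdStrike's actual remediation tool)
import glob
import os

def remove_bad_channel_files(driver_dir: str) -> list:
    """Delete files matching the faulty C-00000291*.sys pattern."""
    removed = []
    for path in glob.glob(os.path.join(driver_dir, "C-00000291*.sys")):
        os.remove(path)  # requires the machine to be bootable at all
        removed.append(path)
    return removed
```

The script is one line of logic. The 8.5 million Safe Mode boots around it were the problem.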
Deployment Ring Patterns
What CrowdStrike Should Have Done
Ring 0: Internal (1 hour)
├── CrowdStrike's own systems
└── Catch obvious crashes
Ring 1: Canary (4 hours)
├── Opt-in early adopters
└── ~0.1% of fleet
Ring 2: Early (24 hours)
├── Mix of customer types
└── ~5% of fleet
Ring 3: Broad (48 hours)
├── General availability
└── ~50% of fleet
Ring 4: Full (96 hours)
└── All remaining systems
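The ring plan above can be sketched as data plus a health gate between rings. A sketch under stated assumptions: `deploy`, `soak`, and `healthy` are caller-supplied callbacks, and the internal-ring fraction is invented; the other fractions follow the plan:

```python
# Ring schedule from the plan above: (name, soak hours, fleet fraction)
RINGS = [
    ("internal", 1, 0.0001),  # Ring 0 fraction is an assumption
    ("canary",   4, 0.001),   # ~0.1% of fleet
    ("early",   24, 0.05),    # ~5%
    ("broad",   48, 0.50),    # ~50%
    ("full",    96, 1.00),    # everyone else
]

def run_rollout(deploy, soak, healthy):
    """Advance ring by ring; halt the moment the fleet looks unhealthy."""
    for name, hours, fraction in RINGS:
        deploy(name, fraction)
        soak(hours)
        if not healthy():
            return f"halted at {name}"
    return "complete"
```

An update that fails its health check in the canary ring stops with ~0.1% of the fleet exposed rather than 100%.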
Microsoft’s Ring Model
Microsoft uses a similar ring model for Windows updates:
├── Dev Ring: Daily builds (internal)
├── Canary Ring: Weekly (volunteers)
├── Dev Channel: Biweekly (enthusiasts)
├── Beta Channel: Monthly (early adopters)
├── Release Preview: Pre-release (IT pros)
└── General Availability: Everyone
Feature Flags
# Control rollout percentage with a feature flag
import hashlib

def should_apply_update(machine_id: str) -> bool:
    rollout_percentage = get_config("channel_file_update_rollout")
    # Deterministic but evenly distributed; hashlib is stable across
    # processes, unlike the built-in hash(), which is salted per run
    bucket = int.from_bytes(hashlib.sha256(machine_id.encode()).digest()[:4], "big") % 100
    return bucket < rollout_percentage
Automated Rollback
For Applications
# Kubernetes: halt a bad rollout automatically
# (progressDeadlineSeconds marks the rollout as failed; the rollback
# itself still needs `kubectl rollout undo` or a controller watching it)
apiVersion: apps/v1
kind: Deployment
spec:
  progressDeadlineSeconds: 600  # Fail if not healthy in 10 min
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
For Kernel Modules
# Boot fallback (if implemented)
# 1. Keep the previous known-good version on disk
# 2. Boot counter: if the machine crashes within 5 minutes,
#    don't load the new module on the next boot
# 3. Recovery becomes automatic instead of manual
Windows offers something similar with "Last Known Good Configuration", but it wasn't enough to prevent this boot loop.
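That fallback rule fits in a few lines. A conceptual sketch, not any real OS API:

```python
# Sketch: choose which module version to load at boot.
# If the new version has already crashed the machine, fall back.
def select_module_version(new: str, known_good: str,
                          crashes_with_new: int, max_crashes: int = 1) -> str:
    if crashes_with_new >= max_crashes:
        return known_good  # stop retrying the version that crashes us
    return new
```

With this in place, the second boot after a bad update loads the known-good module, and the machine recovers without a human at the keyboard.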
Monitoring for Fast Detection
# Crash rate monitoring (get_crash_rate, alert, etc. are assumed helpers)
def monitor_crash_rate() -> bool:
    current_rate = get_crash_rate(window_minutes=15)
    baseline = get_baseline_crash_rate()
    if current_rate > baseline * 10:  # 10x normal
        alert("Abnormal crash rate detected")
        return True
    return False

# Automatic halt: stop the rollout as soon as the canary looks unhealthy
def deploy_update(version):
    start_rollout(version, percentage=1)  # Canary
    wait(minutes=15)
    if monitor_crash_rate():
        rollback(version)
        return "FAILED"
    # Continue through the remaining rings...
Organizational Lessons
1. Test in Prod-Like Environments
Dev environment: 10 machines
Production: 8.5 million machines
That's an 850,000x difference.
2. Have a Rollback Plan
Question: What happens if this update crashes systems?
Answer: We delete the file manually on each machine.
Question: How long does that take for 8.5 million machines?
Answer: ...oh.
3. Blast Radius Limits
Never deploy to 100% at once.
Cap at 1% for initial validation.
Monitor before expanding.
4. Out-of-Band Recovery
If the update mechanism breaks, how do we fix it?
Answer should not be "boot into safe mode manually."
For Your Systems
Checklist
[ ] Deployment rings defined?
[ ] Canary percentage <1%?
[ ] Automated monitoring for rollout?
[ ] Automatic rollback triggers?
[ ] Manual rollback documented?
[ ] Recovery doesn't require the system to boot?
[ ] BitLocker/encryption keys accessible?
Ring Implementation
# Kubernetes example with Argo Rollouts
apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
  strategy:
    canary:
      steps:
        - setWeight: 1
        - pause: {duration: 1h}
        - analysis:
            templates:
              - templateName: error-rate
        - setWeight: 10
        - pause: {duration: 4h}
        - setWeight: 50
        - pause: {duration: 24h}
        - setWeight: 100
Final Thoughts
CrowdStrike’s outage wasn’t caused by a sophisticated attack or an unavoidable edge case. It was caused by:
- Not using deployment rings
- Not testing the specific file that crashed
- Not having fast automated detection
- Not having automated rollback
These are solved problems. Every deployment system should have these safeguards.
The cost of careful deployment is hours. The cost of reckless deployment is billions.
Deploy carefully. The blast radius matters.