# Deep Learning's Hardware Lottery
Sara Hooker’s 2020 paper coined the term “hardware lottery”: the idea that the machine learning algorithms that win are often those that happen to run efficiently on available hardware. This explains more about AI’s trajectory than we might like to admit.
## The Thesis

> “Whether a research idea succeeds or fails often has more to do with whether it fits onto existing hardware than whether it’s the best approach.”
Transformers didn’t just win because attention is better than recurrence. They won because matrix multiplications map perfectly to GPUs.
## Historical Examples

### Neural Networks’ “Death”

1990s-2000s: Neural networks were considered dead.

```
Backpropagation → Complex → Too slow on CPUs
SVMs → Kernel trick → Efficient on CPUs
                  ↓
              SVMs "won"
```

Then GPUs became available for general computing:

```
Neural networks + GPUs → Fast again → Deep learning revolution
```
The math never changed. The hardware did.
### Transformers vs. RNNs

RNNs process sequences step by step:

```
Token 1 → Hidden state → Token 2 → Hidden state → ...
```

Inherently sequential. Hard to parallelize.

Transformers use attention:

```
All tokens → Matrix multiplication → Output
```

Massively parallel. Perfect for GPUs.

```
RNN on GPU:         limited speedup (sequential dependency)
Transformer on GPU: massive speedup (parallelizable)
```

Transformers simply map better to the available hardware.
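The contrast can be made concrete with a small NumPy sketch (toy shapes, a single attention head without learned projections; all names and sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 4                       # toy sequence length and hidden size
x = rng.standard_normal((n, d))   # token embeddings

# RNN: each hidden state depends on the previous one -> a chain of n small steps
W = rng.standard_normal((d, d))
h = np.zeros(d)
for t in range(n):                # this loop cannot be parallelized across t
    h = np.tanh(x[t] @ W + h)

# Attention: all pairwise interactions in one dense matmul -> one parallel kernel
scores = x @ x.T / np.sqrt(d)     # (n, n) similarity matrix
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
out = weights @ x                 # (n, d) contextualized tokens
```

The RNN loop must execute its n steps one after another, while the attention path is a couple of dense matrix products that a GPU can run as wide parallel kernels.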
## GPU Architecture Matters

### What GPUs Do Well
| Operation | GPU Speed |
|---|---|
| Matrix multiply | Very fast |
| Element-wise ops | Very fast |
| Random access | Slow |
| Branching | Slow |
| Memory-bound ops | Bottleneck |
### What This Rewards
Algorithms that work best:
- Dense matrix operations (Transformers ✓)
- Batch processing (Transformers ✓)
- Regular memory access patterns (Transformers ✓)
Algorithms that struggle:
- Sparse operations (Graph networks, dynamic architectures)
- Sequential dependencies (RNNs, some reinforcement learning)
- Memory-intensive operations (Large vocabulary models)
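What this rewards, concretely, is regular, dense computation. A NumPy sketch (toy sizes, illustrative only) computing the same product two ways: as one dense matmul that maps to a single parallel kernel, and as many small dependent steps of the kind that leave a GPU underutilized:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((256, 256))
B = rng.standard_normal((256, 256))

# Dense and regular: one call, one big parallel kernel on a GPU
C_dense = A @ B

# Irregular: the same result as many small row-sized operations driven by
# host-level control flow -- the access pattern GPUs reward least
C_rows = np.empty_like(C_dense)
for i in range(A.shape[0]):
    C_rows[i] = A[i] @ B
```

Both produce the same numbers; only the first shape of computation exploits the hardware.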
## The Implications

### Research Is Biased
Researchers gravitate toward what works on available hardware:

```
Researcher has:         NVIDIA GPUs
Researcher tries:       Dense architectures, Transformers
Researcher succeeds:    More papers on Transformers
Researcher doesn't try: Sparse architectures
Community:              "Transformers are the best!"
```
The search space is constrained by hardware.
### Alternative Ideas Get Abandoned
Interesting approaches that don’t map to GPUs:
- Spiking neural networks
- Neuromorphic computing
- Sparse mixture of experts (until recently)
- Memory-augmented networks
These might be better—we don’t know because they’re too slow to test at scale.
### Infrastructure Lock-in
We’ve invested billions in:
- NVIDIA ecosystem
- CUDA software stack
- Transformer-optimized TPUs
Switching to different architectures means:
- Rewriting software
- Redesigning hardware
- Lost efficiency from specialization
## Emerging Alternatives

### Specialized AI Chips
| Hardware | Optimized For |
|---|---|
| TPU | Matrix ops, Transformers |
| Cerebras | Large models, sparse |
| Graphcore | Graph operations |
| Groq | Inference |
More diverse hardware could enable more diverse algorithms.
### Sparse Architectures

Recent work on efficient Transformers replaces dense O(n²) attention with sparse attention that costs O(n·k). A runnable NumPy sketch of the difference (the token-selection rule below is fixed purely for illustration; real sparse-attention models learn or pattern it):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n, d, k = 16, 8, 4                  # sequence length, head dim, sparsity budget
Q, K, V = np.random.default_rng(0).standard_normal((3, n, d))

# Dense attention: every query attends to all n keys -> O(n^2)
dense = softmax(Q @ K.T / np.sqrt(d)) @ V

# Sparse attention: each query attends to only k < n selected tokens -> O(n * k)
selected = np.arange(k)             # stand-in for a real selection rule
sparse = softmax(Q @ K[selected].T / np.sqrt(d)) @ V[selected]
```
Hardware support for sparsity is improving.
### Mixture of Experts

```
Input → Router → Expert 1 (active)
               → Expert 2 (inactive)
               → Expert 3 (active)
               → ...
```
Only a few experts activate per input. More efficient if hardware supports it.
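The routing step above can be sketched in a few lines, assuming a linear router and one linear layer per expert (all names and sizes here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, k = 16, 4, 2          # toy sizes: hidden dim, expert count, top-k
x = rng.standard_normal(d)          # one input token

# Router: a linear layer scores the experts; softmax turns scores into weights
W_router = rng.standard_normal((d, n_experts))
logits = x @ W_router
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Top-k gating: only the k highest-scoring experts run; the rest are skipped
active = np.argsort(probs)[-k:]

# Each expert is its own small network (here just one linear layer each)
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
out = sum(probs[i] * (x @ experts[i]) for i in active)
```

Because only `k` of the `n_experts` weight matrices are touched per input, compute scales with `k` rather than with total parameter count, but only if the hardware can actually skip the inactive experts.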
## What Should Change

### 1. Diverse Hardware Investment
Don’t put all resources into one architecture. Keep alternatives viable.
### 2. Hardware-Agnostic Benchmarks
Measure algorithms on theoretical compute, not just wall-clock time on GPUs.
### 3. Simulation Over Hardware
Invest in simulators that can test novel architectures before building chips.
### 4. Acknowledge the Bias
When claiming “SOTA,” acknowledge the hardware assumptions. Results might not generalize.
## The Lesson
AI progress isn’t purely about algorithmic innovation. It’s about:
```
Algorithm × Hardware × Scale = Success
```
The algorithms that win are often those that best exploit current hardware. That’s not the same as being the best algorithms.
## Final Thoughts
The hardware lottery explains:
- Why Transformers dominate (GPU-friendly)
- Why alternatives seem “fringe” (can’t test at scale)
- Why AI progress might be hitting walls (hardware limits)
As we approach the limits of current architectures, new hardware could unlock new algorithmic frontiers—or we could be stuck waiting for the next hardware shift.
Your favorite AI might just be a lottery winner.
The best ideas don’t always win. The ones that fit the hardware do.