Deep Learning's Hardware Lottery


A 2020 paper by Sara Hooker coined the term “hardware lottery”: the idea that the machine learning algorithms that win are often those that happen to run efficiently on the hardware available at the time. This explains more about AI’s trajectory than we might like to admit.

The Thesis

“Whether a research idea succeeds or fails often has more to do with whether it fits onto existing hardware than whether it’s the best approach.”

Transformers didn’t just win because attention is better than recurrence. They won because matrix multiplications map perfectly to GPUs.

Historical Examples

Neural Networks’ “Death”

1990s-2000s: Neural networks were considered dead.

Backpropagation → Complex       → Too slow on CPUs
SVMs             → Kernel trick → Efficient on CPUs

                    SVMs "won"

Then GPUs became available for general computing:

Neural networks + GPUs → Fast again → Deep learning revolution

The math never changed. The hardware did.

Transformers vs RNNs

RNNs process sequences step-by-step:

Token 1 → Hidden state → Token 2 → Hidden state → ...

Inherently sequential. Hard to parallelize.

Transformers use attention:

All tokens → Matrix multiplication → Output

Massively parallel. Perfect for GPUs.

RNN on GPU:     Limited speedup (sequential dependency)
Transformer:    Massive speedup (parallelizable)

Transformers map better to available hardware.
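The contrast above can be sketched in a few lines of NumPy (a toy sketch with illustrative dimensions and random weights, not a real model): the RNN must run a Python-level loop because each hidden state depends on the previous one, while self-attention processes the whole sequence in a handful of matrix multiplications.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 4                      # sequence length, hidden size
x = rng.standard_normal((n, d))  # toy token embeddings

# RNN: each step depends on the previous hidden state -> sequential loop
W, U = rng.standard_normal((d, d)), rng.standard_normal((d, d))
h = np.zeros(d)
for t in range(n):               # cannot parallelize across t
    h = np.tanh(x[t] @ W + h @ U)

# Self-attention: every token attends to every token -> one batched matmul
Q, K, V = x, x, x                # identity projections for simplicity
scores = Q @ K.T / np.sqrt(d)
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
out = weights @ V                # all n outputs computed at once
```

The loop has n data-dependent steps; the attention path is three matmuls a GPU can saturate regardless of n.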

GPU Architecture Matters

What GPUs Do Well

Operation          GPU Speed
Matrix multiply    Very fast
Element-wise ops   Very fast
Random access      Slow
Branching          Slow
Memory-bound ops   Bottleneck

What This Rewards

Algorithms that work best:

- Dense matrix multiplications
- Regular, predictable memory access
- Uniform operations with little branching

Algorithms that struggle:

- Sparse or irregular computation
- Data-dependent branching
- Random memory access patterns
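As an illustration of the table above (a toy NumPy sketch on CPU, not a benchmark): an element-wise operation runs as one uniform vectorized kernel, while per-element branching forces either a slow interpreted loop or a rewrite into branch-free form such as `np.where`.

```python
import numpy as np

x = np.linspace(-2.0, 2.0, 100_000)

# Hardware-friendly: element-wise op, same instruction on every element
y_fast = np.maximum(x, 0.0)            # ReLU as one vectorized kernel

# Branch-heavy version: per-element if/else, hard to vectorize
y_slow = np.array([v if v > 0 else 0.0 for v in x])

# Branch-free rewrite: same result, expressed as a data-parallel select
y_where = np.where(x > 0, x, 0.0)
```

All three compute the same function; only the branch-free forms map onto the parallel hardware the table describes.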

The Implications

Research is Biased

Researchers gravitate toward what works on available hardware:

Researcher has: NVIDIA GPUs
Researcher tries: Dense architectures, Transformers
Researcher succeeds: More papers on Transformers
Researcher doesn't try: Sparse architectures
Community: "Transformers are the best!"

The search space is constrained by hardware.

Alternative Ideas Get Abandoned

Interesting approaches that don’t map well to GPUs:

- Sparse networks, where most weights are zero and skippable
- Spiking neural networks, which compute with asynchronous events
- Capsule networks, with data-dependent routing between units

These might be better; we don’t know, because they’re too slow to test at scale on today’s hardware.

Infrastructure Lock-in

We’ve invested billions in:

- GPU data centers
- CUDA and GPU-optimized software stacks
- Frameworks tuned for dense GPU kernels

Switching to different architectures means:

- Rewriting that software stack
- Retraining engineers and researchers
- Accepting slower progress while new tooling matures

Emerging Alternatives

Specialized AI Chips

Hardware     Optimized For
TPU          Matrix ops, Transformers
Cerebras     Large models, sparsity
Graphcore    Graph operations
Groq         Inference

More diverse hardware could enable more diverse algorithms.

Sparse Architectures

Recent work on efficient Transformers:

# Dense attention: O(n²)
attention = softmax(Q @ K.T) @ V

# Sparse attention: O(n * k)
# Only attend to k < n tokens
attention = sparse_softmax(Q @ K[selected].T) @ V[selected]

Hardware support for sparsity is improving.
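The pseudocode above can be made concrete with NumPy (a toy sketch: `selected` here is a hypothetical fixed index set, whereas real sparse-attention schemes pick it with learned or structural patterns):

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
n, d, k = 16, 8, 4
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

# Dense attention: scores for all n*n query-key pairs -> O(n^2)
dense = softmax(Q @ K.T / np.sqrt(d)) @ V

# Sparse attention: each query attends to only k selected keys -> O(n*k)
selected = np.arange(k)          # toy selection: the first k tokens
sparse = softmax(Q @ K[selected].T / np.sqrt(d)) @ V[selected]
```

Both produce one output vector per token; the sparse path simply never materializes the full n-by-n score matrix.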

Mixture of Experts

Input → Router → Expert 1 (active)
              → Expert 2 (inactive)
              → Expert 3 (active)
              → ...

Only a few experts activate per input. More efficient if hardware supports it.
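A minimal sketch of the routing idea (toy NumPy code; real mixture-of-experts layers use learned gating networks and load-balancing losses, both omitted here):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_experts, top_k = 8, 4, 2
x = rng.standard_normal(d)                        # one input token

# Each "expert" here is just a small weight matrix
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]

# Router: score every expert, but activate only the top-k
router_w = rng.standard_normal((d, n_experts))
scores = x @ router_w
active = np.argsort(scores)[-top_k:]              # indices of top-k experts
gates = np.exp(scores[active]) / np.exp(scores[active]).sum()

# Output is a gated sum over the few active experts only
out = sum(g * (x @ experts[i]) for g, i in zip(gates, active))
```

The inactive experts cost nothing per token, which is exactly the kind of sparsity that pays off only when the hardware can skip the dormant weights.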

What Should Change

1. Diverse Hardware Investment

Don’t put all resources into one architecture. Keep alternatives viable.

2. Hardware-Agnostic Benchmarks

Measure algorithms on theoretical compute, not just wall-clock time on GPUs.
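One hedged way to do this is to compare operation counts rather than GPU wall-clock time. As a rough, illustrative estimate (counting only the two attention matmuls, ignoring the softmax and projections):

```python
def attention_flops(n, d, k=None):
    """Rough FLOP count for one attention layer's two matmuls.

    n: sequence length, d: head dimension,
    k: number of attended tokens (None = dense, attend to all n).
    """
    attended = n if k is None else k
    score_flops = 2 * n * attended * d    # Q @ K.T
    value_flops = 2 * n * attended * d    # weights @ V
    return score_flops + value_flops

n, d = 4096, 64
dense_cost = attention_flops(n, d)        # grows as n^2
sparse_cost = attention_flops(n, d, 256)  # grows as n*k
```

A count like this is hardware-agnostic by construction; the trade-off is that it ignores exactly the memory-access and parallelism effects the lottery is about, so it complements rather than replaces real measurements.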

3. Simulation Over Hardware

Invest in simulators that can test novel architectures before building chips.

4. Acknowledge the Bias

When claiming “SOTA,” acknowledge the hardware assumptions. Results might not generalize.

The Lesson

AI progress isn’t purely about algorithmic innovation. It’s about:

Algorithm × Hardware × Scale = Success

The algorithms that win are often those that best exploit current hardware. That’s not the same as being the best algorithms.

Final Thoughts

The hardware lottery explains:

- Why neural networks lay dormant until GPUs arrived
- Why Transformers displaced RNNs so quickly
- Why alternatives like sparse and spiking networks remain untested at scale

As we approach the limits of current architectures, new hardware could unlock new algorithmic frontiers—or we could be stuck waiting for the next hardware shift.

Your favorite AI might just be a lottery winner.


The best ideas don’t always win. The ones that fit the hardware do.
