Self-Supervised Learning: The Future of AI?
Yann LeCun calls self-supervised learning “the dark matter of intelligence.” While supervised learning dominates production AI, self-supervised approaches are reshaping what’s possible. Here’s why it matters.
The Label Problem
Supervised Learning Bottleneck
Training a supervised model requires two steps:
- Collecting data ✓ (cheap, abundant)
- Labeling data ✗ (expensive, slow, error-prone)
ImageNet: 14 million images, years of labeling effort. And that’s just image classification.
The Web Has Data, Not Labels
Available on the internet:
- Billions of images
- Trillions of words
- Millions of hours of video
Labels for that data:
- Almost none
What is Self-Supervised Learning?
Self-supervised learning creates its own supervision signal from the data:
# Supervised: Need explicit labels
image → [model] → "cat" (labeled by human)
# Self-supervised: Create task from data itself
[first half of sentence] → [model] → [second half of sentence]
Key Insight
Hide part of the data. Train the model to predict it.
Approaches
Language: Masked Prediction
BERT and GPT use language’s structure:
# BERT: Predict masked words
input = "The cat sat on the [MASK]"
target = "mat"
# GPT: Predict next word
input = "The cat sat on"
target = "the"
No human labels needed—the text labels itself.
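A toy sketch of how such training pairs fall out of raw text (the helper name and mask rate are illustrative, not BERT's actual recipe, which also sometimes replaces tokens with random words):

```python
import random

def make_masked_lm_pair(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Create a BERT-style training pair from raw tokens: mask some
    positions and record the original words as targets."""
    rng = random.Random(seed)
    inputs, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            inputs.append(mask_token)
            targets[i] = tok  # the data supplies its own label
        else:
            inputs.append(tok)
    return inputs, targets

inputs, targets = make_masked_lm_pair("the cat sat on the mat".split(), mask_rate=0.5)
```

Every masked position's target comes from the sentence itself, so an unlabeled corpus yields unlimited training pairs.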
Images: Contrastive Learning
SimCLR, MoCo, BYOL learn visual representations:
# Create two views of same image
view1 = augment(image) # crop, color jitter, flip
view2 = augment(image) # different augmentation
# Train: view1 and view2 should have similar representations
# Different images should have different representations
Image → [augment] → View 1 ─┐
                            ├→ Similar embeddings
Image → [augment] → View 2 ─┘
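A minimal sketch of the two-view setup, with toy NumPy "augmentations" (random crop plus horizontal flip; real SimCLR pipelines also use color jitter and blur):

```python
import numpy as np

def augment(image, rng, crop=24):
    """Toy augmentation: random crop plus random horizontal flip."""
    h, w = image.shape[:2]
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    view = image[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        view = view[:, ::-1]  # horizontal flip
    return view

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))   # stand-in for a real image
view1 = augment(image, rng)       # two different random views
view2 = augment(image, rng)       # of the same underlying image
```

The model never sees a label; the only supervision is "these two views came from the same image."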
Images: Masked Autoencoders
MAE (2021) applies BERT’s approach to images:
Image → [mask 75% of patches] → [encoder] → [decoder] → Reconstruct masked patches
Surprisingly effective. Simple and scalable.
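The masking step can be sketched as follows, assuming the image has already been split into a flat array of patches (the 75% ratio is MAE's default; the helper itself is illustrative):

```python
import numpy as np

def mask_patches(patches, mask_ratio=0.75, seed=0):
    """Split patches into a small visible set (fed to the encoder) and
    a large masked set (the reconstruction targets), MAE-style."""
    rng = np.random.default_rng(seed)
    n = len(patches)
    n_keep = int(n * (1 - mask_ratio))
    perm = rng.permutation(n)
    keep_idx = np.sort(perm[:n_keep])   # indices the encoder sees
    mask_idx = np.sort(perm[n_keep:])   # indices the decoder must reconstruct
    return patches[keep_idx], mask_idx

# 196 patches (a 14x14 grid), each 16x16 pixels x 3 channels, flattened
patches = np.zeros((196, 16 * 16 * 3))
visible, masked = mask_patches(patches)
```

Because the encoder only processes the 25% of patches that remain visible, pre-training is cheap relative to processing full images.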
Video: Temporal Prediction
Frame 1, 2, 3 → [model] → Predict Frame 4
Video provides free supervision through time.
Audio: Wav2Vec
Audio waveform → [mask portions] → Predict masked audio
Same principle, different modality.
Why This Works
Learning Structure
By predicting masked content, models learn:
- Syntax and semantics (language)
- Visual patterns and objects (images)
- Physical dynamics (video)
Abundant Data
Self-supervised learning scales with data:
More data → Better representations → Better downstream performance
No labeling bottleneck.
Transfer Learning
Pre-train once, fine-tune many times:
# Step 1: Self-supervised pre-training on huge dataset
base_model = pretrain_self_supervised(internet_data)
# Step 2: Fine-tune on small labeled dataset
classifier = finetune(base_model, small_labeled_data)
Results
Language
| Model | Pre-training | Downstream Tasks |
|---|---|---|
| BERT | Masked LM | SOTA on 11 NLP tasks |
| GPT-3 | Next word | Few-shot everything |
| T5 | Span prediction | Generation + classification |
Vision
| Model | Approach | ImageNet Accuracy |
|---|---|---|
| SimCLR | Contrastive | 76.5% (linear probe) |
| BYOL | Non-contrastive | 79.6% |
| MAE | Masked autoencoder | 87.8% (fine-tuned) |
Self-supervised pre-training now matches, and in some cases exceeds, supervised pre-training.
The Path Forward
Yann LeCun’s View
LeCun argues supervised learning is a dead end for general intelligence:
Supervised: Learn from 10^4 labeled examples
Self-supervised: Learn from 10^12 unlabeled examples
Which scales to human-level understanding?
Energy-Based Models
LeCun proposes energy-based models for prediction:
- Predict compatible futures
- Multiple valid predictions allowed
- More robust learning
World Models
Self-supervised learning enables “world models”:
Observation → [World Model] → Prediction of next state
Like humans:
- Predict consequences of actions
- Plan by imagining futures
- Learn from observation
Practical Applications (2021)
NLP
BERT, GPT, and variants dominate:
- Sentiment analysis
- Question answering
- Text generation
- Translation
Computer Vision
Still catching up, but:
- Few-shot learning improved
- Medical imaging with limited labels
- Robotics perception
Speech
Wav2Vec enables:
- Low-resource language ASR
- Accent robustness
- Multilingual transfer
Implementation
Using Pre-trained Models
from transformers import AutoModel, AutoTokenizer
# Self-supervised pre-trained model
model = AutoModel.from_pretrained('bert-base-uncased')
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
# Get embeddings for downstream use
inputs = tokenizer("Hello world", return_tensors="pt")
outputs = model(**inputs)
embeddings = outputs.last_hidden_state
Training Your Own (Simplified)
import torch.nn as nn

class SimpleCLR(nn.Module):
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder  # e.g. a ResNet-50 trunk with 2048-d output
        self.projector = nn.Sequential(
            nn.Linear(2048, 512),
            nn.ReLU(),
            nn.Linear(512, 128),
        )

    def forward(self, x1, x2):
        # x1, x2: two augmented views of the same batch of images
        z1 = self.projector(self.encoder(x1))
        z2 = self.projector(self.encoder(x2))
        return contrastive_loss(z1, z2)  # e.g. NT-Xent; defined elsewhere
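The `contrastive_loss` above is left undefined; here is a minimal NumPy sketch of the NT-Xent objective SimCLR uses (a real implementation would be in PyTorch and handle numerical stability more carefully):

```python
import numpy as np

def contrastive_loss(z1, z2, temperature=0.5):
    """NT-Xent-style loss sketch: each row of z1 should be close to the
    matching row of z2 and far from every other row in the batch."""
    z = np.concatenate([z1, z2])                       # (2N, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit vectors
    sim = z @ z.T / temperature                        # scaled cosine similarities
    np.fill_diagonal(sim, -np.inf)                     # exclude self-pairs
    n = len(z1)
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])  # partner indices
    log_prob = sim[np.arange(2 * n), pos] - np.log(np.exp(sim).sum(axis=1))
    return -log_prob.mean()

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8))
loss_matched = contrastive_loss(z, z)                      # identical views
loss_random = contrastive_loss(z, rng.normal(size=(4, 8)))  # unrelated views
```

As a sanity check, identical views should incur a lower loss than unrelated ones, since their similarity is maximal.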
Challenges
Compute Requirements
Self-supervised models need:
- Massive datasets
- Extensive compute
- Careful hyperparameter tuning
Evaluation
How do you evaluate without labels?
- Linear probe accuracy
- Transfer learning performance
- Downstream task benchmarks
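A linear probe can be sketched as follows, using least squares for simplicity (standard probes fit logistic regression on the frozen features; the synthetic clusters below stand in for learned embeddings):

```python
import numpy as np

def linear_probe_accuracy(train_x, train_y, test_x, test_y, n_classes):
    """Fit a linear classifier on frozen features and report test accuracy.
    The encoder is never updated; only this linear head is trained."""
    add_bias = lambda x: np.hstack([x, np.ones((len(x), 1))])
    onehot = np.eye(n_classes)[train_y]
    w, *_ = np.linalg.lstsq(add_bias(train_x), onehot, rcond=None)
    preds = (add_bias(test_x) @ w).argmax(axis=1)
    return (preds == test_y).mean()

# Synthetic "features": two well-separated Gaussian clusters
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, (50, 8)), rng.normal(4, 1, (50, 8))])
y = np.array([0] * 50 + [1] * 50)
acc = linear_probe_accuracy(x[::2], y[::2], x[1::2], y[1::2], 2)
```

If the self-supervised representations are good, even this frozen-feature classifier scores well, which is exactly what the linear-probe numbers in the tables above measure.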
Not a Silver Bullet
Still need labeled data for:
- Domain-specific tasks
- Safety-critical applications
- Fine-grained distinctions
Final Thoughts
Self-supervised learning is how AI will learn from the world’s data. The labels were always there—in the structure of the data itself.
Watch this space. The future of AI isn’t labeled.
The best supervisor is the data itself.