Self-Supervised Learning: The Future of AI?
Yann LeCun calls self-supervised learning “the dark matter of intelligence.” While supervised learning dominates production AI, self-supervised approaches are reshaping what’s possible. Here’s why it matters.
The Label Problem
Supervised Learning Bottleneck
Training a supervised model requires two steps:
- Collecting data ✓ (cheap, abundant)
- Labeling data ✗ (expensive, slow, error-prone)
ImageNet: 14 million images, years of labeling effort. And that’s just image classification.
The Web Has Data, Not Labels
Available on the internet:
- Billions of images
- Trillions of words
- Millions of hours of video
Labels for that data:
- Almost none
What is Self-Supervised Learning?
Self-supervised learning creates its own supervision signal from the data:
# Supervised: Need explicit labels
image → [model] → "cat" (labeled by human)
# Self-supervised: Create task from data itself
[first half of sentence] → [model] → [second half of sentence]
Key Insight
Hide part of the data. Train the model to predict it.
Approaches
Language: Masked Prediction
BERT and GPT use language’s structure:
# BERT: Predict masked words
input = "The cat sat on the [MASK]"
target = "mat"
# GPT: Predict next word
input = "The cat sat on"
target = "the"
No human labels needed—the text labels itself.
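A toy sketch of how such training pairs fall out of raw text (the helper name and mask rate are illustrative, not BERT's actual recipe, which also sometimes replaces tokens with random words):

```python
import random

def make_masked_lm_pair(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Create a BERT-style training pair from raw tokens: mask some
    positions and record the original words as targets."""
    rng = random.Random(seed)
    inputs, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            inputs.append(mask_token)
            targets[i] = tok  # the data supplies its own label
        else:
            inputs.append(tok)
    return inputs, targets

inputs, targets = make_masked_lm_pair("the cat sat on the mat".split(), mask_rate=0.5)
```

Every masked position's target comes from the sentence itself, so an unlabeled corpus yields unlimited training pairs.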
Images: Contrastive Learning
SimCLR, MoCo, BYOL learn visual representations:
# Create two views of same image
view1 = augment(image) # crop, color jitter, flip
view2 = augment(image) # different augmentation
# Train: view1 and view2 should have similar representations
# Different images should have different representations
Image → [augment] → View 1 ─┐
                            ├→ Similar embeddings
Image → [augment] → View 2 ─┘
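A minimal sketch of the two-view setup, with toy NumPy "augmentations" (random crop plus horizontal flip; real SimCLR pipelines also use color jitter and blur):

```python
import numpy as np

def augment(image, rng, crop=24):
    """Toy augmentation: random crop plus random horizontal flip."""
    h, w = image.shape[:2]
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    view = image[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        view = view[:, ::-1]  # horizontal flip
    return view

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))   # stand-in for a real image
view1 = augment(image, rng)       # two different random views
view2 = augment(image, rng)       # of the same underlying image
```

The model never sees a label; the only supervision is "these two views came from the same image."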
Images: Masked Autoencoders
MAE (2021) applies BERT’s approach to images:
Image → [mask 75% of patches] → [encoder] → [decoder] → Reconstruct masked patches
Surprisingly effective. Simple and scalable.
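The masking step can be sketched as follows, assuming the image has already been split into a flat array of patches (the 75% ratio is MAE's default; the helper itself is illustrative):

```python
import numpy as np

def mask_patches(patches, mask_ratio=0.75, seed=0):
    """Split patches into a small visible set (fed to the encoder) and
    a large masked set (the reconstruction targets), MAE-style."""
    rng = np.random.default_rng(seed)
    n = len(patches)
    n_keep = int(n * (1 - mask_ratio))
    perm = rng.permutation(n)
    keep_idx = np.sort(perm[:n_keep])   # indices the encoder sees
    mask_idx = np.sort(perm[n_keep:])   # indices the decoder must reconstruct
    return patches[keep_idx], mask_idx

# 196 patches (a 14x14 grid), each 16x16 pixels x 3 channels, flattened
patches = np.zeros((196, 16 * 16 * 3))
visible, masked = mask_patches(patches)
```

Because the encoder only processes the 25% of patches that remain visible, pre-training is cheap relative to processing full images.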
Video: Temporal Prediction
Frame 1, 2, 3 → [model] → Predict Frame 4
Video provides free supervision through time.
Audio: Wav2Vec
Audio waveform → [mask portions] → Predict masked audio
Same principle, different modality.
Why This Works
Learning Structure
By predicting masked content, models learn:
- Syntax and semantics (language)
- Visual patterns and objects (images)
- Physical dynamics (video)
Abundant Data
Self-supervised learning scales with data:
More data → Better representations → Better downstream performance
No labeling bottleneck.
Transfer Learning
Pre-train once, fine-tune many times:
# Step 1: Self-supervised pre-training on huge dataset
base_model = pretrain_self_supervised(internet_data)
# Step 2: Fine-tune on small labeled dataset
classifier = finetune(base_model, small_labeled_data)
Results
Language
| Model | Pre-training | Downstream Tasks |
|---|---|---|
| BERT | Masked LM | SOTA on 11 NLP tasks |
| GPT-3 | Next word | Few-shot everything |
| T5 | Span prediction | Generation + classification |
Vision
| Model | Approach | ImageNet Accuracy |
|---|---|---|
| SimCLR | Contrastive | 76.5% (linear probe) |
| BYOL | Non-contrastive | 79.6% |
| MAE | Masked autoencoder | 87.8% (fine-tuned) |
Self-supervised pre-training now matches, and in some cases exceeds, supervised pre-training.
The Path Forward
Yann LeCun’s View
LeCun argues supervised learning is a dead end for general intelligence:
Supervised: Learn from 10^4 labeled examples
Self-supervised: Learn from 10^12 unlabeled examples
Which scales to human-level understanding?
Energy-Based Models
LeCun proposes energy-based models for prediction:
- Predict compatible futures
- Multiple valid predictions allowed
- More robust learning
World Models
Self-supervised learning enables “world models”:
Observation → [World Model] → Prediction of next state
Like humans:
- Predict consequences of actions
- Plan by imagining futures
- Learn from observation
Practical Applications (2021)
NLP
BERT, GPT, and variants dominate:
- Sentiment analysis
- Question answering
- Text generation
- Translation
Computer Vision
Still catching up, but:
- Few-shot learning improved
- Medical imaging with limited labels
- Robotics perception
Speech
Wav2Vec enables:
- Low-resource language ASR
- Accent robustness
- Multilingual transfer
Implementation
Using Pre-trained Models
from transformers import AutoModel, AutoTokenizer
# Self-supervised pre-trained model
model = AutoModel.from_pretrained('bert-base-uncased')
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
# Get embeddings for downstream use
inputs = tokenizer("Hello world", return_tensors="pt")
outputs = model(**inputs)
embeddings = outputs.last_hidden_state
Training Your Own (Simplified)
import torch.nn as nn

class SimpleCLR(nn.Module):
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder  # e.g. a ResNet-50 trunk with 2048-d output
        self.projector = nn.Sequential(
            nn.Linear(2048, 512),
            nn.ReLU(),
            nn.Linear(512, 128),
        )

    def forward(self, x1, x2):
        # x1, x2: two augmented views of the same batch of images
        z1 = self.projector(self.encoder(x1))
        z2 = self.projector(self.encoder(x2))
        return contrastive_loss(z1, z2)  # e.g. NT-Xent; defined elsewhere
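The `contrastive_loss` above is left undefined; here is a minimal NumPy sketch of the NT-Xent objective SimCLR uses (a real implementation would be in PyTorch and handle numerical stability more carefully):

```python
import numpy as np

def contrastive_loss(z1, z2, temperature=0.5):
    """NT-Xent-style loss sketch: each row of z1 should be close to the
    matching row of z2 and far from every other row in the batch."""
    z = np.concatenate([z1, z2])                       # (2N, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit vectors
    sim = z @ z.T / temperature                        # scaled cosine similarities
    np.fill_diagonal(sim, -np.inf)                     # exclude self-pairs
    n = len(z1)
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])  # partner indices
    log_prob = sim[np.arange(2 * n), pos] - np.log(np.exp(sim).sum(axis=1))
    return -log_prob.mean()

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8))
loss_matched = contrastive_loss(z, z)                      # identical views
loss_random = contrastive_loss(z, rng.normal(size=(4, 8)))  # unrelated views
```

As a sanity check, identical views should incur a lower loss than unrelated ones, since their similarity is maximal.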
Challenges
Compute Requirements
Self-supervised models need:
- Massive datasets
- Extensive compute
- Careful hyperparameter tuning
Evaluation
How do you evaluate without labels?
- Linear probe accuracy
- Transfer learning performance
- Downstream task benchmarks
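A linear probe can be sketched as follows, using least squares for simplicity (standard probes fit logistic regression on the frozen features; the synthetic clusters below stand in for learned embeddings):

```python
import numpy as np

def linear_probe_accuracy(train_x, train_y, test_x, test_y, n_classes):
    """Fit a linear classifier on frozen features and report test accuracy.
    The encoder is never updated; only this linear head is trained."""
    add_bias = lambda x: np.hstack([x, np.ones((len(x), 1))])
    onehot = np.eye(n_classes)[train_y]
    w, *_ = np.linalg.lstsq(add_bias(train_x), onehot, rcond=None)
    preds = (add_bias(test_x) @ w).argmax(axis=1)
    return (preds == test_y).mean()

# Synthetic "features": two well-separated Gaussian clusters
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, (50, 8)), rng.normal(4, 1, (50, 8))])
y = np.array([0] * 50 + [1] * 50)
acc = linear_probe_accuracy(x[::2], y[::2], x[1::2], y[1::2], 2)
```

If the self-supervised representations are good, even this frozen-feature classifier scores well, which is exactly what the linear-probe numbers in the tables above measure.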
Not a Silver Bullet
Still need labeled data for:
- Domain-specific tasks
- Safety-critical applications
- Fine-grained distinctions
Final Thoughts
Self-supervised learning is how AI will learn from the world’s data. The labels were always there—in the structure of the data itself.
Watch this space. The future of AI isn’t labeled.
The best supervisor is the data itself.