Foundation Models: A Paradigm Shift
Stanford researchers coined the term “foundation models” to describe large pre-trained models like GPT-3 and BERT. It’s not just a new name—it’s a paradigm shift in how we build AI.
What Are Foundation Models?
Foundation models are:
- Large: Billions of parameters
- Pre-trained: On massive datasets
- General: Adaptable to many tasks
- Foundational: The base for downstream applications
Old Paradigm:
Task → Collect data → Train model → Deploy
(Repeat for each task)
Foundation Model Paradigm:
Massive data → Train foundation model → Fine-tune for tasks → Deploy
                        ↓
        Same base, different applications
Examples
| Model | Domain | Parameters | Training Data |
|---|---|---|---|
| GPT-3 | Language | 175B | Internet text |
| BERT | Language | 340M | Books + Wikipedia |
| CLIP | Vision+Language | ~400M | 400M image-text pairs |
| DALL-E | Image generation | 12B | Image-text pairs |
| Codex | Code | 12B | GitHub code |
Why “Foundation”?
The term emphasizes:
Centrality
Everything builds on them:
Foundation Model (GPT-3)
├── Chatbots
├── Code generation
├── Content writing
├── Summarization
├── Translation
└── And more...
Homogenization
Previously: Different model for each task. Now: Same base model, different prompts/fine-tuning.
```python
# Same model, different tasks
summarize = model("Summarize this: [text]")
translate = model("Translate to French: [text]")
code = model("Write Python code to: [task]")
```
Power Concentration
Few organizations can train them:
- OpenAI (GPT-3, DALL-E)
- Google (BERT, T5, LaMDA)
- Meta (OPT, later LLaMA)
- Anthropic (later Claude)
The rest fine-tune.
Emergent Capabilities
Foundation models exhibit unexpected abilities:
Few-Shot Learning
Task: Translate English to French
Example: "dog" → "chien"
Example: "cat" → "chat"
Example: "house" → ?
Model output: "maison"
No fine-tuning required—just examples in the prompt.
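The pattern above can be assembled programmatically. A minimal sketch (the prompt format and example pairs are illustrative; any text-completion API would consume the resulting string):

```python
def build_few_shot_prompt(task, examples, query):
    """Assemble a few-shot prompt: task description, worked examples, then the query."""
    lines = [task]
    for source, target in examples:
        lines.append(f'"{source}" -> "{target}"')
    lines.append(f'"{query}" -> ')
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    task="Translate English to French:",
    examples=[("dog", "chien"), ("cat", "chat")],
    query="house",
)
print(prompt)
```

The model completes the final line by continuing the pattern it sees in the examples.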
Chain-of-Thought
Q: If I have 3 apples and buy 2 more, then give away 1, how many do I have?
Model: Let's think step by step.
- Start with 3 apples
- Buy 2 more: 3 + 2 = 5
- Give away 1: 5 - 1 = 4
- I have 4 apples.
Reasoning emerges at scale.
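In practice, chain-of-thought behavior is often triggered by appending a reasoning cue to the question. A sketch of that wrapper (the function name is hypothetical; the output string would be sent to any completion API):

```python
COT_CUE = "Let's think step by step."

def with_chain_of_thought(question):
    """Wrap a question in a Q/A frame and append a reasoning cue."""
    return f"Q: {question}\nA: {COT_CUE}"

prompt = with_chain_of_thought(
    "If I have 3 apples and buy 2 more, then give away 1, how many do I have?"
)
print(prompt)
```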
Cross-Domain Transfer
CLIP connects images and text:
Image of a cat + "a photo of a cat" → High similarity
Image of a cat + "a photo of a dog" → Low similarity
Learned from image-text pairs without explicit labels.
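CLIP-style matching reduces to comparing embedding vectors by cosine similarity. A sketch with toy 3-dimensional vectors (real CLIP embeddings are learned and typically 512-dimensional; these numbers are made up for illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings: the cat image aligns with the cat caption, not the dog one.
image_cat = [0.9, 0.1, 0.1]
text_cat = [0.8, 0.2, 0.1]
text_dog = [0.1, 0.9, 0.2]

print(cosine_similarity(image_cat, text_cat))  # high
print(cosine_similarity(image_cat, text_dog))  # low
```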
The Emergence Story
Why Bigger = Better
| Model Size | Capabilities |
|---|---|
| 1M params | Basic patterns |
| 100M params | Task-specific abilities |
| 1B params | Few-shot learning |
| 100B+ params | Emergent reasoning |
Scaling unlocks capabilities that smaller models don’t have.
The Scaling Laws
OpenAI researchers (Kaplan et al., 2020) found predictable power-law relationships: test loss falls smoothly as each resource grows:
L(N) ∝ N^(−α_N),  L(D) ∝ D^(−α_D),  L(C) ∝ C^(−α_C)
where N is parameter count, D is dataset size, and C is training compute. More of each = better performance, predictably.
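One published form of these laws expresses test loss as a power law in parameter count. A sketch using constants in the spirit of the Kaplan et al. (2020) fit, here purely for illustration:

```python
def predicted_loss(n_params, n_c=8.8e13, alpha_n=0.076):
    """Power-law scaling of loss with parameter count: L(N) = (N_c / N)^alpha_N.
    n_c and alpha_n are illustrative constants, not a fit you should rely on."""
    return (n_c / n_params) ** alpha_n

for n in (1e6, 1e9, 1e11):
    print(f"{n:.0e} params -> predicted loss {predicted_loss(n):.2f}")
```

The point is the shape, not the numbers: loss keeps dropping smoothly as parameters grow, which is why scaling curves were trusted to extrapolate.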
Adaptation Methods
Fine-Tuning
Train on task-specific data:
```python
# Fine-tune GPT for sentiment (illustrative pseudo-API, not a real library call)
model.fine_tune(
    data=[
        ("Great product!", "positive"),
        ("Terrible experience", "negative"),
    ]
)
```
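Fine-tuning services typically ingest training examples as JSON Lines. A sketch that serializes labeled pairs into that shape (the `prompt`/`completion` field names follow OpenAI's classic fine-tuning format; adjust to your provider):

```python
import json

examples = [
    ("Great product!", "positive"),
    ("Terrible experience", "negative"),
]

# One JSON object per line: the classic prompt/completion fine-tuning format.
with open("sentiment.jsonl", "w") as f:
    for text, label in examples:
        record = {"prompt": f"Review: {text}\nSentiment:", "completion": f" {label}"}
        f.write(json.dumps(record) + "\n")
```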
Prompting
Zero-shot task specification:
Classify the sentiment of this review as positive or negative.
Review: "The movie was absolutely fantastic!"
Sentiment:
In-Context Learning
Provide examples in prompt:
Review: "Loved it!" → Sentiment: positive
Review: "Waste of money" → Sentiment: negative
Review: "It was okay" → Sentiment:
Risks and Concerns
Bias Amplification
Models learn biases from training data:
Prompt: "The CEO walked into the room. He"
→ Model assumes male
Prompt: "The nurse walked into the room. She"
→ Model assumes female
At scale, biases spread to all applications.
Environmental Cost
Training GPT-3:
- ~1,300 MWh of electricity (estimated)
- ~552 tonnes of CO2-equivalent
- Estimated compute cost: $4.6 million+
Misinformation
Models can generate convincing false content:
Prompt: "Write a news article about [false event]"
→ Realistic-looking misinformation
Homogenization Risk
If everyone uses the same base:
- Same biases propagate
- Same failure modes
- Reduced diversity
Implications for Developers
The API Era
```python
# Don't train—call an API (legacy OpenAI completions interface)
import openai

response = openai.Completion.create(
    engine="text-davinci-003",
    prompt="Generate a product description for...",
    max_tokens=200,
)
```
Prompt Engineering
New skill: writing good prompts
```
# Bad prompt
"Write something about dogs"

# Good prompt
"Write a 100-word engaging blog post introduction about
the health benefits of owning a dog. Use a friendly,
conversational tone. Include one surprising statistic."
```
Fine-Tuning as Customization
```python
# Fine-tune for your domain (illustrative pseudo-API, not a real library call)
model.fine_tune(
    training_data="company_documents.jsonl",
    base_model="gpt-3.5-turbo",
    epochs=3,
)
```
The Road Ahead
Multi-Modal Foundation
Models that understand:
- Text + Images (GPT-4V)
- Text + Code (Codex)
- Text + Audio + Video (coming)
Specialized Foundations
Domain-specific models:
- Medical: Med-PaLM
- Legal: Legal-BERT
- Scientific: Galactica
Open Foundations
Open-source alternatives:
- Meta’s OPT, LLaMA
- EleutherAI’s GPT-Neo
- Stability AI’s models
Final Thoughts
Foundation models are the new infrastructure of AI. Like operating systems or cloud platforms, they’re the base on which applications are built.
The paradigm shift: From training models to prompting/adapting them.
Learn to build on foundations. It’s where AI development is heading.
Stand on the shoulders of giants—billion-parameter giants.