Synthetic Data: Training Models when Web Data Runs Out
The scaling laws demand more data. But the internet is finite. We’ve scraped most of it. What now? Synthetic data—artificially generated training examples—is the emerging solution.
The Data Wall
How We Got Here
- GPT-3 (2020): Trained on ~500 billion tokens
- GPT-4 (2023): Trained on trillions of tokens
- 2025+: Running low on new, high-quality web data
The problem:
- Most quality text already used
- Repeated training on same data degrades models
- Legal constraints limiting data sources
- Quality matters more than quantity
The Math
Available high-quality text: ~10 trillion tokens
Current model training needs: 15+ trillion tokens
Gap: Growing
We need new data sources.
What is Synthetic Data?
Data generated by AI models to train other AI models:
Model A generates text → Text trains Model B
Types:
- Text generation: LLMs writing training data
- Instruction tuning: Generating question-answer pairs
- Reasoning chains: Creating step-by-step solutions
- Code synthesis: Generating programming examples
- Simulation: Generating edge cases and scenarios
How It Works
Self-Improvement Loop
def generate_synthetic_data(model, prompt_template, n_samples):
    samples = []
    for _ in range(n_samples):
        # Generate problem
        problem = model.generate(f"Generate a novel {prompt_template}")
        # Generate solution with reasoning
        solution = model.generate(f"Solve step-by-step: {problem}")
        # Verify (important!)
        if verify_solution(problem, solution):
            samples.append((problem, solution))
    return samples
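The loop above leaves `verify_solution` undefined. Here is a minimal sketch for one verifiable domain, arithmetic word problems, under two illustrative assumptions: the problem is a plain `+ - * /` expression, and the solution's final token is its numeric answer.

```python
import ast
import operator

# Hypothetical verifier for arithmetic problems: safely evaluate the
# expression and compare it against the model's final answer.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def safe_eval(expr: str) -> float:
    """Evaluate a +-*/ arithmetic expression without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError(f"unsupported expression: {expr}")
    return walk(ast.parse(expr, mode="eval"))

def verify_solution(problem: str, solution: str) -> bool:
    """Keep a sample only if the final number matches the ground truth."""
    try:
        expected = safe_eval(problem)
        answer = float(solution.strip().split()[-1])  # last token = answer
        return abs(answer - expected) < 1e-6
    except (ValueError, SyntaxError, IndexError):
        return False
```

Real pipelines swap in domain-specific checkers (test suites for code, theorem provers for logic); the structure stays the same.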
Filtering and Verification
Not all synthetic data is equal:
def quality_filter(sample):
    # Check for correctness (where verifiable)
    if is_math_problem(sample):
        return verify_math_answer(sample)
    # Check for diversity (avoid repetition)
    if too_similar_to_existing(sample):
        return False
    # Check for coherence
    if not is_coherent(sample):
        return False
    return True
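The de-duplication check can be sketched concretely. This version uses character-level similarity from the standard library; production pipelines typically use embedding or MinHash similarity instead, so treat the threshold as illustrative.

```python
import difflib

def too_similar_to_existing(sample: str, existing: list[str],
                            threshold: float = 0.9) -> bool:
    """Rough near-duplicate check via difflib's character-level ratio.
    Returns True if the sample closely matches any prior sample."""
    return any(
        difflib.SequenceMatcher(None, sample, prior).ratio() >= threshold
        for prior in existing
    )
```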
Distillation
Larger model teaches smaller model:
# Teacher generates high-quality examples
teacher_outputs = teacher_model.generate(prompts)
# Student learns from teacher
student_model.train(prompts, teacher_outputs)
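Beyond training on the teacher's raw text, classic logit distillation has the student match the teacher's softened token distribution. A self-contained sketch of the loss (the temperature value is illustrative):

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to a probability distribution at a given temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions:
    the quantity minimized in classic logit distillation."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

The loss is zero when the student reproduces the teacher exactly and grows as the distributions diverge.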
Success Stories
DeepSeek R1
DeepSeek used reinforcement learning on synthetic reasoning chains:
- Model generates solutions with thinking steps
- Correct solutions become training data
- Model improves at reasoning
Phi Series (Microsoft)
“Textbooks Are All You Need”—synthetic textbook-quality data:
- Generated educational content
- Curated for coherence
- Small models, large capability
Alpaca/Vicuna
ChatGPT distillation:
- Generated instruction-following data from GPT-3.5/4
- Trained open-source models
- Demonstrated distillation works
Techniques
Evol-Instruct
Evolve prompts for diversity:
def evolve_instruction(instruction):
    evolutions = [
        "Make this more complex",
        "Add constraints",
        "Require multi-step reasoning",
        "Add edge cases",
    ]
    return model.generate(f"{random.choice(evolutions)}: {instruction}")
Self-Consistency
Generate multiple solutions, keep consistent ones:
def self_consistent_generation(problem, n=5):
    solutions = [model.generate(problem) for _ in range(n)]
    # Keep solutions that agree
    return majority_vote(solutions)
Backtranslation
Generate in one direction, use as training for reverse:
# Generate code from description
description = "Sort a list"
code = model.generate(f"Write code: {description}")
# Use as training: code → description
training_pair = (code, description)
Constitutional AI
AI critiques and improves its own outputs:
initial = model.generate(prompt)
critique = model.generate(f"Critique this response: {initial}")
improved = model.generate(f"Improve based on critique: {initial} | {critique}")
Risks and Limitations
Model Collapse
Models trained only on synthetic data degrade:
Model 1 → generates data → trains Model 2 → generates data → trains Model 3
↓
Quality degrades
Solution: Mix synthetic with real data.
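The degradation can be demonstrated with a toy simulation (not any specific paper's setup): each generation fits a Gaussian to samples drawn from the previous generation's fit. Estimation noise compounds, and the fitted spread drifts toward zero, i.e. diversity collapses.

```python
import random
import statistics

def simulate_collapse(generations: int = 1000, n_samples: int = 20,
                      seed: int = 0) -> list[float]:
    """Track the fitted standard deviation as each generation is trained
    only on samples from the previous generation's model."""
    random.seed(seed)
    mu, sigma = 0.0, 1.0
    history = [sigma]
    for _ in range(generations):
        samples = [random.gauss(mu, sigma) for _ in range(n_samples)]
        mu = statistics.fmean(samples)      # refit the "model"
        sigma = statistics.stdev(samples)
        history.append(sigma)
    return history
```

Mixing in fresh real data each generation re-anchors the distribution and halts the drift.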
Bias Amplification
Synthetic data inherits and can amplify biases:
- Errors become training signal
- Blind spots persist
- Diversity decreases
Verification Difficulty
For subjective tasks, what’s “correct”?
- Creative writing has no ground truth
- Nuance gets lost
- Edge cases missed
Best Practices
Mix Real and Synthetic
# Illustrative mix: ~70% real, ~30% synthetic (by sample count)
training_data = (
    real_data * 0.7 +
    synthetic_data * 0.3
)
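A runnable version of the mix, treating the 70/30 split as an illustrative ratio rather than a universal recommendation:

```python
import random

def mix_training_data(real, synthetic, real_fraction=0.7, n=None, seed=0):
    """Build a shuffled training set with a fixed real:synthetic ratio.
    Samples with replacement, so small pools can still fill large batches."""
    rng = random.Random(seed)
    n = n or len(real) + len(synthetic)
    n_real = round(n * real_fraction)
    batch = (rng.choices(real, k=n_real) +
             rng.choices(synthetic, k=n - n_real))
    rng.shuffle(batch)
    return batch
```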
Verify Where Possible
# Math: Check answer
# Code: Run tests
# Logic: Formal verification
# Factual: Cross-reference
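For code, "run tests" can be as simple as executing the generated function against known input/output pairs. A sketch, with the caveat that `exec()` on untrusted model output is unsafe outside a sandbox; real pipelines isolate this step.

```python
def passes_unit_tests(generated_code: str, test_cases: list[tuple]) -> bool:
    """Execute generated code (assumed to define sort_list) and check it
    against known input/output pairs."""
    namespace: dict = {}
    try:
        exec(generated_code, namespace)  # sandbox this in production
        fn = namespace["sort_list"]
        return all(fn(inp) == expected for inp, expected in test_cases)
    except Exception:
        return False

# Hypothetical model output and its test cases
candidate = "def sort_list(xs):\n    return sorted(xs)"
tests = [([3, 1, 2], [1, 2, 3]), ([], [])]
```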
Maintain Diversity
Track and ensure variety:
def diversity_score(samples):
    # Embedding diversity
    # Topic diversity
    # Style diversity
    return combined_score
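One concrete component of such a score is distinct-n: the fraction of n-grams in the corpus that are unique. It is a simple proxy; embedding-based measures also catch paraphrased near-duplicates.

```python
def distinct_n_score(samples: list[str], n: int = 2) -> float:
    """Distinct-n: unique n-grams divided by total n-grams across the corpus.
    1.0 means no repetition; values near 0 signal heavy repetition."""
    ngrams = []
    for text in samples:
        tokens = text.lower().split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)
```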
Monitor for Degradation
Track model performance across generations:
for generation in range(10):
    model = train_on_synthetic(model, data)
    score = evaluate(model, held_out_test)
    if score < threshold:
        add_more_real_data()
The Future
2025-2027 trends:
- Synthetic data pipelines: Standard infrastructure
- Verification at scale: Automated correctness checking
- Diverse generation: Broader, more varied outputs
- Hybrid approaches: Real + synthetic optimization
Final Thoughts
Synthetic data isn’t a cheat—it’s a tool. Used carefully, it extends training data and improves models. Used carelessly, it degrades them.
The web data wall is real. Synthetic data is one answer. But verification and diversity are non-negotiable.
When you run out of data, make more—carefully.