Synthetic Data: Training Models when Web Data Runs Out
The scaling laws demand more data. But the internet is finite. We’ve scraped most of it. What now? Synthetic data—artificially generated training examples—is the emerging solution.
The Data Wall
How We Got Here
- GPT-3 (2020): Trained on ~500 billion tokens
- GPT-4 (2023): Trained on trillions of tokens
- 2025+: Running low on new, high-quality web data
The problem:
- Most quality text already used
- Repeated training on same data degrades models
- Legal constraints limiting data sources
- Quality matters more than quantity
The Math
Available high-quality text: ~10 trillion tokens
Current model training needs: 15+ trillion tokens
Gap: Growing
We need new data sources.
What is Synthetic Data?
Data generated by AI models to train other AI models:
Model A generates text → Text trains Model B
Types:
- Text generation: LLMs writing training data
- Instruction tuning: Generating question-answer pairs
- Reasoning chains: Creating step-by-step solutions
- Code synthesis: Generating programming examples
- Simulation: Generating edge cases and scenarios
How It Works
Self-Improvement Loop
def generate_synthetic_data(model, prompt_template, n_samples):
    samples = []
    for _ in range(n_samples):
        # Generate problem
        problem = model.generate(f"Generate a novel {prompt_template}")
        # Generate solution with reasoning
        solution = model.generate(f"Solve step-by-step: {problem}")
        # Verify (important!)
        if verify_solution(problem, solution):
            samples.append((problem, solution))
    return samples
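The loop above leaves `verify_solution` undefined. Here is a minimal sketch for one verifiable domain, arithmetic word problems, under two illustrative assumptions: the problem is a plain `+ - * /` expression, and the solution's final token is its numeric answer.

```python
import ast
import operator

# Hypothetical verifier for arithmetic problems: safely evaluate the
# expression and compare it against the model's final answer.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def safe_eval(expr: str) -> float:
    """Evaluate a +-*/ arithmetic expression without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError(f"unsupported expression: {expr}")
    return walk(ast.parse(expr, mode="eval"))

def verify_solution(problem: str, solution: str) -> bool:
    """Keep a sample only if the final number matches the ground truth."""
    try:
        expected = safe_eval(problem)
        answer = float(solution.strip().split()[-1])  # last token = answer
        return abs(answer - expected) < 1e-6
    except (ValueError, SyntaxError, IndexError):
        return False
```

Real pipelines swap in domain-specific checkers (test suites for code, theorem provers for logic); the structure stays the same.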
Filtering and Verification
Not all synthetic data is equal:
def quality_filter(sample):
    # Check for correctness (where verifiable)
    if is_math_problem(sample):
        return verify_math_answer(sample)
    # Check for diversity (avoid repetition)
    if too_similar_to_existing(sample):
        return False
    # Check for coherence
    if not is_coherent(sample):
        return False
    return True
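The de-duplication check can be sketched concretely. This version uses character-level similarity from the standard library; production pipelines typically use embedding or MinHash similarity instead, so treat the threshold as illustrative.

```python
import difflib

def too_similar_to_existing(sample: str, existing: list[str],
                            threshold: float = 0.9) -> bool:
    """Rough near-duplicate check via difflib's character-level ratio.
    Returns True if the sample closely matches any prior sample."""
    return any(
        difflib.SequenceMatcher(None, sample, prior).ratio() >= threshold
        for prior in existing
    )
```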
Distillation
Larger model teaches smaller model:
# Teacher generates high-quality examples
teacher_outputs = teacher_model.generate(prompts)
# Student learns from teacher
student_model.train(prompts, teacher_outputs)
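Beyond training on the teacher's raw text, classic logit distillation has the student match the teacher's softened token distribution. A self-contained sketch of the loss (the temperature value is illustrative):

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to a probability distribution at a given temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions:
    the quantity minimized in classic logit distillation."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

The loss is zero when the student reproduces the teacher exactly and grows as the distributions diverge.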
Success Stories
DeepSeek R1
DeepSeek used reinforcement learning on synthetic reasoning chains:
- Model generates solutions with thinking steps
- Correct solutions become training data
- Model improves at reasoning
Phi Series (Microsoft)
“Textbooks Are All You Need”—synthetic textbook-quality data:
- Generated educational content
- Curated for coherence
- Small models, large capability
Alpaca/Vicuna
ChatGPT distillation:
- Generated instruction-following data from GPT-3.5/4
- Trained open-source models
- Demonstrated distillation works
Techniques
Evol-Instruct
Evolve prompts for diversity:
def evolve_instruction(instruction):
    evolutions = [
        "Make this more complex",
        "Add constraints",
        "Require multi-step reasoning",
        "Add edge cases",
    ]
    return model.generate(f"{random.choice(evolutions)}: {instruction}")
Self-Consistency
Generate multiple solutions, keep consistent ones:
def self_consistent_generation(problem, n=5):
    solutions = [model.generate(problem) for _ in range(n)]
    # Keep solutions that agree
    return majority_vote(solutions)
Backtranslation
Generate in one direction, use as training for reverse:
# Generate code from description
description = "Sort a list"
code = model.generate(f"Write code: {description}")
# Use as training: code → description
training_pair = (code, description)
Constitutional AI
AI critiques and improves its own outputs:
initial = model.generate(prompt)
critique = model.generate(f"Critique this response: {initial}")
improved = model.generate(f"Improve based on critique: {initial} | {critique}")
Risks and Limitations
Model Collapse
Models trained only on synthetic data degrade:
Model 1 → generates data → trains Model 2 → generates data → trains Model 3
↓
Quality degrades
Solution: Mix synthetic with real data.
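The degradation can be demonstrated with a toy simulation (not any specific paper's setup): each generation fits a Gaussian to samples drawn from the previous generation's fit. Estimation noise compounds, and the fitted spread drifts toward zero, i.e. diversity collapses.

```python
import random
import statistics

def simulate_collapse(generations: int = 1000, n_samples: int = 20,
                      seed: int = 0) -> list[float]:
    """Track the fitted standard deviation as each generation is trained
    only on samples from the previous generation's model."""
    random.seed(seed)
    mu, sigma = 0.0, 1.0
    history = [sigma]
    for _ in range(generations):
        samples = [random.gauss(mu, sigma) for _ in range(n_samples)]
        mu = statistics.fmean(samples)      # refit the "model"
        sigma = statistics.stdev(samples)
        history.append(sigma)
    return history
```

Mixing in fresh real data each generation re-anchors the distribution and halts the drift.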
Bias Amplification
Synthetic data inherits and can amplify biases:
- Errors become training signal
- Blind spots persist
- Diversity decreases
Verification Difficulty
For subjective tasks, what’s “correct”?
- Creative writing has no ground truth
- Nuance gets lost
- Edge cases missed
Best Practices
Mix Real and Synthetic
# Illustrative mix: ~70% real, ~30% synthetic (by sample count)
training_data = (
    real_data * 0.7 +
    synthetic_data * 0.3
)
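A runnable version of the mix, treating the 70/30 split as an illustrative ratio rather than a universal recommendation:

```python
import random

def mix_training_data(real, synthetic, real_fraction=0.7, n=None, seed=0):
    """Build a shuffled training set with a fixed real:synthetic ratio.
    Samples with replacement, so small pools can still fill large batches."""
    rng = random.Random(seed)
    n = n or len(real) + len(synthetic)
    n_real = round(n * real_fraction)
    batch = (rng.choices(real, k=n_real) +
             rng.choices(synthetic, k=n - n_real))
    rng.shuffle(batch)
    return batch
```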
Verify Where Possible
# Math: Check answer
# Code: Run tests
# Logic: Formal verification
# Factual: Cross-reference
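For code, "run tests" can be as simple as executing the generated function against known input/output pairs. A sketch, with the caveat that `exec()` on untrusted model output is unsafe outside a sandbox; real pipelines isolate this step.

```python
def passes_unit_tests(generated_code: str, test_cases: list[tuple]) -> bool:
    """Execute generated code (assumed to define sort_list) and check it
    against known input/output pairs."""
    namespace: dict = {}
    try:
        exec(generated_code, namespace)  # sandbox this in production
        fn = namespace["sort_list"]
        return all(fn(inp) == expected for inp, expected in test_cases)
    except Exception:
        return False

# Hypothetical model output and its test cases
candidate = "def sort_list(xs):\n    return sorted(xs)"
tests = [([3, 1, 2], [1, 2, 3]), ([], [])]
```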
Maintain Diversity
Track and ensure variety:
def diversity_score(samples):
    # Embedding diversity
    # Topic diversity
    # Style diversity
    return combined_score
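One concrete component of such a score is distinct-n: the fraction of n-grams in the corpus that are unique. It is a simple proxy; embedding-based measures also catch paraphrased near-duplicates.

```python
def distinct_n_score(samples: list[str], n: int = 2) -> float:
    """Distinct-n: unique n-grams divided by total n-grams across the corpus.
    1.0 means no repetition; values near 0 signal heavy repetition."""
    ngrams = []
    for text in samples:
        tokens = text.lower().split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)
```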
Monitor for Degradation
Track model performance across generations:
for generation in range(10):
    model = train_on_synthetic(model, data)
    score = evaluate(model, held_out_test)
    if score < threshold:
        add_more_real_data()
The Future
2025-2027 trends:
- Synthetic data pipelines: Standard infrastructure
- Verification at scale: Automated correctness checking
- Diverse generation: Broader, more varied outputs
- Hybrid approaches: Real + synthetic optimization
Final Thoughts
Synthetic data isn’t a cheat—it’s a tool. Used carefully, it extends training data and improves models. Used carelessly, it degrades them.
The web data wall is real. Synthetic data is one answer. But verification and diversity are non-negotiable.
When you run out of data, make more—carefully.