OpenAI o1: The Age of Reasoning Models
ai ml
In September 2024, OpenAI released o1—a model that “thinks” before responding. Instead of immediately generating tokens, it spends time on internal reasoning. The result: dramatically better performance on complex logic, math, and code.
What’s Different
Traditional LLMs

```
Input → Generate tokens → Output
Time: proportional to output length
```

o1

```
Input → Reasoning (hidden) → Generate tokens → Output
Time: proportional to problem complexity + output length
```
The model reasons internally before responding.
Benchmarks
| Task | GPT-4o | o1-preview | o1-mini |
|---|---|---|---|
| AIME (Math) | 13.4% | 74.4% | 70.0% |
| Codeforces | 11th %ile | 89th %ile | 93rd %ile |
| GPQA (Science) | 56.1% | 77.3% | 60.0% |
Massive improvements on reasoning-heavy tasks.
How It Works
Chain of Thought (Internal)
```
User: Solve the integral of x²e^x

o1 internally:
- Need integration by parts
- Let u = x², dv = e^x dx
- du = 2x dx, v = e^x
- x²e^x - ∫2xe^x dx
- Apply integration by parts again...
- [continues reasoning]

Output: The integral is e^x(x² - 2x + 2) + C
```
You don’t see the reasoning, but the model does it.
Scaling Test-Time Compute
Traditional: Better model = train more
o1: Better answer = think more
Spend compute at inference, not just training.
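o1's internal mechanism is proprietary, but the core idea, trading extra inference compute for answer quality, can be illustrated with a simple self-consistency sketch: sample several answers and take the majority vote. This is not how o1 works internally; it is just a minimal demonstration of the test-time-compute principle.

```python
from collections import Counter
import random

def majority_vote(sample_fn, n_samples=5):
    """Trade inference compute for reliability: call the model n times
    and return the most common answer. More samples means more compute
    and (often) higher accuracy."""
    answers = [sample_fn() for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Illustration with a noisy stand-in for a model call:
def noisy_solver():
    # Returns the right answer 60% of the time.
    return "42" if random.random() < 0.6 else random.choice(["41", "43"])

print(majority_vote(noisy_solver, n_samples=25))
```

With 25 samples, the 60%-accurate solver almost always produces the right majority answer, even though any single call is unreliable.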
API Usage
```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o1-preview",
    messages=[
        {
            "role": "user",
            "content": "Prove that there are infinitely many prime numbers"
        }
    ]
)

print(response.choices[0].message.content)
```
Key Differences
```python
# o1-preview doesn't support:
# - system messages (fold instructions into the user message instead)
# - streaming (the full reasoning pass completes before anything returns)
# - temperature/top_p (sampling parameters are fixed; setting them errors)

# Correct usage
response = client.chat.completions.create(
    model="o1-preview",
    messages=[
        {"role": "user", "content": "Your task: ..."}
    ]
    # No system message, no temperature, no streaming
)
```
Model Variants
| Model | Reasoning | Speed | Cost | Use Case |
|---|---|---|---|---|
| o1-preview | Deep | Slow | $$$ | Complex problems |
| o1-mini | Focused | Medium | $$ | STEM tasks |
o1-mini is cheaper but specialized for code/math.
Wait Times
- Simple question: 5–10 seconds
- Math problem: 30–60 seconds
- Complex code: 60–120 seconds
Reasoning takes time. Plan for latency.
When to Use o1
Good For
✅ Complex multi-step problems
✅ Mathematical proofs
✅ Code debugging (hard bugs)
✅ Scientific reasoning
✅ Strategic planning
Not Necessary For
❌ Simple Q&A
❌ Creative writing
❌ Translation
❌ Summarization
❌ Anything GPT-4 does well
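The two lists above reduce to a simple routing rule. A sketch (the task-category names are illustrative, not an API concept):

```python
# Task categories that benefit from o1's extended reasoning.
REASONING_TASKS = {
    "multi_step_problem", "math_proof", "hard_debugging",
    "scientific_reasoning", "strategic_planning",
}

def pick_model(task_type: str) -> str:
    """Route reasoning-heavy tasks to o1; everything else stays on GPT-4."""
    return "o1-preview" if task_type in REASONING_TASKS else "gpt-4-turbo"

print(pick_model("math_proof"))   # o1-preview
print(pick_model("translation"))  # gpt-4-turbo
```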
Practical Examples
Debugging Complex Code
```python
prompt = """
This code has a subtle bug causing intermittent failures.
Find and explain the issue:

def process_transactions(transactions):
    total = 0
    for tx in transactions:
        if tx.status == 'pending':
            continue
        total += tx.amount
        if tx.is_refund:
            total -= tx.amount * 2
    return total
"""
# o1 will trace through the logic systematically
```
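To see why the refund logic is suspicious, here is a minimal harness around the snippet (`Transaction` is a hypothetical dataclass standing in for the real model): each refund is first added, then has double its amount subtracted, so it nets to −amount, while pending transactions are skipped entirely.

```python
from dataclasses import dataclass

@dataclass
class Transaction:  # hypothetical stand-in for the real model
    amount: float
    status: str
    is_refund: bool = False

def process_transactions(transactions):
    total = 0
    for tx in transactions:
        if tx.status == 'pending':
            continue
        total += tx.amount
        if tx.is_refund:
            total -= tx.amount * 2
    return total

txs = [
    Transaction(100, 'completed'),
    Transaction(50, 'completed', is_refund=True),
    Transaction(30, 'pending'),
]
# 100 (sale) + 50 - 100 (refund nets to -50) + 0 (pending skipped) = 50
print(process_transactions(txs))  # 50
```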
Mathematical Reasoning
```python
prompt = """
Prove that for any triangle with sides a, b, c and
semiperimeter s, the area can be expressed as:

A = √(s(s-a)(s-b)(s-c))
"""
# o1 can work through geometric proofs
```
System Design
```python
prompt = """
Design a rate limiter for an API that:
1. Handles 10,000 requests per second
2. Supports per-user limits and global limits
3. Works across multiple server instances
4. Gracefully handles Redis failures

Provide the complete architecture with trade-offs.
"""
# o1 considers multiple angles systematically
```
Cost Considerations
```
GPT-4 Turbo:  $10/M input, $30/M output
o1-preview:   $15/M input, $60/M output (+ reasoning tokens)
```
Reasoning tokens are billed but hidden.
A "quick" response might use 10x the visible tokens internally.
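Because the hidden tokens are billed at the output rate, it is worth estimating the real cost of a call. A sketch, using o1-preview's launch prices (the reasoning-token count is reported in the API response under `usage.completion_tokens_details.reasoning_tokens`, if memory serves; verify against the current API reference):

```python
O1_PREVIEW_PRICES = {"input": 15.00, "output": 60.00}  # $ per 1M tokens

def o1_cost(prompt_tokens, visible_tokens, reasoning_tokens):
    """Reasoning tokens are billed at the output rate even though
    they never appear in the response."""
    output_tokens = visible_tokens + reasoning_tokens
    return (prompt_tokens * O1_PREVIEW_PRICES["input"]
            + output_tokens * O1_PREVIEW_PRICES["output"]) / 1_000_000

# A "short" 300-token answer that burned 3,000 hidden reasoning tokens:
print(f"${o1_cost(500, 300, 3000):.4f}")  # $0.2055
```

Note that the 300 visible tokens account for only a small fraction of the bill; the reasoning tokens dominate.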
Cost Optimization
```python
# Use the cheaper model first, escalate only when needed.
# (ask_gpt4, is_confident, and ask_o1 are placeholder helpers
# you would implement on top of the API.)
def solve_problem(problem):
    # Try GPT-4 first
    gpt4_result = ask_gpt4(problem)
    if is_confident(gpt4_result):
        return gpt4_result
    # Escalate to o1 for hard problems
    return ask_o1(problem)
```
Limitations
- No streaming (must wait for full response)
- No system prompts
- Higher latency
- More expensive
- Overkill for simple tasks
The Paradigm Shift
Before o1

```
Model quality ≈ Training scale
Better = bigger model, more data
```

With o1

```
Answer quality ≈ Thinking time
Better = more inference compute
```
This enables different trade-offs.
Implications
For Complex Tasks
Before: "AI can't reliably do X"
After: "AI can do X, it just needs time to think"
For Developers
```python
# Design for variable latency
async def handle_hard_problem(problem):
    # Show a "thinking" indicator while the model reasons
    yield {"status": "thinking"}
    result = await o1_solve(problem)  # placeholder async API call
    yield {"status": "complete", "result": result}
```
For Product Design
```
Trade-offs:
├── Speed:   GPT-4 (fast, good enough)
├── Quality: o1 (slow, excellent)
└── Hybrid:  Try fast, escalate if needed
```
Final Thoughts
o1 represents a new axis for AI improvement: test-time compute. Instead of only making models bigger, we can make them think longer.
For problems that require real reasoning, o1 is worth the wait and cost.
Sometimes the right answer takes time.