OpenAI o1: The Age of Reasoning Models
ai ml
In September 2024, OpenAI released o1—a model that “thinks” before responding. Instead of immediately generating tokens, it spends time on internal reasoning. The result: dramatically better performance on complex logic, math, and code.
What’s Different
Traditional LLMs

```
Input → Generate tokens → Output
Time: proportional to output length
```

o1

```
Input → Reasoning (hidden) → Generate tokens → Output
Time: proportional to problem complexity + output length
```
The model reasons internally before responding.
Benchmarks
| Task | GPT-4o | o1-preview | o1-mini |
|---|---|---|---|
| AIME (Math) | 13.4% | 74.4% | 70.0% |
| Codeforces | 11th %ile | 89th %ile | 93rd %ile |
| GPQA (Science) | 56.1% | 77.3% | 60.0% |
Massive improvements on reasoning-heavy tasks.
How It Works
Chain of Thought (Internal)
```
User: Solve the integral of x²e^x

o1 internally:
- Need integration by parts
- Let u = x², dv = e^x dx
- du = 2x dx, v = e^x
- x²e^x - ∫2xe^x dx
- Apply integration by parts again...
- [continues reasoning]

Output: The integral is e^x(x² - 2x + 2) + C
```
You don’t see the reasoning, but the model does it.
Scaling Test-Time Compute
Traditional: Better model = train more
o1: Better answer = think more
Spend compute at inference, not just training.
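o1's internal mechanism is proprietary, but the core idea, trading extra inference compute for answer quality, can be illustrated with a simple self-consistency sketch: sample several answers and take the majority vote. This is not how o1 works internally; it is just a minimal demonstration of the test-time-compute principle.

```python
from collections import Counter
import random

def majority_vote(sample_fn, n_samples=5):
    """Trade inference compute for reliability: call the model n times
    and return the most common answer. More samples means more compute
    and (often) higher accuracy."""
    answers = [sample_fn() for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Illustration with a noisy stand-in for a model call:
def noisy_solver():
    # Returns the right answer 60% of the time.
    return "42" if random.random() < 0.6 else random.choice(["41", "43"])

print(majority_vote(noisy_solver, n_samples=25))
```

With 25 samples, the 60%-accurate solver almost always produces the right majority answer, even though any single call is unreliable.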
API Usage
```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o1-preview",
    messages=[
        {
            "role": "user",
            "content": "Prove that there are infinitely many prime numbers"
        }
    ]
)

print(response.choices[0].message.content)
```
Key Differences
```python
# o1-preview doesn't support:
# - system messages (fold instructions into the user message instead)
# - streaming (the full reasoning pass completes before anything returns)
# - temperature/top_p (sampling parameters are fixed; setting them errors)

# Correct usage
response = client.chat.completions.create(
    model="o1-preview",
    messages=[
        {"role": "user", "content": "Your task: ..."}
    ]
    # No system message, no temperature, no streaming
)
```
Model Variants
| Model | Reasoning | Speed | Cost | Use Case |
|---|---|---|---|---|
| o1-preview | Deep | Slow | $$$ | Complex problems |
| o1-mini | Focused | Medium | $$ | STEM tasks |
o1-mini is cheaper but specialized for code/math.
Wait Times
- Simple question: 5–10 seconds
- Math problem: 30–60 seconds
- Complex code: 60–120 seconds
Reasoning takes time. Plan for latency.
When to Use o1
Good For
✅ Complex multi-step problems
✅ Mathematical proofs
✅ Code debugging (hard bugs)
✅ Scientific reasoning
✅ Strategic planning
Not Necessary For
❌ Simple Q&A
❌ Creative writing
❌ Translation
❌ Summarization
❌ Anything GPT-4 does well
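The two lists above reduce to a simple routing rule. A sketch (the task-category names are illustrative, not an API concept):

```python
# Task categories that benefit from o1's extended reasoning.
REASONING_TASKS = {
    "multi_step_problem", "math_proof", "hard_debugging",
    "scientific_reasoning", "strategic_planning",
}

def pick_model(task_type: str) -> str:
    """Route reasoning-heavy tasks to o1; everything else stays on GPT-4."""
    return "o1-preview" if task_type in REASONING_TASKS else "gpt-4-turbo"

print(pick_model("math_proof"))   # o1-preview
print(pick_model("translation"))  # gpt-4-turbo
```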
Practical Examples
Debugging Complex Code
```python
prompt = """
This code has a subtle bug causing intermittent failures.
Find and explain the issue:

def process_transactions(transactions):
    total = 0
    for tx in transactions:
        if tx.status == 'pending':
            continue
        total += tx.amount
        if tx.is_refund:
            total -= tx.amount * 2
    return total
"""
# o1 will trace through the logic systematically
```
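To see why the refund logic is suspicious, here is a minimal harness around the snippet (`Transaction` is a hypothetical dataclass standing in for the real model): each refund is first added, then has double its amount subtracted, so it nets to −amount, while pending transactions are skipped entirely.

```python
from dataclasses import dataclass

@dataclass
class Transaction:  # hypothetical stand-in for the real model
    amount: float
    status: str
    is_refund: bool = False

def process_transactions(transactions):
    total = 0
    for tx in transactions:
        if tx.status == 'pending':
            continue
        total += tx.amount
        if tx.is_refund:
            total -= tx.amount * 2
    return total

txs = [
    Transaction(100, 'completed'),
    Transaction(50, 'completed', is_refund=True),
    Transaction(30, 'pending'),
]
# 100 (sale) + 50 - 100 (refund nets to -50) + 0 (pending skipped) = 50
print(process_transactions(txs))  # 50
```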
Mathematical Reasoning
```python
prompt = """
Prove that for any triangle with sides a, b, c and
semiperimeter s, the area can be expressed as:

A = √(s(s-a)(s-b)(s-c))
"""
# o1 can work through geometric proofs
```
System Design
```python
prompt = """
Design a rate limiter for an API that:
1. Handles 10,000 requests per second
2. Supports per-user limits and global limits
3. Works across multiple server instances
4. Gracefully handles Redis failures

Provide the complete architecture with trade-offs.
"""
# o1 considers multiple angles systematically
```
Cost Considerations
```
GPT-4 Turbo:  $10/M input, $30/M output
o1-preview:   $15/M input, $60/M output (+ reasoning tokens)
```
Reasoning tokens are billed but hidden.
A "quick" response might use 10x the visible tokens internally.
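Because the hidden tokens are billed at the output rate, it is worth estimating the real cost of a call. A sketch, using o1-preview's launch prices (the reasoning-token count is reported in the API response under `usage.completion_tokens_details.reasoning_tokens`, if memory serves; verify against the current API reference):

```python
O1_PREVIEW_PRICES = {"input": 15.00, "output": 60.00}  # $ per 1M tokens

def o1_cost(prompt_tokens, visible_tokens, reasoning_tokens):
    """Reasoning tokens are billed at the output rate even though
    they never appear in the response."""
    output_tokens = visible_tokens + reasoning_tokens
    return (prompt_tokens * O1_PREVIEW_PRICES["input"]
            + output_tokens * O1_PREVIEW_PRICES["output"]) / 1_000_000

# A "short" 300-token answer that burned 3,000 hidden reasoning tokens:
print(f"${o1_cost(500, 300, 3000):.4f}")  # $0.2055
```

Note that the 300 visible tokens account for only a small fraction of the bill; the reasoning tokens dominate.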
Cost Optimization
```python
# Use the cheaper model first, escalate only when needed.
# (ask_gpt4, is_confident, and ask_o1 are placeholder helpers
# you would implement on top of the API.)
def solve_problem(problem):
    # Try GPT-4 first
    gpt4_result = ask_gpt4(problem)
    if is_confident(gpt4_result):
        return gpt4_result
    # Escalate to o1 for hard problems
    return ask_o1(problem)
```
Limitations
- No streaming (must wait for full response)
- No system prompts
- Higher latency
- More expensive
- Overkill for simple tasks
The Paradigm Shift
Before o1

```
Model quality ≈ Training scale
Better = bigger model, more data
```

With o1

```
Answer quality ≈ Thinking time
Better = more inference compute
```
This enables different trade-offs.
Implications
For Complex Tasks
Before: "AI can't reliably do X"
After: "AI can do X, it just needs time to think"
For Developers
```python
# Design for variable latency
async def handle_hard_problem(problem):
    # Show a "thinking" indicator while the model reasons
    yield {"status": "thinking"}
    result = await o1_solve(problem)  # placeholder async API call
    yield {"status": "complete", "result": result}
```
For Product Design
```
Trade-offs:
├── Speed:   GPT-4 (fast, good enough)
├── Quality: o1 (slow, excellent)
└── Hybrid:  Try fast, escalate if needed
```
Final Thoughts
o1 represents a new axis for AI improvement: test-time compute. Instead of only making models bigger, we can make them think longer.
For problems that require real reasoning, o1 is worth the wait and cost.
Sometimes the right answer takes time.