GPT-4: Multimodal Reasoning Arrives


On March 14, 2023, OpenAI released GPT-4. The leap from GPT-3.5 was significant: better reasoning, longer context, and vision capabilities. The bar moved again.

What’s New

Multimodal Input

GPT-4 can see:

User: [image of a refrigerator full of ingredients]
      What can I make for dinner?

GPT-4: I can see eggs, cheese, bell peppers, and spinach.
       You could make a vegetable frittata or an omelette.

Improved Reasoning

GPT-4 performs significantly better on exams:

Exam         GPT-3.5           GPT-4
Bar Exam     10th percentile   90th percentile
SAT Math     590               710
LSAT         40th percentile   88th percentile
AP Calculus  2                 4

Longer Context

Model      Context Window
GPT-3.5    4K tokens
GPT-4      8K tokens
GPT-4-32k  32K tokens

That’s roughly 50 pages of text in context.
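
The "50 pages" figure is a back-of-the-envelope estimate. A quick sketch of the arithmetic, assuming the common rules of thumb of ~0.75 English words per token and ~500 words per printed page (both are approximations, not exact conversions):

```python
# Rough capacity estimate for a 32K-token context window.
# Assumptions: ~0.75 words per token, ~500 words per page (rules of thumb).
TOKENS = 32_000
WORDS_PER_TOKEN = 0.75
WORDS_PER_PAGE = 500

words = TOKENS * WORDS_PER_TOKEN   # 24,000 words
pages = words / WORDS_PER_PAGE     # 48 pages

print(f"~{words:,.0f} words, ~{pages:.0f} pages")  # → ~24,000 words, ~48 pages
```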

Using the API

import openai

# Text-only
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ]
)

# With image (GPT-4V)
response = openai.ChatCompletion.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
            ]
        }
    ]
)
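
In the pre-1.0 openai library used above, the result is a dict-like object and the generated text lives under choices[0]. A minimal sketch of pulling it out; the payload below is hand-written for illustration, not real API output:

```python
# Shape of a ChatCompletion response in the pre-1.0 openai library.
# This sample dict is illustrative, not an actual API response.
response = {
    "choices": [
        {"message": {"role": "assistant",
                     "content": "Quantum computing uses qubits..."}}
    ],
    "usage": {"prompt_tokens": 25, "completion_tokens": 12, "total_tokens": 37},
}

answer = response["choices"][0]["message"]["content"]
tokens_used = response["usage"]["total_tokens"]
print(answer, tokens_used)
```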

Code Capabilities

Complex Refactoring

User: Refactor this 200-line function into smaller, testable units.
      Also add type hints and docstrings.

GPT-4: [Produces well-structured refactored code with explanations]

Understanding Codebases

User: Here's a Django view, serializer, and model. Find the bug
      that causes N+1 queries.

GPT-4: The issue is in line 15 of the view. You're accessing 
       author.name in a loop without select_related. Here's the fix...
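
The fix GPT-4 describes is the standard Django pattern: fetch the related row in the same query rather than once per object. A sketch (the Post model and author field are illustrative, not from the conversation above):

```
# N+1: one query for the posts, then one extra query per post for its author.
for post in Post.objects.all():
    print(post.author.name)

# Fixed: select_related issues a single JOIN that pulls authors in with posts.
for post in Post.objects.select_related("author"):
    print(post.author.name)
```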

Architecture Discussions

User: Design a message queue system for a microservices architecture
      handling 1M messages/hour.

GPT-4: [Detailed architecture with trade-offs, diagrams explained,
       technology recommendations]

Limitations

Hallucinations Still Happen

User: What papers did Sangeet Verma publish in 2022?

GPT-4: [Confidently makes up paper titles and journals]

Better than GPT-3.5, but not solved.

Knowledge Cutoff

Training data still has an end date. GPT-4’s cutoff was September 2021 at launch.

Slower and More Expensive

Model          Speed    Cost (input)  Cost (output)
GPT-3.5-turbo  Fast     $0.0015/1K    $0.002/1K
GPT-4          Slower   $0.03/1K      $0.06/1K
GPT-4-32k      Slowest  $0.06/1K      $0.12/1K

Roughly 20x more expensive than GPT-3.5 on input tokens, and 30x on output.
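
The table translates directly into a per-request cost estimate. A minimal sketch using the launch-era prices above (prices changed later, so treat the numbers as historical):

```python
# Per-1K-token prices at GPT-4 launch (USD), from the table above.
PRICES = {
    "gpt-3.5-turbo": (0.0015, 0.002),
    "gpt-4":         (0.03,   0.06),
    "gpt-4-32k":     (0.06,   0.12),
}

def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Return the estimated request cost in dollars."""
    in_price, out_price = PRICES[model]
    return (prompt_tokens / 1000) * in_price + (completion_tokens / 1000) * out_price

# Same request on two models: the gap is the 20-30x from the table.
print(round(estimate_cost("gpt-4", 2000, 500), 4))          # 2*0.03 + 0.5*0.06 → 0.09
print(round(estimate_cost("gpt-3.5-turbo", 2000, 500), 4))  # 2*0.0015 + 0.5*0.002 → 0.004
```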

Developer Implications

What Changed

  1. Complex tasks become viable: Multi-step reasoning that failed with GPT-3.5 works
  2. Code understanding improves: Can handle larger, more complex codebases
  3. Vision integration: Screen reading, diagram understanding, document analysis

New Patterns

# Long document analysis
with open("contract.txt") as f:
    contract = f.read()  # Can now be 30+ pages

response = openai.ChatCompletion.create(
    model="gpt-4-32k",
    messages=[
        {"role": "user", "content": f"""
Analyze this contract for risks:

{contract}

Provide:
1. Key terms summary
2. Potential risks
3. Unusual clauses
"""}
    ]
)
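
Even 32K tokens runs out for very long documents. A common workaround is to split the text and analyze it chunk by chunk; a minimal character-based sketch (a real implementation would count tokens rather than characters, e.g. with tiktoken, and the thresholds here are illustrative):

```python
def chunk_text(text: str, max_chars: int = 12_000, overlap: int = 500) -> list[str]:
    """Split text into overlapping chunks so context survives the boundaries."""
    if len(text) <= max_chars:
        return [text]
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break
        start += max_chars - overlap  # step back so chunks share some context
    return chunks

# Each chunk would then go out in its own ChatCompletion request.
parts = chunk_text("x" * 30_000)
print(len(parts))  # → 3
```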

Vision Use Cases

# Code review from screenshot
review = analyze_image(
    image="screenshot_of_code.png",
    prompt="Review this code for issues. What would you change?"
)

# UI testing
issues = analyze_image(
    image="app_screenshot.png",
    prompt="Identify any UI/UX issues in this interface."
)

# Architecture diagrams
explanation = analyze_image(
    image="system_diagram.png",
    prompt="Explain this system architecture and identify potential bottlenecks."
)
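
analyze_image above is shorthand, not a function the openai library provides. A sketch of what such a helper might wrap, using the gpt-4-vision-preview message format shown earlier; inlining a local file as base64 is the standard approach, but the helper itself and its parameters are hypothetical:

```python
import base64

def build_vision_messages(image_path: str, prompt: str) -> list[dict]:
    """Build the messages payload for a GPT-4V request from a local image file."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }]

def analyze_image(image: str, prompt: str) -> str:
    """Hypothetical helper: send one image plus a prompt, return the model's text."""
    import openai  # assumes the pre-1.0 openai library used throughout this post
    response = openai.ChatCompletion.create(
        model="gpt-4-vision-preview",
        messages=build_vision_messages(image, prompt),
        max_tokens=500,
    )
    return response["choices"][0]["message"]["content"]
```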

Compared to Open Source

Capability  GPT-4  LLaMA-65B  Mistral-7B
Reasoning   Best   Good       Good
Coding      Best   Good       Good
Vision      Yes    No         No
Local       No     Yes        Yes
Cost        $$$    Free       Free

GPT-4 is best, but open models are catching up.

Practical Advice

When to Use GPT-4

Multi-step reasoning, large or unfamiliar codebases, vision input, and long-document analysis: tasks where GPT-3.5 fails or produces unreliable output.

When GPT-3.5 is Fine

Simple completions, classification, short summaries, and high-volume workloads where the 20x price difference dominates.

Hybrid Approach

def get_response(query, complexity):
    if complexity == "high":
        model = "gpt-4"
    else:
        model = "gpt-3.5-turbo"
    
    return openai.ChatCompletion.create(model=model, messages=[...])
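
The complexity flag has to come from somewhere. One cheap option is a heuristic router in front of the call above; a minimal sketch (the keyword list and length threshold are illustrative assumptions, not a recommendation):

```python
# Hypothetical heuristic: route long or reasoning-heavy queries to GPT-4.
COMPLEX_HINTS = ("refactor", "architecture", "design", "prove", "debug", "analyze")

def pick_model(query: str) -> str:
    """Pick a model based on query length and reasoning-heavy keywords."""
    q = query.lower()
    if len(q) > 1000 or any(hint in q for hint in COMPLEX_HINTS):
        return "gpt-4"
    return "gpt-3.5-turbo"

print(pick_model("Refactor this module into testable units"))  # → gpt-4
print(pick_model("What's the capital of France?"))             # → gpt-3.5-turbo
```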

Use GPT-4 when you need it, GPT-3.5 when you don’t.

The Takeaway

GPT-4 represented a meaningful capability jump. Not quite AGI, but clearly more capable than its predecessors across almost every benchmark.

For developers, it opened new categories of applications: long-document analysis, vision-driven tooling, and multi-step reasoning tasks that GPT-3.5 could not handle reliably.

The bar will keep moving. Today’s state-of-the-art is tomorrow’s baseline.


March 2023: The new benchmark was set.
