GPT-4: Multimodal Reasoning Arrives


On March 14, 2023, OpenAI released GPT-4. The leap from GPT-3.5 was significant: better reasoning, longer context, and vision capabilities. The bar moved again.

What’s New

Multimodal Input

GPT-4 can see:

User: [image of a refrigerator full of ingredients]
      What can I make for dinner?

GPT-4: I can see eggs, cheese, bell peppers, and spinach.
       You could make a vegetable frittata or an omelette.

Improved Reasoning

GPT-4 performs significantly better on exams:

Exam         GPT-3.5           GPT-4
Bar Exam     10th percentile   90th percentile
SAT Math     590               710
LSAT         40th percentile   88th percentile
AP Calculus  2                 4

Longer Context

Model      Context Window
GPT-3.5    4K tokens
GPT-4      8K tokens
GPT-4-32k  32K tokens

That’s roughly 50 pages of text in context.
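
The "50 pages" figure is a back-of-the-envelope estimate. A quick sketch of the arithmetic, assuming the common rules of thumb of ~0.75 English words per token and ~500 words per printed page (both are approximations, not exact conversions):

```python
# Rough capacity estimate for a 32K-token context window.
# Assumptions: ~0.75 words per token, ~500 words per page (rules of thumb).
TOKENS = 32_000
WORDS_PER_TOKEN = 0.75
WORDS_PER_PAGE = 500

words = TOKENS * WORDS_PER_TOKEN   # 24,000 words
pages = words / WORDS_PER_PAGE     # 48 pages

print(f"~{words:,.0f} words, ~{pages:.0f} pages")  # → ~24,000 words, ~48 pages
```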

Using the API

import openai

# Text-only
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ]
)

# With image (GPT-4V)
response = openai.ChatCompletion.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
            ]
        }
    ]
)
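
In the pre-1.0 openai library used above, the result is a dict-like object and the generated text lives under choices[0]. A minimal sketch of pulling it out; the payload below is hand-written for illustration, not real API output:

```python
# Shape of a ChatCompletion response in the pre-1.0 openai library.
# This sample dict is illustrative, not an actual API response.
response = {
    "choices": [
        {"message": {"role": "assistant",
                     "content": "Quantum computing uses qubits..."}}
    ],
    "usage": {"prompt_tokens": 25, "completion_tokens": 12, "total_tokens": 37},
}

answer = response["choices"][0]["message"]["content"]
tokens_used = response["usage"]["total_tokens"]
print(answer, tokens_used)
```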

Code Capabilities

Complex Refactoring

User: Refactor this 200-line function into smaller, testable units.
      Also add type hints and docstrings.

GPT-4: [Produces well-structured refactored code with explanations]

Understanding Codebases

User: Here's a Django view, serializer, and model. Find the bug
      that causes N+1 queries.

GPT-4: The issue is in line 15 of the view. You're accessing 
       author.name in a loop without select_related. Here's the fix...
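
The fix GPT-4 describes is the standard Django pattern: fetch the related row in the same query rather than once per object. A sketch (the Post model and author field are illustrative, not from the conversation above):

```
# N+1: one query for the posts, then one extra query per post for its author.
for post in Post.objects.all():
    print(post.author.name)

# Fixed: select_related issues a single JOIN that pulls authors in with posts.
for post in Post.objects.select_related("author"):
    print(post.author.name)
```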

Architecture Discussions

User: Design a message queue system for a microservices architecture
      handling 1M messages/hour.

GPT-4: [Detailed architecture with trade-offs, diagrams explained,
       technology recommendations]

Limitations

Hallucinations Still Happen

User: What papers did Sangeet Verma publish in 2022?

GPT-4: [Confidently makes up paper titles and journals]

Better than GPT-3.5, but not solved.

Knowledge Cutoff

Training data still has an end date. GPT-4’s cutoff was September 2021 at launch.

Slower and More Expensive

Model          Speed    Cost (input)  Cost (output)
GPT-3.5-turbo  Fast     $0.0015/1K    $0.002/1K
GPT-4          Slower   $0.03/1K      $0.06/1K
GPT-4-32k      Slowest  $0.06/1K      $0.12/1K

Roughly 20x more expensive than GPT-3.5 on input tokens, and 30x on output.
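
The table translates directly into a per-request cost estimate. A minimal sketch using the launch-era prices above (prices changed later, so treat the numbers as historical):

```python
# Per-1K-token prices at GPT-4 launch (USD), from the table above.
PRICES = {
    "gpt-3.5-turbo": (0.0015, 0.002),
    "gpt-4":         (0.03,   0.06),
    "gpt-4-32k":     (0.06,   0.12),
}

def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Return the estimated request cost in dollars."""
    in_price, out_price = PRICES[model]
    return (prompt_tokens / 1000) * in_price + (completion_tokens / 1000) * out_price

# Same request on two models: the gap is the 20-30x from the table.
print(round(estimate_cost("gpt-4", 2000, 500), 4))          # 2*0.03 + 0.5*0.06 → 0.09
print(round(estimate_cost("gpt-3.5-turbo", 2000, 500), 4))  # 2*0.0015 + 0.5*0.002 → 0.004
```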

Developer Implications

What Changed

  1. Complex tasks become viable: Multi-step reasoning that failed with GPT-3.5 works
  2. Code understanding improves: Can handle larger, more complex codebases
  3. Vision integration: Screen reading, diagram understanding, document analysis

New Patterns

# Long document analysis
with open("contract.txt") as f:
    contract = f.read()  # Can now be 30+ pages

response = openai.ChatCompletion.create(
    model="gpt-4-32k",
    messages=[
        {"role": "user", "content": f"""
Analyze this contract for risks:

{contract}

Provide:
1. Key terms summary
2. Potential risks
3. Unusual clauses
"""}
    ]
)
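
Even 32K tokens runs out for very long documents. A common workaround is to split the text and analyze it chunk by chunk; a minimal character-based sketch (a real implementation would count tokens rather than characters, e.g. with tiktoken, and the thresholds here are illustrative):

```python
def chunk_text(text: str, max_chars: int = 12_000, overlap: int = 500) -> list[str]:
    """Split text into overlapping chunks so context survives the boundaries."""
    if len(text) <= max_chars:
        return [text]
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break
        start += max_chars - overlap  # step back so chunks share some context
    return chunks

# Each chunk would then go out in its own ChatCompletion request.
parts = chunk_text("x" * 30_000)
print(len(parts))  # → 3
```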

Vision Use Cases

# Code review from screenshot
review = analyze_image(
    image="screenshot_of_code.png",
    prompt="Review this code for issues. What would you change?"
)

# UI testing
issues = analyze_image(
    image="app_screenshot.png",
    prompt="Identify any UI/UX issues in this interface."
)

# Architecture diagrams
explanation = analyze_image(
    image="system_diagram.png",
    prompt="Explain this system architecture and identify potential bottlenecks."
)
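
analyze_image above is shorthand, not a function the openai library provides. A sketch of what such a helper might wrap, using the gpt-4-vision-preview message format shown earlier; inlining a local file as base64 is the standard approach, but the helper itself and its parameters are hypothetical:

```python
import base64

def build_vision_messages(image_path: str, prompt: str) -> list[dict]:
    """Build the messages payload for a GPT-4V request from a local image file."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }]

def analyze_image(image: str, prompt: str) -> str:
    """Hypothetical helper: send one image plus a prompt, return the model's text."""
    import openai  # assumes the pre-1.0 openai library used throughout this post
    response = openai.ChatCompletion.create(
        model="gpt-4-vision-preview",
        messages=build_vision_messages(image, prompt),
        max_tokens=500,
    )
    return response["choices"][0]["message"]["content"]
```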

Compared to Open Source

Capability  GPT-4  LLaMA-65B  Mistral-7B
Reasoning   Best   Good       Good
Coding      Best   Good       Good
Vision      Yes    No         No
Local       No     Yes        Yes
Cost        $$$    Free       Free

GPT-4 is best, but open models are catching up.

Practical Advice

When to Use GPT-4

Multi-step reasoning, large or unfamiliar codebases, vision input, and long-document analysis: tasks where GPT-3.5 fails or produces unreliable output.

When GPT-3.5 is Fine

Simple completions, classification, short summaries, and high-volume workloads where the 20x price difference dominates.

Hybrid Approach

def get_response(query, complexity):
    if complexity == "high":
        model = "gpt-4"
    else:
        model = "gpt-3.5-turbo"
    
    return openai.ChatCompletion.create(model=model, messages=[...])
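
The complexity flag has to come from somewhere. One cheap option is a heuristic router in front of the call above; a minimal sketch (the keyword list and length threshold are illustrative assumptions, not a recommendation):

```python
# Hypothetical heuristic: route long or reasoning-heavy queries to GPT-4.
COMPLEX_HINTS = ("refactor", "architecture", "design", "prove", "debug", "analyze")

def pick_model(query: str) -> str:
    """Pick a model based on query length and reasoning-heavy keywords."""
    q = query.lower()
    if len(q) > 1000 or any(hint in q for hint in COMPLEX_HINTS):
        return "gpt-4"
    return "gpt-3.5-turbo"

print(pick_model("Refactor this module into testable units"))  # → gpt-4
print(pick_model("What's the capital of France?"))             # → gpt-3.5-turbo
```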

Use GPT-4 when you need it, GPT-3.5 when you don’t.

The Takeaway

GPT-4 represented a meaningful capability jump. Not quite AGI, but clearly more capable than its predecessors across almost every benchmark.

For developers, it opened new categories of applications: long-document analysis, vision-driven tooling, and multi-step reasoning tasks that GPT-3.5 could not handle reliably.

The bar will keep moving. Today’s state-of-the-art is tomorrow’s baseline.


March 2023: The new benchmark was set.
