# GPT-4o: Omni-Modal Capabilities
GPT-4o (“o” for “omni”) was released in May 2024. It processes text, audio, and images natively in a single model, rather than as separate models stitched together. The result is faster, more natural interaction.
## What’s Different

**Before GPT-4o:**

```
Audio → Whisper → Text → GPT-4 → Text → TTS → Audio
```

Three models, significant latency.

**With GPT-4o:**

```
Audio → GPT-4o → Audio
```

One model, real-time.
Native multimodality means the model understands tone, emotion, and context directly.
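For contrast, the old three-hop pipeline can be sketched with the OpenAI Python SDK. This is an illustrative sketch assuming the classic `whisper-1` and `tts-1` endpoints, not production code:

```python
def legacy_voice_pipeline(audio_path: str) -> bytes:
    """Pre-4o voice flow: three separate models, each hop adding latency."""
    from openai import OpenAI  # requires the openai package and an API key
    client = OpenAI()
    # 1. Audio -> text (Whisper)
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)
    # 2. Text -> text (GPT-4)
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": transcript.text}],
    )
    # 3. Text -> audio (TTS); tone and emotion were already lost at step 1
    speech = client.audio.speech.create(
        model="tts-1",
        voice="alloy",
        input=reply.choices[0].message.content,
    )
    return speech.content
```

Each hop is a separate network round trip, which is where the multi-second latency comes from.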
## Capabilities
| Modality | Input | Output |
|---|---|---|
| Text | ✅ | ✅ |
| Images | ✅ | ❌ (description only) |
| Audio | ✅ | ✅ |
| Video | ✅ (frames) | ❌ |
## Speed Improvements
| Task | GPT-4 Turbo | GPT-4o |
|---|---|---|
| Text response | ~5s first token | ~320ms |
| Voice to voice | ~2-3s | ~320ms |
| Image understanding | ~3s | ~1s |
Conversational AI became actually conversational.
## API Usage

### Text (Same as Before)

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Explain quantum computing"}
    ]
)
```
### Vision

```python
import base64

def encode_image(path):
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{encode_image('photo.jpg')}"
                    }
                }
            ]
        }
    ]
)
```
### Audio (Real-time API)

```python
# Real-time audio requires a WebSocket connection
import asyncio
import json

import websockets

async def realtime_conversation():
    async with websockets.connect(
        # Model name and beta header as of the initial Realtime API release
        "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview",
        extra_headers={
            "Authorization": f"Bearer {api_key}",
            "OpenAI-Beta": "realtime=v1",
        },
    ) as ws:
        # Audio is exchanged as JSON events carrying base64-encoded chunks
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": audio_chunk_b64,
        }))
        # Responses arrive as JSON events too
        response = json.loads(await ws.recv())
```
## Practical Applications
### Voice Assistants

```python
# Natural conversation with interruption handling
async def voice_assistant():
    while True:
        # GPT-4o handles:
        # - User interruptions
        # - Emotion detection
        # - Natural pauses
        # - Tone matching
        pass
```
### Real-time Translation

```
User speaks English → GPT-4o → Responds in Spanish (audio)
```

No intermediate text step, so tone and nuance are preserved.
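The same idea can be approximated over the standard chat endpoint. The system prompt below is an illustrative assumption, not an official recipe:

```python
def build_translation_request(user_text: str, target_language: str = "Spanish") -> dict:
    """Build a chat.completions payload that answers in the target language."""
    return {
        "model": "gpt-4o",
        "messages": [
            {"role": "system",
             "content": f"Reply in {target_language}, preserving the speaker's tone."},
            {"role": "user", "content": user_text},
        ],
    }

payload = build_translation_request("Where is the train station?")
```

With the Realtime API, the same instruction goes into the session configuration and the audio response comes back directly.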
### Screen Sharing AI

```python
# Analyze screen content in real time
def analyze_screen(screenshot):
    # screenshot: an image URL or base64 data URI
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Help the user with what's on screen"},
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": screenshot}},
                    {"type": "text", "text": "What should I click next?"}
                ]
            }
        ]
    )
```
### Video Analysis

```python
# Extract frames and analyze them as an image sequence
def analyze_video(video_path, frame_interval=30):
    # extract_frames is a user-supplied helper returning image URLs or data URIs
    frames = extract_frames(video_path, every_n=frame_interval)
    messages = [
        {"role": "system", "content": "Analyze this video sequence"},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What happens in this video?"},
                *[{"type": "image_url", "image_url": {"url": f}} for f in frames]
            ]
        }
    ]
    return client.chat.completions.create(model="gpt-4o", messages=messages)
```
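The frame-extraction helper is left to the reader; its sampling logic reduces to index arithmetic (actual decoding would use a library such as OpenCV). A minimal sketch, with a cap so requests stay within per-call image limits:

```python
def sample_frame_indices(total_frames: int, every_n: int = 30,
                         max_frames: int = 20) -> list[int]:
    """Pick every Nth frame index, thinned evenly if the cap is exceeded."""
    indices = list(range(0, total_frames, every_n))
    if len(indices) > max_frames:
        # Thin evenly rather than truncating, so the whole clip is covered
        step = len(indices) / max_frames
        indices = [indices[int(i * step)] for i in range(max_frames)]
    return indices
```

For a 90-frame clip sampled every 30 frames this yields indices 0, 30, and 60; a long clip is thinned to at most `max_frames` evenly spaced samples.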
## Pricing
| Model | Input (text) | Output (text) | Input (audio) | Output (audio) |
|---|---|---|---|---|
| GPT-4o | $5/1M | $15/1M | $100/1M | $200/1M |
| GPT-4o-mini | $0.15/1M | $0.60/1M | - | - |
Audio is more expensive but replaces Whisper + TTS pipeline.
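A quick sanity check on the table: the cost of a single voice exchange, using the per-million-token rates above (the token counts are made-up example numbers, not measurements):

```python
RATES = {  # USD per 1M tokens, from the pricing table above
    "text_in": 5.00, "text_out": 15.00,
    "audio_in": 100.00, "audio_out": 200.00,
}

def request_cost(tokens: dict) -> float:
    """tokens maps rate keys to token counts, e.g. {'audio_in': 500}."""
    return sum(RATES[k] * n / 1_000_000 for k, n in tokens.items())

# Example: ~500 audio tokens in, ~700 audio tokens out
cost = request_cost({"audio_in": 500, "audio_out": 700})
# 500*100/1e6 + 700*200/1e6 = 0.05 + 0.14 ≈ $0.19 per exchange
```

Whether that beats the old Whisper + GPT-4 + TTS bill depends on volume, but it is one bill instead of three.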
## GPT-4o Mini
The efficient variant:
| Aspect | GPT-4o | GPT-4o Mini |
|---|---|---|
| Quality | Best | Good |
| Speed | Fast | Faster |
| Cost (input) | $5/1M | $0.15/1M |
| Use case | Complex tasks | High volume |
30x cheaper, suitable for many production workloads.
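A common pattern the table suggests: route high-volume, simple calls to the mini model and keep `gpt-4o` for demanding ones. The heuristic below (a length threshold and a caller-supplied flag) is an arbitrary illustration, not a recommended policy:

```python
def pick_model(prompt: str, complex_task: bool = False) -> str:
    """Route to gpt-4o-mini unless the task looks demanding (illustrative heuristic)."""
    if complex_task or len(prompt) > 4000:
        return "gpt-4o"
    return "gpt-4o-mini"
```

In practice the routing signal might be the task type, a classifier, or a first attempt with the cheap model that escalates on failure.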
## What Users Experience

### Voice Mode
- Interrupt mid-sentence
- Laugh or sigh, and the model responds appropriately
- Natural pacing and pauses
- Emotion-aware responses
### Vision Mode
- Describe photos
- Read and understand documents
- Analyze charts and graphs
- UI/UX feedback on screenshots
## Building with GPT-4o

### Best Practices
```python
# 1. Use structured outputs
# (JSON mode requires the word "JSON" to appear somewhere in the messages)
response = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},
    messages=[...]
)

# 2. Batch image analysis
def batch_analyze(images):
    # Include multiple images in one call;
    # more efficient than separate calls
    pass

# 3. Stream for responsiveness
for chunk in client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    stream=True
):
    # delta.content is None on the final chunk
    print(chunk.choices[0].delta.content or "", end="")
```
### Real-time Apps
```javascript
// Browser: MediaRecorder + WebSocket
const stream = await navigator.mediaDevices.getUserMedia({audio: true});
const recorder = new MediaRecorder(stream);

recorder.ondataavailable = async (e) => {
  ws.send(e.data);  // Send to GPT-4o
};

ws.onmessage = (e) => {
  playAudio(e.data);  // Play response
};
```
## Implications

### For Developers
- Voice interfaces become viable
- Real-time multimodal apps possible
- Simpler architecture (one model)
### For Products
- Conversational AI that feels natural
- Accessibility improvements
- New interaction paradigms
### For Users
- Talk to AI like a person
- Share what you see, get help
- Faster, more natural responses
## Limitations
- Audio output not available in all regions
- No real-time video (frame-by-frame works)
- Rate limits on audio endpoints
- Higher latency for audio than text
## Final Thoughts
GPT-4o’s “omni” nature isn’t just a feature; it is a paradigm shift. Multimodal AI that feels native opens new categories of applications.
The 320ms response time for voice makes AI assistants actually usable for conversation.
AI that sees, hears, and responds like a person.