GPT-4o: Omni-Modal Capabilities

GPT-4o (“o” for “omni”) was released in May 2024. It processes text, audio, and images natively—not as separate models stitched together. The result: faster, more natural interaction.

What’s Different

Before GPT-4o

Audio → Whisper → Text → GPT-4 → Text → TTS → Audio

3 models, significant latency
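
In code, that pipeline meant one network round trip per model. A minimal sketch using the standard OpenAI SDK (file names and the tts-1 voice are placeholders):

from openai import OpenAI

client = OpenAI()

# 1. Audio -> text (Whisper)
with open("input.wav", "rb") as f:
    heard = client.audio.transcriptions.create(model="whisper-1", file=f).text

# 2. Text -> text (GPT-4)
reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": heard}],
).choices[0].message.content

# 3. Text -> audio (TTS)
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply)
speech.write_to_file("reply.mp3")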

GPT-4o

Audio → GPT-4o → Audio

1 model, real-time

Native multimodal means the model understands tone, emotion, and context directly.

Capabilities

| Modality | Input | Output |
|----------|-------|--------|
| Text     | ✅    | ✅     |
| Images   | ✅    | ❌ (description only) |
| Audio    | ✅    | ✅     |
| Video    | ✅ (frames) | ❌ |

Speed Improvements

| Task | GPT-4 Turbo | GPT-4o |
|------|-------------|--------|
| Text response | ~5s first token | ~320ms |
| Voice to voice | ~2-3s | ~320ms |
| Image understanding | ~3s | ~1s |

Conversational AI became actually conversational.

API Usage

Text (Same as Before)

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Explain quantum computing"}
    ]
)
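
The reply text comes back on the first choice:

print(response.choices[0].message.content)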

Vision

import base64

def encode_image(path):
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{encode_image('photo.jpg')}"
                    }
                }
            ]
        }
    ]
)

Audio (Realtime API)

# Real-time audio runs over a WebSocket connection
import asyncio
import json
import os

import websockets

async def realtime_conversation():
    api_key = os.environ["OPENAI_API_KEY"]
    async with websockets.connect(
        "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview",
        extra_headers={
            "Authorization": f"Bearer {api_key}",
            "OpenAI-Beta": "realtime=v1",
        },
    ) as ws:
        # Audio goes up as JSON events carrying base64-encoded chunks
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": audio_chunk,  # base64-encoded PCM16 audio (placeholder)
        }))

        # Responses arrive as JSON events (audio deltas, transcripts, ...)
        event = json.loads(await ws.recv())

Practical Applications

Voice Assistants

A natural conversation loop where GPT-4o handles, end to end:

- User interruptions
- Emotion detection
- Natural pauses
- Tone matching

Real-time Translation

User speaks English → GPT-4o → Responds in Spanish (audio)

No intermediate text step, preserves tone and nuance.
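
Over the Realtime API this is just a session instruction. A sketch assuming the session.update event shape, sent on the WebSocket (ws) from the audio example above (the instruction text is illustrative):

# Configure the live session to translate, inside the async session
# from the Realtime example; json is imported there
await ws.send(json.dumps({
    "type": "session.update",
    "session": {
        "instructions": "The user speaks English. Always reply in Spanish, "
                        "matching their tone.",
    },
}))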

Screen Sharing AI

# Analyze screen content in real-time
def analyze_screen(screenshot):
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Help user with what's on screen"},
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": screenshot}},
                    {"type": "text", "text": "What should I click next?"}
                ]
            }
        ]
    )
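
For a real capture the screenshot needs to be a data URL; reusing the encode_image helper from the vision example (the file name is illustrative):

result = analyze_screen(f"data:image/png;base64,{encode_image('screen.png')}")
print(result.choices[0].message.content)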

Video Analysis

# Extract frames and analyze
def analyze_video(video_path, frame_interval=30):
    frames = extract_frames(video_path, every_n=frame_interval)
    
    messages = [
        {"role": "system", "content": "Analyze this video sequence"},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What happens in this video?"},
                *[{"type": "image_url", "image_url": {"url": f}} for f in frames]
            ]
        }
    ]
    
    return client.chat.completions.create(model="gpt-4o", messages=messages)
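
extract_frames is left undefined above. One possible implementation uses OpenCV (the cv2 dependency is an assumption; any frame grabber works), returning data URLs so frames slot directly into image_url:

import base64
import cv2  # pip install opencv-python

def extract_frames(video_path, every_n=30):
    """Return every Nth frame as a base64 data URL."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    i = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % every_n == 0:
            ok, buf = cv2.imencode(".jpg", frame)
            if ok:
                frames.append(
                    "data:image/jpeg;base64,"
                    + base64.b64encode(buf).decode()
                )
        i += 1
    cap.release()
    return frames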

Pricing

| Model | Input (text) | Output (text) | Input (audio) | Output (audio) |
|-------|--------------|---------------|---------------|----------------|
| GPT-4o | $5/1M | $15/1M | $100/1M | $200/1M |
| GPT-4o-mini | $0.15/1M | $0.60/1M | - | - |

Audio tokens are more expensive, but a single call replaces the entire Whisper + GPT-4 + TTS pipeline.
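
Each response also includes a usage object, so a rough per-call cost check is easy (prices hard-coded from the table above, text-only):

# Approximate text-only cost of a single call, in dollars
def call_cost(response, in_price=5.00, out_price=15.00):
    u = response.usage  # prompt_tokens / completion_tokens per call
    return (u.prompt_tokens * in_price + u.completion_tokens * out_price) / 1e6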

GPT-4o Mini

The efficient variant:

| Aspect | GPT-4o | GPT-4o Mini |
|--------|--------|-------------|
| Quality | Best | Good |
| Speed | Fast | Faster |
| Cost | $5/1M | $0.15/1M |
| Use case | Complex tasks | High volume |

Roughly 33x cheaper on input tokens, and suitable for many production workloads.
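
A pattern this enables is routing: send high-volume, simple requests to mini and escalate only when a task needs the full model. A sketch, where the complex_task flag stands in for whatever heuristic fits your workload:

def ask(prompt, complex_task=False):
    # Route complex tasks to gpt-4o, everything else to the cheaper model
    model = "gpt-4o" if complex_task else "gpt-4o-mini"
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content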

What Users Experience

Voice Mode

Speak naturally and get sub-second spoken replies; interruptions, pauses, and tone are handled mid-conversation.

Vision Mode

Point a camera or share a screen and ask questions about what the model sees, live.

Building with GPT-4o

Best Practices

# 1. Use structured outputs (JSON mode requires the word "JSON"
#    somewhere in the messages)
response = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},
    messages=[...]
)

# 2. Batch image analysis: several images in one call is more
#    efficient than one call per image
def batch_analyze(images):  # images: list of data URLs or https URLs
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe each image."},
                *[{"type": "image_url", "image_url": {"url": u}} for u in images],
            ],
        }],
    )

# 3. Stream for responsiveness
for chunk in client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    stream=True
):
    # delta.content is None on the final chunk
    print(chunk.choices[0].delta.content or "", end="")

Real-time Apps

// Browser: MediaRecorder + WebSocket (ws is assumed to be an open
// WebSocket to the Realtime API, e.g. via a relay that adds auth)
const stream = await navigator.mediaDevices.getUserMedia({audio: true});
const recorder = new MediaRecorder(stream);

recorder.ondataavailable = async (e) => {
    ws.send(e.data);  // Send audio chunk to GPT-4o
};

recorder.start(250);  // Emit a chunk every 250ms so data flows continuously

ws.onmessage = (e) => {
    playAudio(e.data);  // Play response audio (playAudio: your audio sink)
};

Implications

For Developers

One model and one API call replace the Whisper + GPT-4 + TTS stack, cutting both latency and integration surface.

For Products

Sub-second voice and vision make genuinely conversational interfaces practical for the first time.

For Users

Assistants that listen, look, and talk back, handling interruptions and tone the way a person would.

Limitations

- Image output is not supported; the model can only describe images
- Video is handled as sampled frames, not native video input
- Audio tokens cost far more than text tokens ($100/1M input vs $5/1M)

Final Thoughts

GPT-4o’s “omni” nature isn’t just a feature—it’s a paradigm shift. Multimodal AI that feels native opens new categories of applications.

The 320ms response time for voice makes AI assistants actually usable for conversation.


AI that sees, hears, and responds like a person.
