# GPT-4o: Omni-Modal Capabilities
GPT-4o (“o” for “omni”) was released in May 2024. It processes text, audio, and images natively in a single model, rather than as separate models stitched together. The result is faster, more natural interaction.
## What’s Different

**Before GPT-4o:**

```
Audio → Whisper → Text → GPT-4 → Text → TTS → Audio
```

Three models, significant latency.

**With GPT-4o:**

```
Audio → GPT-4o → Audio
```

One model, real-time.
Native multimodality means the model understands tone, emotion, and context directly.
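For contrast, the old three-hop pipeline can be sketched with the OpenAI Python SDK. This is an illustrative sketch assuming the classic `whisper-1` and `tts-1` endpoints, not production code:

```python
def legacy_voice_pipeline(audio_path: str) -> bytes:
    """Pre-4o voice flow: three separate models, each hop adding latency."""
    from openai import OpenAI  # requires the openai package and an API key
    client = OpenAI()
    # 1. Audio -> text (Whisper)
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)
    # 2. Text -> text (GPT-4)
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": transcript.text}],
    )
    # 3. Text -> audio (TTS); tone and emotion were already lost at step 1
    speech = client.audio.speech.create(
        model="tts-1",
        voice="alloy",
        input=reply.choices[0].message.content,
    )
    return speech.content
```

Each hop is a separate network round trip, which is where the multi-second latency comes from.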
## Capabilities
| Modality | Input | Output |
|---|---|---|
| Text | ✅ | ✅ |
| Images | ✅ | ❌ (description only) |
| Audio | ✅ | ✅ |
| Video | ✅ (frames) | ❌ |
## Speed Improvements
| Task | GPT-4 Turbo | GPT-4o |
|---|---|---|
| Text response | ~5s first token | ~320ms |
| Voice to voice | ~2-3s | ~320ms |
| Image understanding | ~3s | ~1s |
Conversational AI became actually conversational.
## API Usage

### Text (Same as Before)

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Explain quantum computing"}
    ]
)
```
### Vision

```python
import base64

def encode_image(path):
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{encode_image('photo.jpg')}"
                    }
                }
            ]
        }
    ]
)
```
### Audio (Real-time API)

```python
# Real-time audio requires a WebSocket connection
import asyncio
import json

import websockets

async def realtime_conversation():
    async with websockets.connect(
        # Model name and beta header as of the initial Realtime API release
        "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview",
        extra_headers={
            "Authorization": f"Bearer {api_key}",
            "OpenAI-Beta": "realtime=v1",
        },
    ) as ws:
        # Audio is exchanged as JSON events carrying base64-encoded chunks
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": audio_chunk_b64,
        }))
        # Responses arrive as JSON events too
        response = json.loads(await ws.recv())
```
## Practical Applications
### Voice Assistants

```python
# Natural conversation with interruption handling
async def voice_assistant():
    while True:
        # GPT-4o handles:
        # - User interruptions
        # - Emotion detection
        # - Natural pauses
        # - Tone matching
        pass
```
### Real-time Translation

```
User speaks English → GPT-4o → Responds in Spanish (audio)
```

No intermediate text step, so tone and nuance are preserved.
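The same idea can be approximated over the standard chat endpoint. The system prompt below is an illustrative assumption, not an official recipe:

```python
def build_translation_request(user_text: str, target_language: str = "Spanish") -> dict:
    """Build a chat.completions payload that answers in the target language."""
    return {
        "model": "gpt-4o",
        "messages": [
            {"role": "system",
             "content": f"Reply in {target_language}, preserving the speaker's tone."},
            {"role": "user", "content": user_text},
        ],
    }

payload = build_translation_request("Where is the train station?")
```

With the Realtime API, the same instruction goes into the session configuration and the audio response comes back directly.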
### Screen Sharing AI

```python
# Analyze screen content in real time
def analyze_screen(screenshot):
    # screenshot: an image URL or base64 data URI
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Help the user with what's on screen"},
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": screenshot}},
                    {"type": "text", "text": "What should I click next?"}
                ]
            }
        ]
    )
```
### Video Analysis

```python
# Extract frames and analyze them as an image sequence
def analyze_video(video_path, frame_interval=30):
    # extract_frames is a user-supplied helper returning image URLs or data URIs
    frames = extract_frames(video_path, every_n=frame_interval)
    messages = [
        {"role": "system", "content": "Analyze this video sequence"},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What happens in this video?"},
                *[{"type": "image_url", "image_url": {"url": f}} for f in frames]
            ]
        }
    ]
    return client.chat.completions.create(model="gpt-4o", messages=messages)
```
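The frame-extraction helper is left to the reader; its sampling logic reduces to index arithmetic (actual decoding would use a library such as OpenCV). A minimal sketch, with a cap so requests stay within per-call image limits:

```python
def sample_frame_indices(total_frames: int, every_n: int = 30,
                         max_frames: int = 20) -> list[int]:
    """Pick every Nth frame index, thinned evenly if the cap is exceeded."""
    indices = list(range(0, total_frames, every_n))
    if len(indices) > max_frames:
        # Thin evenly rather than truncating, so the whole clip is covered
        step = len(indices) / max_frames
        indices = [indices[int(i * step)] for i in range(max_frames)]
    return indices
```

For a 90-frame clip sampled every 30 frames this yields indices 0, 30, and 60; a long clip is thinned to at most `max_frames` evenly spaced samples.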
## Pricing
| Model | Input (text) | Output (text) | Input (audio) | Output (audio) |
|---|---|---|---|---|
| GPT-4o | $5/1M | $15/1M | $100/1M | $200/1M |
| GPT-4o-mini | $0.15/1M | $0.60/1M | - | - |
Audio is more expensive but replaces Whisper + TTS pipeline.
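A quick sanity check on the table: the cost of a single voice exchange, using the per-million-token rates above (the token counts are made-up example numbers, not measurements):

```python
RATES = {  # USD per 1M tokens, from the pricing table above
    "text_in": 5.00, "text_out": 15.00,
    "audio_in": 100.00, "audio_out": 200.00,
}

def request_cost(tokens: dict) -> float:
    """tokens maps rate keys to token counts, e.g. {'audio_in': 500}."""
    return sum(RATES[k] * n / 1_000_000 for k, n in tokens.items())

# Example: ~500 audio tokens in, ~700 audio tokens out
cost = request_cost({"audio_in": 500, "audio_out": 700})
# 500*100/1e6 + 700*200/1e6 = 0.05 + 0.14 ≈ $0.19 per exchange
```

Whether that beats the old Whisper + GPT-4 + TTS bill depends on volume, but it is one bill instead of three.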
## GPT-4o Mini
The efficient variant:
| Aspect | GPT-4o | GPT-4o Mini |
|---|---|---|
| Quality | Best | Good |
| Speed | Fast | Faster |
| Cost (input) | $5/1M | $0.15/1M |
| Use case | Complex tasks | High volume |
30x cheaper, suitable for many production workloads.
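A common pattern the table suggests: route high-volume, simple calls to the mini model and keep `gpt-4o` for demanding ones. The heuristic below (a length threshold and a caller-supplied flag) is an arbitrary illustration, not a recommended policy:

```python
def pick_model(prompt: str, complex_task: bool = False) -> str:
    """Route to gpt-4o-mini unless the task looks demanding (illustrative heuristic)."""
    if complex_task or len(prompt) > 4000:
        return "gpt-4o"
    return "gpt-4o-mini"
```

In practice the routing signal might be the task type, a classifier, or a first attempt with the cheap model that escalates on failure.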
## What Users Experience

### Voice Mode
- Interrupt mid-sentence
- Laugh or sigh, and the model responds appropriately
- Natural pacing and pauses
- Emotion-aware responses
### Vision Mode
- Describe photos
- Read and understand documents
- Analyze charts and graphs
- UI/UX feedback on screenshots
## Building with GPT-4o

### Best Practices
```python
# 1. Use structured outputs
# (JSON mode requires the word "JSON" to appear somewhere in the messages)
response = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},
    messages=[...]
)

# 2. Batch image analysis
def batch_analyze(images):
    # Include multiple images in one call;
    # more efficient than separate calls
    pass

# 3. Stream for responsiveness
for chunk in client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    stream=True
):
    # delta.content is None on the final chunk
    print(chunk.choices[0].delta.content or "", end="")
```
### Real-time Apps
```javascript
// Browser: MediaRecorder + WebSocket
const stream = await navigator.mediaDevices.getUserMedia({audio: true});
const recorder = new MediaRecorder(stream);

recorder.ondataavailable = async (e) => {
  ws.send(e.data);  // Send to GPT-4o
};

ws.onmessage = (e) => {
  playAudio(e.data);  // Play response
};
```
## Implications

### For Developers
- Voice interfaces become viable
- Real-time multimodal apps possible
- Simpler architecture (one model)
### For Products
- Conversational AI that feels natural
- Accessibility improvements
- New interaction paradigms
### For Users
- Talk to AI like a person
- Share what you see, get help
- Faster, more natural responses
## Limitations
- Audio output not available in all regions
- No real-time video (frame-by-frame works)
- Rate limits on audio endpoints
- Higher latency for audio than text
## Final Thoughts
GPT-4o’s “omni” nature isn’t just a feature; it is a paradigm shift. Multimodal AI that feels native opens new categories of applications.
The 320ms response time for voice makes AI assistants actually usable for conversation.
AI that sees, hears, and responds like a person.