Whisper: OpenAI's Speech Recognition
OpenAI released Whisper—a speech recognition model that’s shockingly good. And unlike most OpenAI releases, it’s open source. Here’s what it can do.
What Is Whisper
Whisper is a general-purpose speech recognition model:
- Multilingual: 99 languages
- Translation: Any language → English
- Transcription: Audio → text
- Open source: MIT licensed weights
Audio input → Whisper → Text transcription
↓
Optional: Translation to English
Why It’s Impressive
Robustness
Whisper handles:
- Background noise
- Accents
- Technical jargon
- Multiple speakers (reasonably well)
- Music with vocals
It was trained on 680,000 hours of audio collected from the web; that scale and diversity is where the robustness comes from.
Out-of-the-Box Quality
No fine-tuning needed for most use cases. It just works.
Installation
pip install openai-whisper
# Or install the latest version straight from GitHub
pip install git+https://github.com/openai/whisper.git
Requires FFmpeg:
# macOS
brew install ffmpeg
# Ubuntu
sudo apt install ffmpeg
Basic Usage
Command Line
# Transcribe
whisper audio.mp3
# Specify model size
whisper audio.mp3 --model medium
# Translate to English
whisper audio.mp3 --task translate
# Specify language (optional, auto-detects)
whisper audio.mp3 --language Japanese
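The CLI also writes transcript files directly; the --output_format flag (available in current openai-whisper releases) picks the format:
# Write an SRT subtitle file to ./transcripts
whisper audio.mp3 --model small --output_format srt --output_dir transcripts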
Python API
import whisper
model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(result["text"])
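Transcription auto-detects the language, and you can run that detection step yourself through the lower-level API; this snippet follows the example in the Whisper README:
import whisper

model = whisper.load_model("base")

# Load the audio and pad/trim it to the 30-second window Whisper expects
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# Compute the log-Mel spectrogram and detect the spoken language
mel = whisper.log_mel_spectrogram(audio).to(model.device)
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")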
With Timestamps
result = model.transcribe("audio.mp3", word_timestamps=True)
for segment in result["segments"]:
print(f"[{segment['start']:.2f} - {segment['end']:.2f}] {segment['text']}")
Output:
[0.00 - 2.50] Hello, welcome to the podcast.
[2.50 - 5.00] Today we're discussing machine learning.
Model Sizes
| Model | Parameters | VRAM | Relative speed | Quality |
|---|---|---|---|---|
| tiny | 39M | ~1GB | ~32× | OK |
| base | 74M | ~1GB | ~16× | Good |
| small | 244M | ~2GB | ~6× | Better |
| medium | 769M | ~5GB | ~2× | Great |
| large | 1550M | ~10GB | 1× | Best |
(Relative speeds are from the Whisper README, measured against large.)
For most use cases, base or small is sufficient.
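If you're unsure which size you need, the cheapest experiment is to time the candidates on one of your own clips; a minimal sketch (sample.mp3 is a placeholder for a representative file):
import time
import whisper

for name in ["tiny", "base", "small"]:
    model = whisper.load_model(name)
    start = time.perf_counter()
    result = model.transcribe("sample.mp3")  # placeholder: use your own audio
    print(f"{name}: {time.perf_counter() - start:.1f}s, first 60 chars: {result['text'][:60]}")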
Practical Applications
Meeting Transcription
import whisper
def transcribe_meeting(audio_path):
    model = whisper.load_model("medium")
    result = model.transcribe(audio_path)
    # Save as SRT subtitles
    with open("meeting.srt", "w") as f:
        for i, seg in enumerate(result["segments"]):
            f.write(f"{i+1}\n")
            f.write(f"{format_time(seg['start'])} --> {format_time(seg['end'])}\n")
            f.write(f"{seg['text'].strip()}\n\n")

def format_time(seconds):
    # SRT timestamps look like 00:01:23,456
    h = int(seconds // 3600)
    m = int((seconds % 3600) // 60)
    s = int(seconds % 60)
    ms = int((seconds % 1) * 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"
Podcast Search
def index_podcast(audio_path):
    model = whisper.load_model("base")
    result = model.transcribe(audio_path)
    # Index segments for search
    indexed = []
    for seg in result["segments"]:
        indexed.append({
            "text": seg["text"],
            "start": seg["start"],
            # embed_text is your own embedding function (e.g. a sentence-transformers model)
            "embedding": embed_text(seg["text"]),
        })
    return indexed
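To actually query that index, here's a minimal cosine-similarity sketch; search_podcast is an illustrative name, and it assumes embed_text returns a 1-D NumPy vector:
import numpy as np

def search_podcast(indexed, query, top_k=3):
    # Rank indexed segments by cosine similarity to the query embedding
    q = embed_text(query)  # assumed: same embedding function used for indexing
    def score(seg):
        e = seg["embedding"]
        return float(np.dot(q, e) / (np.linalg.norm(q) * np.linalg.norm(e)))
    ranked = sorted(indexed, key=score, reverse=True)
    return [(seg["start"], seg["text"]) for seg in ranked[:top_k]]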
Voice Notes to Text
import whisper
from pathlib import Path
def transcribe_voice_notes(directory):
    model = whisper.load_model("small")
    for audio_file in Path(directory).glob("*.m4a"):
        result = model.transcribe(str(audio_file))
        text_file = audio_file.with_suffix(".txt")
        text_file.write_text(result["text"])
        print(f"Transcribed: {audio_file.name}")
Performance Optimization
GPU Acceleration
import whisper
import torch
# Check CUDA availability
device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("medium", device=device)
Faster Whisper (CTranslate2)
pip install faster-whisper
from faster_whisper import WhisperModel
model = WhisperModel("medium", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.mp3")
for segment in segments:
    print(f"[{segment.start:.2f}s] {segment.text}")
Up to 4× faster than the original implementation at the same accuracy, and it uses less memory.
Batch Processing
import concurrent.futures
from pathlib import Path

import whisper

def transcribe_file(file_path):
    # Load inside the function so each worker process gets its own model
    model = whisper.load_model("base")
    return model.transcribe(str(file_path))

files = list(Path("audio/").glob("*.mp3"))
with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(transcribe_file, files))
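Caveat: the function above reloads the model on every call, and on a single GPU, parallel processes also compete for VRAM. In that setting a plain loop over one resident model is usually the better sketch:
import whisper
from pathlib import Path

# Load once and reuse; avoids per-file model loads and VRAM contention
model = whisper.load_model("base")
for file_path in Path("audio/").glob("*.mp3"):
    result = model.transcribe(str(file_path))
    print(f"{file_path.name}: {len(result['text'])} characters")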
Limitations
What It Struggles With
- Very long audio (memory limits)
- Heavily overlapping speakers
- Extremely noisy environments
- Technical domain-specific vocabulary (sometimes)
Speaker Diarization
Whisper doesn't identify who is speaking. For that, combine it with a diarization library such as pyannote.audio:
# pyannote for speaker diarization
# (the pretrained pipeline is gated: accept its terms on Hugging Face and pass your access token)
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")
diarization = pipeline("audio.wav")

for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker} speaks from {turn.start:.1f}s to {turn.end:.1f}s")
Comparison
| Tool | Open Source | Local | Quality | Speed |
|---|---|---|---|---|
| Whisper | Yes | Yes | Excellent | Medium |
| Google Speech-to-Text | No | No | Excellent | Fast |
| AWS Transcribe | No | No | Good | Fast |
| AssemblyAI | No | No | Excellent | Fast |
Whisper wins on: open source, local processing, no API costs.
Final Thoughts
Whisper democratizes high-quality speech recognition. Run it locally: no API calls, no recurring costs beyond the hardware you already own.
For most transcription tasks, it’s good enough—and for some, it’s the best option available.
Install it. Try it. You’ll be surprised.
Speech recognition that just works, running on your own hardware.