Whisper: OpenAI's Speech Recognition
OpenAI released Whisper—a speech recognition model that’s shockingly good. And unlike most OpenAI releases, it’s open source. Here’s what it can do.
What Is Whisper
Whisper is a general-purpose speech recognition model:
- Multilingual: 99 languages
- Translation: Any language → English
- Transcription: Audio → text
- Open source: MIT licensed weights
Audio input → Whisper → Text transcription
↓
Optional: Translation to English
Why It’s Impressive
Robustness
Whisper handles:
- Background noise
- Accents
- Technical jargon
- Multiple speakers (reasonably well)
- Music with vocals
It was trained on 680,000 hours of audio collected from the web; that scale and diversity is where the robustness comes from.
Out-of-the-Box Quality
No fine-tuning needed for most use cases. It just works.
Installation
pip install openai-whisper
# Or install the latest version straight from GitHub
pip install git+https://github.com/openai/whisper.git
Requires FFmpeg:
# macOS
brew install ffmpeg
# Ubuntu
sudo apt install ffmpeg
Basic Usage
Command Line
# Transcribe
whisper audio.mp3
# Specify model size
whisper audio.mp3 --model medium
# Translate to English
whisper audio.mp3 --task translate
# Specify language (optional, auto-detects)
whisper audio.mp3 --language Japanese
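The CLI also writes transcript files directly; the --output_format flag (available in current openai-whisper releases) picks the format:
# Write an SRT subtitle file to ./transcripts
whisper audio.mp3 --model small --output_format srt --output_dir transcripts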
Python API
import whisper
model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(result["text"])
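Transcription auto-detects the language, and you can run that detection step yourself through the lower-level API; this snippet follows the example in the Whisper README:
import whisper

model = whisper.load_model("base")

# Load the audio and pad/trim it to the 30-second window Whisper expects
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# Compute the log-Mel spectrogram and detect the spoken language
mel = whisper.log_mel_spectrogram(audio).to(model.device)
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")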
With Timestamps
result = model.transcribe("audio.mp3", word_timestamps=True)
for segment in result["segments"]:
print(f"[{segment['start']:.2f} - {segment['end']:.2f}] {segment['text']}")
Output:
[0.00 - 2.50] Hello, welcome to the podcast.
[2.50 - 5.00] Today we're discussing machine learning.
Model Sizes
| Model | Parameters | VRAM | Relative speed | Quality |
|---|---|---|---|---|
| tiny | 39M | ~1GB | ~32× | OK |
| base | 74M | ~1GB | ~16× | Good |
| small | 244M | ~2GB | ~6× | Better |
| medium | 769M | ~5GB | ~2× | Great |
| large | 1550M | ~10GB | 1× | Best |
(Relative speeds are from the Whisper README, measured against large.)
For most use cases, base or small is sufficient.
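If you're unsure which size you need, the cheapest experiment is to time the candidates on one of your own clips; a minimal sketch (sample.mp3 is a placeholder for a representative file):
import time
import whisper

for name in ["tiny", "base", "small"]:
    model = whisper.load_model(name)
    start = time.perf_counter()
    result = model.transcribe("sample.mp3")  # placeholder: use your own audio
    print(f"{name}: {time.perf_counter() - start:.1f}s, first 60 chars: {result['text'][:60]}")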
Practical Applications
Meeting Transcription
import whisper
def transcribe_meeting(audio_path):
    model = whisper.load_model("medium")
    result = model.transcribe(audio_path)
    # Save as SRT subtitles
    with open("meeting.srt", "w") as f:
        for i, seg in enumerate(result["segments"]):
            f.write(f"{i+1}\n")
            f.write(f"{format_time(seg['start'])} --> {format_time(seg['end'])}\n")
            f.write(f"{seg['text'].strip()}\n\n")

def format_time(seconds):
    # SRT timestamps look like 00:01:23,456
    h = int(seconds // 3600)
    m = int((seconds % 3600) // 60)
    s = int(seconds % 60)
    ms = int((seconds % 1) * 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"
Podcast Search
def index_podcast(audio_path):
    model = whisper.load_model("base")
    result = model.transcribe(audio_path)
    # Index segments for search
    indexed = []
    for seg in result["segments"]:
        indexed.append({
            "text": seg["text"],
            "start": seg["start"],
            # embed_text is your own embedding function (e.g. a sentence-transformers model)
            "embedding": embed_text(seg["text"]),
        })
    return indexed
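To actually query that index, here's a minimal cosine-similarity sketch; search_podcast is an illustrative name, and it assumes embed_text returns a 1-D NumPy vector:
import numpy as np

def search_podcast(indexed, query, top_k=3):
    # Rank indexed segments by cosine similarity to the query embedding
    q = embed_text(query)  # assumed: same embedding function used for indexing
    def score(seg):
        e = seg["embedding"]
        return float(np.dot(q, e) / (np.linalg.norm(q) * np.linalg.norm(e)))
    ranked = sorted(indexed, key=score, reverse=True)
    return [(seg["start"], seg["text"]) for seg in ranked[:top_k]]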
Voice Notes to Text
import whisper
from pathlib import Path
def transcribe_voice_notes(directory):
    model = whisper.load_model("small")
    for audio_file in Path(directory).glob("*.m4a"):
        result = model.transcribe(str(audio_file))
        text_file = audio_file.with_suffix(".txt")
        text_file.write_text(result["text"])
        print(f"Transcribed: {audio_file.name}")
Performance Optimization
GPU Acceleration
import whisper
import torch
# Check CUDA availability
device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("medium", device=device)
Faster Whisper (CTranslate2)
pip install faster-whisper
from faster_whisper import WhisperModel
model = WhisperModel("medium", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.mp3")
for segment in segments:
    print(f"[{segment.start:.2f}s] {segment.text}")
Up to 4× faster than the original implementation at the same accuracy, and it uses less memory.
Batch Processing
import concurrent.futures
from pathlib import Path

import whisper

def transcribe_file(file_path):
    # Load inside the function so each worker process gets its own model
    model = whisper.load_model("base")
    return model.transcribe(str(file_path))

files = list(Path("audio/").glob("*.mp3"))
with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(transcribe_file, files))
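Caveat: the function above reloads the model on every call, and on a single GPU, parallel processes also compete for VRAM. In that setting a plain loop over one resident model is usually the better sketch:
import whisper
from pathlib import Path

# Load once and reuse; avoids per-file model loads and VRAM contention
model = whisper.load_model("base")
for file_path in Path("audio/").glob("*.mp3"):
    result = model.transcribe(str(file_path))
    print(f"{file_path.name}: {len(result['text'])} characters")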
Limitations
What It Struggles With
- Very long audio (memory limits)
- Heavily overlapping speakers
- Extremely noisy environments
- Technical domain-specific vocabulary (sometimes)
Speaker Diarization
Whisper doesn't identify who is speaking. For that, combine it with a diarization library such as pyannote.audio:
# pyannote for speaker diarization
# (the pretrained pipeline is gated: accept its terms on Hugging Face and pass your access token)
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")
diarization = pipeline("audio.wav")

for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker} speaks from {turn.start:.1f}s to {turn.end:.1f}s")
Comparison
| Tool | Open Source | Local | Quality | Speed |
|---|---|---|---|---|
| Whisper | Yes | Yes | Excellent | Medium |
| Google Speech-to-Text | No | No | Excellent | Fast |
| AWS Transcribe | No | No | Good | Fast |
| AssemblyAI | No | No | Excellent | Fast |
Whisper wins on: open source, local processing, no API costs.
Final Thoughts
Whisper democratizes high-quality speech recognition. Run it locally: no API calls, no recurring costs beyond the hardware you already own.
For most transcription tasks, it’s good enough—and for some, it’s the best option available.
Install it. Try it. You’ll be surprised.
Speech recognition that just works, running on your own hardware.