Whisper: OpenAI's Speech Recognition

ai machine-learning

OpenAI released Whisper—a speech recognition model that’s shockingly good. And unlike most OpenAI releases, it’s open source. Here’s what it can do.

What Is Whisper?

Whisper is a general-purpose speech recognition model:

Audio input → Whisper → Text transcription
                        (optional: translation to English)

Why It’s Impressive

Robustness

Whisper handles:

- Accents and non-native speech
- Background noise
- Technical and domain-specific vocabulary
- Dozens of languages, with optional translation to English

It was trained on 680,000 hours of multilingual, multitask audio collected from the web.

Out-of-the-Box Quality

No fine-tuning needed for most use cases. It just works.

Installation

pip install openai-whisper

# Or install the latest version from GitHub
pip install git+https://github.com/openai/whisper.git

Requires FFmpeg:

# macOS
brew install ffmpeg

# Ubuntu
sudo apt install ffmpeg

Basic Usage

Command Line

# Transcribe
whisper audio.mp3

# Specify model size
whisper audio.mp3 --model medium

# Translate to English
whisper audio.mp3 --task translate

# Specify language (optional, auto-detects)
whisper audio.mp3 --language Japanese

Python API

import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(result["text"])

With Timestamps

# Segment timestamps come back by default; word_timestamps=True adds per-word timing
result = model.transcribe("audio.mp3", word_timestamps=True)

for segment in result["segments"]:
    print(f"[{segment['start']:.2f} - {segment['end']:.2f}] {segment['text']}")

Output:

[0.00 - 2.50] Hello, welcome to the podcast.
[2.50 - 5.00] Today we're discussing machine learning.

Model Sizes

| Model  | Parameters | VRAM  | Speed   | Quality |
|--------|------------|-------|---------|---------|
| tiny   | 39M        | ~1GB  | Fast    | OK      |
| base   | 74M        | ~1GB  | Fast    | Good    |
| small  | 244M       | ~2GB  | Medium  | Better  |
| medium | 769M       | ~5GB  | Slow    | Great   |
| large  | 1550M      | ~10GB | Slowest | Best    |

For most use cases, base or small is sufficient.
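If you want to automate the choice, a rough heuristic based on the VRAM column above works well enough. This is a sketch; `pick_model` is a hypothetical helper whose thresholds simply mirror the table:

```python
def pick_model(vram_gb: float) -> str:
    """Return the largest Whisper model size that roughly fits in the given VRAM."""
    if vram_gb >= 10:
        return "large"
    if vram_gb >= 5:
        return "medium"
    if vram_gb >= 2:
        return "small"
    return "base"

print(pick_model(8))  # medium
```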

Practical Applications

Meeting Transcription

import whisper

def transcribe_meeting(audio_path):
    model = whisper.load_model("medium")
    result = model.transcribe(audio_path)
    
    # Save as SRT subtitles
    with open("meeting.srt", "w") as f:
        for i, seg in enumerate(result["segments"]):
            f.write(f"{i+1}\n")
            f.write(f"{format_time(seg['start'])} --> {format_time(seg['end'])}\n")
            f.write(f"{seg['text'].strip()}\n\n")

def format_time(seconds):
    h = int(seconds // 3600)
    m = int((seconds % 3600) // 60)
    s = int(seconds % 60)
    ms = int((seconds % 1) * 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

Podcast Search

def index_podcast(audio_path):
    model = whisper.load_model("base")
    result = model.transcribe(audio_path)
    
    # Index segments for search
    indexed = []
    for seg in result["segments"]:
        indexed.append({
            "text": seg["text"],
            "start": seg["start"],
            "embedding": embed_text(seg["text"])  # For semantic search
        })
    
    return indexed
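
Even without embeddings, the indexed segments support simple keyword search. A minimal sketch over the structure `index_podcast` returns (`search_segments` is a hypothetical helper):

```python
def search_segments(indexed, query):
    """Return segments whose text contains the query (case-insensitive)."""
    q = query.lower()
    return [seg for seg in indexed if q in seg["text"].lower()]

indexed = [
    {"text": "Welcome to the show", "start": 0.0},
    {"text": "Today we talk about Whisper", "start": 4.2},
]
print(search_segments(indexed, "whisper")[0]["start"])  # 4.2
```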

Voice Notes to Text

import whisper
from pathlib import Path

def transcribe_voice_notes(directory):
    model = whisper.load_model("small")
    
    for audio_file in Path(directory).glob("*.m4a"):
        result = model.transcribe(str(audio_file))
        
        text_file = audio_file.with_suffix(".txt")
        text_file.write_text(result["text"])
        print(f"Transcribed: {audio_file.name}")

Performance Optimization

GPU Acceleration

import whisper
import torch

# Check CUDA availability
device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("medium", device=device)

Faster Whisper (CTranslate2)

pip install faster-whisper

from faster_whisper import WhisperModel

model = WhisperModel("medium", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.mp3")

for segment in segments:
    print(f"[{segment.start:.2f}s] {segment.text}")

Up to roughly 4x faster than the original implementation, with lower memory use.
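Note that faster-whisper yields segments lazily as objects rather than dicts. If downstream code expects the original library's dict format, a small adapter bridges the two; this is a sketch, and the `start`/`end`/`text` attribute names match faster-whisper's Segment objects:

```python
from types import SimpleNamespace

def segments_to_dicts(segments):
    """Convert faster-whisper Segment objects into whisper-style dicts."""
    return [{"start": s.start, "end": s.end, "text": s.text} for s in segments]

# Works on anything with start/end/text attributes:
fake = [SimpleNamespace(start=0.0, end=2.5, text=" Hello there.")]
print(segments_to_dicts(fake))  # [{'start': 0.0, 'end': 2.5, 'text': ' Hello there.'}]
```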

Batch Processing

import concurrent.futures
from pathlib import Path

import whisper

def transcribe_file(file_path):
    model = whisper.load_model("base")
    return model.transcribe(str(file_path))

files = list(Path("audio/").glob("*.mp3"))

with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(transcribe_file, files))

Limitations

What It Struggles With

- Overlapping speakers and crosstalk
- Heavy background music or very noisy audio
- Hallucinated text during long stretches of silence
- Real-time streaming (it processes audio in 30-second windows)

Speaker Diarization

Whisper doesn’t identify WHO is speaking. For that, combine with:

# pyannote for speaker diarization
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")  # gated model: may require a Hugging Face token
diarization = pipeline("audio.wav")

for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker} speaks from {turn.start:.1f}s to {turn.end:.1f}s")
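To put the two together, one simple approach is to assign each Whisper segment to the diarization turn it overlaps most. A minimal sketch, assuming you've converted both outputs into plain start/end records (`assign_speakers` is a hypothetical helper, not part of either library):

```python
def assign_speakers(segments, turns):
    """Label each transcript segment with the speaker whose turn overlaps it most."""
    labeled = []
    for seg in segments:
        best_speaker, best_overlap = None, 0.0
        for turn in turns:
            # Overlap between [seg.start, seg.end] and [turn.start, turn.end]
            overlap = min(seg["end"], turn["end"]) - max(seg["start"], turn["start"])
            if overlap > best_overlap:
                best_speaker, best_overlap = turn["speaker"], overlap
        labeled.append({**seg, "speaker": best_speaker})
    return labeled

segments = [{"start": 0.0, "end": 2.0, "text": "Hi."},
            {"start": 2.0, "end": 5.0, "text": "Hey."}]
turns = [{"start": 0.0, "end": 2.4, "speaker": "SPEAKER_00"},
         {"start": 2.4, "end": 5.0, "speaker": "SPEAKER_01"}]
for seg in assign_speakers(segments, turns):
    print(seg["speaker"], seg["text"])
# SPEAKER_00 Hi.
# SPEAKER_01 Hey.
```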

Comparison

| Tool           | Open Source | Local | Quality   | Speed  |
|----------------|-------------|-------|-----------|--------|
| Whisper        | Yes         | Yes   | Excellent | Medium |
| Google Speech  | No          | No    | Excellent | Fast   |
| AWS Transcribe | No          | No    | Good      | Fast   |
| AssemblyAI     | No          | No    | Excellent | Fast   |

Whisper wins on: open source, local processing, no API costs.

Final Thoughts

Whisper democratizes high-quality speech recognition. Run it locally, no API calls, no costs after hardware.

For most transcription tasks, it’s good enough—and for some, it’s the best option available.

Install it. Try it. You’ll be surprised.


Speech recognition that just works, running on your own hardware.
